Table of Contents

cs.CL

[1] ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation cs.CL | cs.AI

Hyeong Kyu Choi, Sharon Li

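TL;DR: This paper proposes ModeX, a Best-of-N selection framework for open-ended text generation that needs no external evaluator. It builds a similarity graph over the generated texts and applies spectral clustering to identify the modal output that represents the semantic consensus, thereby selecting a high-quality result from multiple stochastic generations. The paper also proposes an efficient variant, ModeX-Lite. On text summarization, code generation, and mathematical reasoning, the method outperforms single-path and multi-path baselines.

Details

Motivation: Selecting a single high-quality output from multiple stochastic generations is hard in open-ended tasks. Existing methods rely on external evaluators, reward models, or exact string-match voting, which limits their applicability and efficiency.

Result: On open-ended tasks including text summarization, code generation, and mathematical reasoning, ModeX and ModeX-Lite consistently outperform standard single-path and multi-path baselines while remaining computationally efficient.

Insight: The core contribution is generalizing majority voting to open-ended text generation: graph similarity and spectral clustering identify the semantic consensus without additional inference or auxiliary models. ModeX-Lite adds early pruning for further efficiency. The selection strategy is unsupervised and model-agnostic.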

Abstract: Selecting a single high-quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open-ended tasks where no canonical answer exists. While Best-of-N and self-consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string-match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator-free Best-of-N selection framework that generalizes majority voting to open-ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX-Lite, an improved version of ModeX with early pruning for efficiency. Across open-ended tasks – including text summarization, code generation, and mathematical reasoning – our approaches consistently outperform standard single- and multi-path baselines, providing a computationally efficient solution for robust open-ended text generation. Code is released at https://github.com/deeplearning-wisc/ModeX.
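
As a rough illustration of the selection idea in the abstract above, the sketch below builds a similarity graph over candidates, splits it once using the second eigenvector of the graph Laplacian, and returns the most central member of the larger cluster. Token-level Jaccard similarity, the single (non-recursive) split, and the names `jaccard` and `select_modal` are assumptions made for illustration; the paper's actual similarity measure and recursion are not specified here and should be taken from the released code.

```python
import numpy as np

def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for the paper's similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def select_modal(candidates: list[str]) -> str:
    """Pick a representative of the dominant cluster (one spectral split)."""
    n = len(candidates)
    if n <= 2:
        return candidates[0]
    # Similarity graph: symmetric weight matrix with a zero diagonal.
    W = np.array([[jaccard(a, b) for b in candidates] for a in candidates])
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    side = vecs[:, 1] >= 0           # sign of the second eigenvector
    if side.sum() < n - side.sum():  # keep the larger side
        side = ~side
    idx = np.flatnonzero(side)
    # Medoid: the member with the highest total similarity inside the cluster.
    scores = W[np.ix_(idx, idx)].sum(axis=1)
    return candidates[idx[int(np.argmax(scores))]]
```

With three paraphrases of one sentence and two unrelated outliers, the function returns one of the paraphrases, which is the behaviour majority voting would give if the paraphrases were string-identical.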


[2] LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference cs.CL

Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon

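TL;DR: LoRA-Drop is a plug-and-play LLM inference acceleration framework that reduces decoding cost by applying a temporal compute schedule to a subset of intermediate layers: on most decoding steps, the selected layers reuse the previous token's hidden state and apply a low-rank LoRA correction, while periodic refresh steps run the full model to prevent drift. The method needs no routing network, is compatible with standard KV caching, and can shrink the KV-cache footprint by skipping KV updates in droppable layers.

Details

Motivation: Autoregressive LLMs are bottlenecked by sequential decoding, where each new token typically requires executing every transformer layer. Existing dynamic-depth and layer-skipping methods lower this cost but often depend on auxiliary routing mechanisms or lose accuracy when skipped layers are left uncompensated.

Result: On LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and a 45–55% KV-cache reduction while staying within 0.5 percentage points of baseline accuracy. On reasoning, code generation, and long-context/multilingual benchmarks, the evaluation identifies a safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains.

Insight: The contribution is a temporal scheduling mechanism that needs no routing network. Combining LoRA corrections with periodic refreshes, it substantially improves inference efficiency and reduces memory use while preserving accuracy, offering a simple path toward adaptive-capacity inference in LLMs.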

Abstract: Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and 45–55% KV-cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Code is available at https://github.com/hosseinbv/LoRA-Drop.git.
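
The temporal schedule described in the abstract above can be pictured with a toy sketch. Everything here is an assumption for illustration: the names `plan_steps` and `droppable_layer`, the fixed refresh period, and the choice to add the low-rank term `B @ (A @ x)` to the reused output. The abstract does not say exactly where the correction is applied, so this is not the authors' implementation.

```python
import numpy as np

def plan_steps(num_steps: int, refresh_every: int) -> list[str]:
    """Label each decoding step as a full 'refresh' or a cheap 'lora' step."""
    return ["refresh" if t % refresh_every == 0 else "lora"
            for t in range(num_steps)]

def droppable_layer(x, prev_out, full_fn, A, B, mode):
    """Toy droppable layer.

    On refresh steps (or when nothing is cached) run the real layer.
    On LoRA steps reuse the previous token's output and add a
    low-rank correction computed from the current input (assumed form).
    """
    if mode == "refresh" or prev_out is None:
        return full_fn(x)
    return prev_out + B @ (A @ x)
```

With `refresh_every=3`, one step in three pays the full cost of the droppable layers; the remaining steps pay only for two small matrix products per layer.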


[3] Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency cs.CL | cs.AI

Haoran Wang, Maryam Khalid, Qiong Wu, Jian Gao, Cheng Cao

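TL;DR: This paper proposes Probabilistic Certainty and Consistency (PCC), a framework for improving LLM fact-checking. PCC estimates the model's confidence in a factual claim by jointly modeling its probabilistic certainty and reasoning consistency, then adaptively decides whether to answer directly, trigger targeted retrieval, or run a deep search, which reduces irrelevant noise while improving efficiency and reliability.

Details

Motivation: Existing fact-checking methods typically retrieve external evidence indiscriminately, overlooking the model's internal knowledge and potentially introducing irrelevant noise, and they lack targeted mechanisms for resolving specific uncertainties in the model's reasoning. Inspired by how humans fact-check, the paper aims to let an LLM decide adaptively, based on its confidence in a given claim, whether to rely on internal knowledge or initiate retrieval.

Result: Extensive experiments on three challenging benchmarks show that PCC achieves better uncertainty quantification than verbalized confidence and consistently outperforms strong LLM-based fact-checking baselines.

Insight: The core contribution is a confidence estimation framework that jointly models probabilistic certainty and reasoning consistency, together with a confidence-guided adaptive verification strategy (a routing mechanism) built on it. The model can then draw on internal knowledge and external retrieval on demand rather than retrieving blindly, striking a better balance between efficiency and accuracy. The framework is also shown to generalize well across LLMs.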

Abstract: Large language models (LLMs) are increasingly used in applications requiring factual accuracy, yet their outputs often contain hallucinated responses. While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately, overlooking the model’s internal knowledge and potentially introducing irrelevant noise. Moreover, current systems lack targeted mechanisms to resolve specific uncertainties in the model’s reasoning. Inspired by how humans fact-check, we argue that LLMs should adaptively decide whether to rely on internal knowledge or initiate retrieval based on their confidence in a given claim. We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence by jointly modeling an LLM’s probabilistic certainty and reasoning consistency. These confidence signals enable an adaptive verification strategy: the model answers directly when confident, triggers targeted retrieval when uncertain or inconsistent, and escalates to deep search when ambiguity is high. Our confidence-guided routing mechanism ensures that retrieval is invoked only when necessary, improving both efficiency and reliability. Extensive experiments across three challenging benchmarks show that PCC achieves better uncertainty quantification than verbalized confidence and consistently outperforms strong LLM-based fact-checking baselines. Furthermore, we demonstrate that PCC generalizes well across various LLMs.
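
The three-way routing described in the abstract above might look like the following minimal sketch. The product used to combine the two signals, the thresholds `low` and `high`, and the function name `route` are all hypothetical; the abstract states only that certainty and consistency are modeled jointly and that the model escalates as confidence drops.

```python
def route(certainty: float, consistency: float,
          low: float = 0.4, high: float = 0.8) -> str:
    """Choose a verification action from two confidence signals in [0, 1]."""
    confidence = certainty * consistency  # assumed combination rule
    if confidence >= high:
        return "answer_directly"
    if confidence >= low:
        return "targeted_retrieval"
    return "deep_search"
```

The point of the design is that retrieval cost is paid only for claims where the model's own signals disagree or are weak.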


[4] FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions cs.CL | cs.AI

Kris W Pan, Yongmin Yoo

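TL;DR: This paper proposes FlowPlan-G2P, a framework that transforms scientific papers into patent descriptions in three stages (concept graph induction, paragraph and section planning, and graph-conditioned generation) that mirror the cognitive workflow of expert drafters, improving logical coherence and legal compliance.

Details

Motivation: Transforming scientific papers into patent descriptions is challenging because of differing rhetorical styles and strict legal requirements; black-box text-to-text approaches struggle to model structural reasoning and legal constraints.

Result: Experiments show that FlowPlan-G2P significantly outperforms end-to-end LLM baselines in logical coherence and legal compliance, establishing a new paradigm for paper-to-patent generation.

Insight: The contribution is to recast the task as a structured reasoning workflow in which concept graph induction and planning align content with patent sections, enabling domain-specific structured text generation. Viewed objectively, the staged approach is an effective way to integrate domain knowledge and improve generation quality.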

Abstract: Over 3.5 million patents are filed annually, with drafting patent descriptions requiring deep technical and legal expertise. Transforming scientific papers into patent descriptions is particularly challenging due to their differing rhetorical styles and stringent legal requirements. Unlike black-box text-to-text approaches that struggle to model structural reasoning and legal constraints, we propose FlowPlan-G2P, a novel framework that mirrors the cognitive workflow of expert drafters by reformulating this task into three stages: (1) Concept Graph Induction, extracting technical entities and relationships into a directed graph via expert-like reasoning; (2) Paragraph and Section Planning, reorganizing the graph into coherent clusters aligned with canonical patent sections; and (3) Graph-Conditioned Generation, producing legally compliant paragraphs using section-specific subgraphs and tailored prompts. Experiments demonstrate that FlowPlan-G2P significantly improves logical coherence and legal compliance over end-to-end LLM baselines. Our framework establishes a new paradigm for paper-to-patent generation and advances structured text generation for specialized domains.


[5] Scalable Construction of a Lung Cancer Knowledge Base: Profiling Semantic Reasoning in LLMs cs.CL

Cesar Felipe Martínez Cisneros, Jesús Ulises Quiroz Bautista, Claudia Anahí Guzmán Solano, Bogdan Kaleb García Rivera, Iván García Pacheco

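TL;DR: This study presents a pipeline for building a lung cancer knowledge base with Open Information Extraction (OpenIE). Through medical concept identification, literature filtering, triplet extraction, and entity recognition, it produces a domain-specific resource for fine-tuning LLMs. Evaluation shows that T5 models fine-tuned on the knowledge base improve significantly in both semantic coherence and performance.

Details

Motivation: In biomedicine, and oncology in particular, LLM performance depends heavily on the semantic quality of the training data, and structured knowledge bases are essential for precise, interpretable domain-specific reasoning.

Result: On ROUGE and BERTScore, T5 models fine-tuned on the knowledge base show significantly improved performance and semantic coherence, demonstrating the potential of OpenIE-derived resources as a scalable, low-cost solution.

Insight: The contribution is combining OpenIE with named entity recognition to automatically build a large-scale, noise-aware lung cancer knowledge base from open-access PubMed literature, providing an efficient fine-tuning resource for biomedical NLP tasks.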

Abstract: The integration of Large Language Models (LLMs) into biomedical research offers new opportunities for domain-specific reasoning and knowledge representation. However, their performance depends heavily on the semantic quality of training data. In oncology, where precision and interpretability are vital, scalable methods for constructing structured knowledge bases are essential for effective fine-tuning. This study presents a pipeline for developing a lung cancer knowledge base using Open Information Extraction (OpenIE). The process includes: (1) identifying medical concepts with the MeSH thesaurus; (2) filtering open-access PubMed literature with permissive licenses (CC0); (3) extracting (subject, relation, object) triplets using an OpenIE method; and (4) enriching triplet sets with Named Entity Recognition (NER) to ensure biomedical relevance. The resulting triplet sets provide a domain-specific, large-scale, and noise-aware resource for fine-tuning LLMs. We evaluated T5 models fine-tuned on this dataset through Supervised Semantic Fine-Tuning. Comparative assessments with ROUGE and BERTScore show significantly improved performance and semantic coherence, demonstrating the potential of OpenIE-derived resources as scalable, low-cost solutions for enhancing biomedical NLP.


[6] When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark cs.CL | cs.AI

Subha Ghoshal, Ali Al-Bustami

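TL;DR: Using two real-world tasks (Event-QA, event-centric question answering over graph-structured knowledge, and CMV, persuasive response generation on Reddit ChangeMyView), this paper systematically evaluates how inference-time planning and external tools (such as SPARQL queries, retrieval, and web search) affect LLM performance, latency, and cost. On Event-QA, tool augmentation substantially improves accuracy but greatly increases latency; on CMV, one-shot prompting works best, and complex tool orchestration can introduce failure modes.

Details

Motivation: Modern LLMs increasingly rely on inference-time planning and external tools to improve reasoning, but the effectiveness, latency, and cost-efficiency of these methods across tasks remain unclear, which calls for task-specific, cost-aware evaluation.

Result: On Event-QA, the best tool-augmented configuration raises GPT-4o accuracy from 47.5% to 67.5%, but latency grows from about 8 seconds to about 317 seconds. On CMV, one-shot prompting is strongest (for example, GPT-4o-mini reaches 75% accuracy at about 6 seconds of latency), while planning plus search adds substantial latency without consistent gains.

Insight: The contribution is a benchmark that jointly considers accuracy, end-to-end latency, and token cost, showing that the value of tools and planning depends heavily on the task. Viewed objectively, it offers a quantitative trade-off framework and underlines that model size and agent/tool complexity should be chosen per task and per cost budget.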

Abstract: Modern large language models (LLMs) increasingly rely on inference-time planning and external tools to improve reasoning. We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV). Using LangChain and LangGraph, we compare a one-shot baseline against a plan-execute-replan agent equipped with task-specific tools (DBpedia SPARQL/lookup/schema exploration, Wikipedia-focused retrieval, and topical web search). We evaluate on 60 examples each from Event-QA and CMV (3 splits of 20), and report both mean end-to-end latency and per-example token cost estimates. We evaluate GPT-4o and GPT-4o-mini under identical workflows and report accuracy and end-to-end latency. On Event-QA, the best tool-augmented configuration improves accuracy (e.g., 47.5% → 67.5% for GPT-4o) while increasing latency by orders of magnitude (~8s → ~317s per example). On CMV, one-shot prompting is strongest (e.g., GPT-4o-mini achieves 75% at ~6s), and planning+search increases latency substantially without consistent gains. However, complex multi-tool orchestration exposes failure modes where the smaller model degrades. Overall, the findings highlight the need for task-specific, cost-aware choices of both model size and agent/tooling complexity.


[7] Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking cs.CL

Hongzhan Lin, Zixin Chen, Zhiqi Shen, Ziyang Luo, Zhen Ye

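TL;DR: This paper proposes FactArena, a fully automated arena-style evaluation framework for comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. It combines standardized claim decomposition, tool-augmented evidence retrieval, and justification-based verdict prediction with an arena-style judgment mechanism and an adaptive claim-evolution module. Evaluating 16 state-of-the-art LLMs reveals significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.

Details

Motivation: Existing evaluations focus mainly on claim verification and overlook the broader fact-checking workflow (such as claim extraction and evidence retrieval), which prevents them from exposing LLMs' systematic reasoning failures, factual blind spots, and robustness limitations.

Result: Across 16 state-of-the-art LLMs spanning seven model families, FactArena produces stable and interpretable rankings. The analysis reveals significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.

Insight: The contribution is a stage-wise, automated, arena-style evaluation framework covering the complete fact-checking pipeline. With a standardized process, an unbiased pairwise comparison mechanism, and a module that adaptively generates more challenging claims, it offers a scalable and trustworthy paradigm for diagnosing LLMs' factual reasoning.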

Abstract: Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems, yet existing evaluations focus predominantly on claim verification and overlook the broader fact-checking workflow, including claim extraction and evidence retrieval. This narrow focus prevents current benchmarks from revealing systematic reasoning failures, factual blind spots, and robustness limitations of modern LLMs. To bridge this gap, we present FactArena, a fully automated arena-style evaluation framework that conducts comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. FactArena integrates three key components: (i) an LLM-driven fact-checking process that standardizes claim decomposition, evidence retrieval via tool-augmented interactions, and justification-based verdict prediction; (ii) an arena-styled judgment mechanism guided by consolidated reference guidelines to ensure unbiased and consistent pairwise comparisons across heterogeneous judge agents; and (iii) an arena-driven claim-evolution module that adaptively generates more challenging and semantically controlled claims to probe LLMs’ factual robustness beyond fixed seed data. Across 16 state-of-the-art LLMs spanning seven model families, FactArena produces stable and interpretable rankings. Our analyses further reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence, highlighting the necessity of holistic evaluation. The proposed framework offers a scalable and trustworthy paradigm for diagnosing LLMs’ factual reasoning, guiding future model development, and advancing the reliable deployment of LLMs in safety-critical fact-checking applications.


[8] Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning cs.CL

Jinbo Hao, Kai Yang, Qingzhen Su, Yang Chen, Yifan Li

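TL;DR: This paper proposes a structured-reasoning method for mitigating prompt-induced hallucinations in LLMs. Building on a knowledge distillation chain-style model, it introduces a code module that guides knowledge-graph exploration and includes the code as part of the chain-of-thought prompt, forming an external knowledge input that gives the model more accurate, structured information. On this basis, the authors develop an improved knowledge distillation chain-style model to analyze and constrain the LLM's reasoning process and improve inference accuracy.

Details

Motivation: To address prompt-induced hallucination in LLMs, where the model generates inaccurate or unverifiable information.

Result: The method is evaluated empirically with GPT-4 and LLaMA-3.3 on multiple public datasets. Introducing the code module significantly improves the model's ability to capture contextual information and effectively mitigates prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively, and all three scores exceed 95% across several evaluation settings.

Insight: The main contribution is combining a code module with knowledge-graph exploration and integrating it into the chain-of-thought prompt as structured external knowledge, which guides and constrains the LLM's reasoning. Viewed objectively, introducing executable, structured code logic to make reasoning more verifiable and accurate is a novel way of combining programmatic thinking with knowledge representation, and may offer a new technical path for mitigating LLM hallucination.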

Abstract: To address hallucination issues in large language models (LLMs), this paper proposes a method for mitigating prompt-induced hallucinations. Building on a knowledge distillation chain-style model, we introduce a code module to guide knowledge-graph exploration and incorporate code as part of the chain-of-thought prompt, forming an external knowledge input that provides more accurate and structured information to the model. Based on this design, we develop an improved knowledge distillation chain-style model and leverage it to analyze and constrain the reasoning process of LLMs, thereby improving inference accuracy. We empirically evaluate the proposed approach using GPT-4 and LLaMA-3.3 on multiple public datasets. Experimental results demonstrate that incorporating code modules significantly enhances the model’s ability to capture contextual information and effectively mitigates prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively. Moreover, the proposed method achieves HIT@1, HIT@3, and HIT@5 scores exceeding 95% across several evaluation settings. These results indicate that the proposed approach substantially reduces hallucination behavior while improving the accuracy and verifiability of large language models.


[9] SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation cs.CL

Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You

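TL;DR: This paper proposes Synapse (Synergistic Associative Processing Semantic Encoding), a memory architecture that builds a dynamic memory graph modeled on the spreading-activation mechanism from cognitive science to strengthen the long-term memory of LLM agents. It combines lateral inhibition and temporal decay to dynamically select relevant memory sub-graphs, and uses a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. On the LoCoMo benchmark, Synapse significantly outperforms existing methods on complex temporal and multi-hop reasoning tasks.

Details

Motivation: Conventional retrieval-augmented methods leave long-term agent memory fragmented (the "Contextual Tunneling" problem). Drawing on cognitive science, the paper aims to build a dynamically associative memory system that improves LLM agents' temporal and multi-hop reasoning.

Result: On the LoCoMo benchmark, Synapse significantly surpasses state-of-the-art (SOTA) methods on complex temporal and multi-hop reasoning tasks, effectively addressing the Contextual Tunneling problem.

Insight: The contributions are: (1) modeling memory as a dynamic graph driven by spreading activation rather than static vector similarity; (2) introducing the cognitive mechanisms of lateral inhibition and temporal decay for dynamic memory selection; and (3) a Triple Hybrid Retrieval strategy that fuses geometric embeddings with graph traversal. Viewed objectively, the work tightly couples cognitive-science principles with LLM architecture and provides an interpretable computational framework for dynamic memory systems.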

Abstract: While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the “Contextual Tunneling” problem. Our code and data will be made publicly available upon acceptance.
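
Spreading activation, the mechanism the abstract above builds on, can be sketched generically as follows. This is a textbook-style toy, not the Synapse implementation: the per-hop attenuation `hop_decay`, the top-k cut used as a crude stand-in for lateral inhibition, and the function name are assumptions, and temporal decay is left out entirely.

```python
def spread_activation(graph, seeds, hop_decay=0.5, steps=2, keep_top=3):
    """Propagate activation from seed nodes along weighted edges.

    graph: node -> {neighbor: edge weight}
    seeds: node -> initial activation (e.g. from a query match)
    Returns the keep_top (node, activation) pairs, strongest first.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        incoming = {}
        for node, act in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                # Each hop passes on an attenuated share of the activation.
                incoming[neighbor] = incoming.get(neighbor, 0.0) + act * weight * hop_decay
        for node, act in incoming.items():
            activation[node] = activation.get(node, 0.0) + act
        frontier = incoming
    # Crude stand-in for lateral inhibition: only the strongest nodes survive.
    ranked = sorted(activation.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:keep_top]
```

A memory two hops from the query node can outrank a weakly linked direct neighbor, which is how relevance can emerge at retrieval time rather than from pre-computed links.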


[10] EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce cs.CL

Kaiyan Zhao, Zijie Meng, Zheyong Xie, Jin Duan, Yao Hu

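TL;DR: This paper proposes EComStage, a benchmark for evaluating LLMs' stage-wise reasoning in e-commerce scenarios. It covers three stages (Perception, Planning, and Action) and includes both customer-oriented and merchant-oriented tasks.

Details

Motivation: Existing benchmarks mainly evaluate whether LLM agents complete the final task and overlook the intermediate reasoning stages that matter for effective decision-making, so a new benchmark is needed to assess models across the full stage-wise reasoning process.

Result: On EComStage, which spans seven representative e-commerce tasks, more than 30 open-source and closed-source LLMs ranging from 1B to over 200B parameters are evaluated, revealing specific strengths and weaknesses by stage and by orientation.

Insight: The contribution is a unified e-commerce benchmark that is stage-wise (Perception, Planning, Action) and orientation-specific (customer/merchant). It yields finer-grained, more actionable insights than conventional final-task evaluation and helps with the targeted design and optimization of LLM agents in real applications.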

Abstract: Large Language Model (LLM)-based agents are increasingly deployed in e-commerce applications to assist customer services in tasks such as product inquiries, recommendations, and order management. Existing benchmarks primarily evaluate whether these agents successfully complete the final task, overlooking the intermediate reasoning stages that are crucial for effective decision-making. To address this gap, we propose EComStage, a unified benchmark for evaluating agent-capable LLMs across the comprehensive stage-wise reasoning process: Perception (understanding user intent), Planning (formulating an action plan), and Action (executing the decision). EComStage evaluates LLMs through seven separate representative tasks spanning diverse e-commerce scenarios, with all samples human-annotated and quality-checked. Unlike prior benchmarks that focus only on customer-oriented interactions, EComStage also evaluates merchant-oriented scenarios, including promotion management, content review, and operational support relevant to real-world applications. We evaluate a wide range of over 30 LLMs, spanning from 1B to over 200B parameters, including open-source models and closed-source APIs, revealing stage/orientation-specific strengths and weaknesses. Our results provide fine-grained, actionable insights for designing and optimizing LLM-based agents in real-world e-commerce settings.


[11] MiMo-V2-Flash Technical Report cs.CL | cs.AI

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen

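TL;DR: This paper introduces MiMo-V2-Flash, a Mixture-of-Experts model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. The model uses a hybrid architecture that interleaves sliding window attention with global attention, is pre-trained on 27 trillion tokens with multi-token prediction, and introduces a novel Multi-Teacher On-Policy Distillation method to scale post-training efficiently. It rivals top open-weight models such as DeepSeek-V3.2 and Kimi-K2 with only one half and one third of their total parameters, respectively, and by reusing multi-token prediction as the draft model for speculative decoding it reaches an acceptance length of up to 3.6 and a 2.6× decoding speedup.

Details

Motivation: To develop a parameter-efficient, fast-inference Mixture-of-Experts model with strong agentic capabilities, addressing the large parameter counts and high inference cost of current large models.

Result: Performance is on par with top open-weight models such as DeepSeek-V3.2 and Kimi-K2 with only one half and one third of their total parameters, respectively. At inference time, using multi-token prediction as the draft model for speculative decoding yields an acceptance length of up to 3.6 and a 2.6× decoding speedup.

Insight: The contributions are: (1) a hybrid architecture that interleaves sliding window attention with global attention to balance efficiency and global modeling; (2) a Multi-Teacher On-Policy Distillation paradigm in which domain-specialized teachers provide dense, token-level rewards, scaling post-training compute efficiently; and (3) reusing the multi-token prediction mechanism from pre-training as the draft model for speculative decoding, which markedly speeds up inference. Together these improve parameter efficiency and inference efficiency while keeping performance high.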

Abstract: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
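
To make the hybrid attention layout in the abstract above concrete, the sketch below lists layer types under a 5:1 ratio and builds the boolean mask a sliding-window layer would use. The 128-token window and the 5:1 ratio come from the abstract; where the global layer sits inside each group of six, and the helper names `layer_types` and `causal_mask`, are assumptions made for illustration.

```python
import numpy as np

def layer_types(num_layers: int, swa_per_global: int = 5) -> list[str]:
    """Assumed layout: every (swa_per_global + 1)-th layer is global."""
    group = swa_per_global + 1
    return ["global" if (i + 1) % group == 0 else "swa"
            for i in range(num_layers)]

def causal_mask(seq_len: int, window=None) -> np.ndarray:
    """mask[i, j] is True when query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                         # causal: no attention to the future
    if window is not None:
        mask = mask & ((i - j) < window)  # sliding window: only recent keys
    return mask
```

A sliding-window layer with `window=128` needs to keep only the last 128 keys and values per sequence, so in this layout only the global layers carry a KV cache that grows with context length.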


[12] LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark cs.CL | cs.AI

Ziyang Chen, Xing Wu, Junlong Jia, Chaochen Gao, Qi Fu

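TL;DR: This paper presents LongBench Pro, a more realistic and comprehensive bilingual long-context evaluation benchmark with 1,500 naturally occurring long-context samples in English and Chinese, covering 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. A human-model collaborative construction pipeline, combining frontier-LLM drafting with expert verification, balances quality and scalability. Evaluating 46 widely used long-context LLMs yields key findings on long-context optimization, cross-lingual misalignment of effective context length, and the effect of the thinking paradigm.

Details

Motivation: Current long-context benchmarks trade off scalability against realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly and hard to scale to extreme lengths and diverse scenarios. A more realistic and comprehensive long-context benchmark is therefore needed.

Result: Evaluating 46 widely used long-context LLMs on LongBench Pro shows that long-context optimization contributes more to comprehension than parameter scaling; that effective context length is typically shorter than the claimed length, with cross-lingual misalignment; and that the thinking paradigm mainly helps models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off.

Insight: The contributions are a bilingual long-context benchmark built from natural samples that supports fine-grained analysis, and a human-model collaborative construction pipeline that balances quality and scalability. The evaluation also surfaces key performance characteristics of long-context models, such as the importance of long-context optimization, the limits of effective length, and the effect of the thinking paradigm, giving practical guidance for future work.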

Abstract: The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative Construction pipeline: frontier LLMs draft challenging questions and reference answers, along with design rationales and solution processes, to reduce the cost of expert verification. Experts then rigorously validate correctness and refine problematic cases. Evaluating 46 widely used long-context LLMs on LongBench Pro yields three findings: (1) long-context optimization contributes more to long-context comprehension than parameter scaling; (2) effective context length is typically shorter than the claimed context length, with pronounced cross-lingual misalignment; and (3) the “thinking” paradigm helps primarily models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off. In summary, LongBench Pro provides a robust testbed for advancing long-context understanding.


[13] Revisiting Data Compression with Language Modeling cs.CL

Chen-Han Tsai

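TL;DR: This paper explores the potential of LLMs for data compression. By trying different methods, and without additional training, it reaches a new SOTA adjusted compression rate of about 18% on the enwik9 dataset, and it shows that LLMs remain competitive when compressing non-English text, code, and byte stream sequences.

Details

Motivation: Although prior work has shown the promise of LLMs for compressing text and multi-modal data, practical challenges remain in replacing existing compression algorithms with them. The paper explores how to achieve a lower adjusted compression rate with LLMs.

Result: The approach reaches an adjusted compression rate of about 18% on enwik9, a new SOTA, without additional model training. On non-natural text sequences (such as code and byte streams) it remains competitive when configured appropriately.

Insight: The contribution is improving LLM compression performance through methodological choices (such as configuration) rather than model retraining, and extending the validation of LLM compression to non-English text, code, and other domains, which provides a reference for practical use.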

Abstract: In this report, we investigate the potential use of large language models (LLMs) in the task of data compression. Previous works have demonstrated promising results in applying LLMs towards compressing not only text, but also a wide range of multi-modal data. Despite the favorable performance achieved, there still remain several practical questions that pose a challenge towards replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel in compressing data in text-dominant domains, their ability in compressing non-natural text sequences still remains competitive if configured in the right way.
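
The link between language modeling and compression that the abstract above relies on is that an entropy coder driven by a model's next-token probabilities needs roughly -log2 p bits per token. The sketch below shows that arithmetic together with a compression-rate helper. Treating the "adjusted" rate as one that also counts the model's own size is an assumption here, since the abstract does not define the term, and the function names and byte counts are hypothetical.

```python
import math

def ideal_code_length_bits(token_probs) -> float:
    """Bits an ideal entropy coder needs, given the probability the
    model assigned to each token that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)

def compression_rate(compressed_bytes: float, raw_bytes: float,
                     model_bytes: float = 0.0) -> float:
    """Compressed size over raw size; model_bytes > 0 gives the
    (assumed) adjusted rate that charges for the model as well."""
    return (compressed_bytes + model_bytes) / raw_bytes
```

For example, tokens predicted with probabilities 0.5 and 0.25 cost 1 + 2 = 3 bits in total; a better-calibrated model assigns higher probabilities and therefore produces a shorter encoding.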


[14] Memorization, Emergence, and Explaining Reversal Failures: A Controlled Study of Relational Semantics in LLMs cs.CL

Yihua Zhu, Qianying Liu, Jiaxin Wang, Fei Cheng, Chaoran Liu

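TL;DR: Using a Knowledge Graph-based synthetic framework, this paper studies how autoregressive LLMs handle relational semantics, in particular whether they learn the logic of symmetric and inverse relations. With sufficient logic-bearing supervision, relational semantics emerge even in shallow models, and successful generalization correlates with stable intermediate-layer signals; reversal failures stem mainly from autoregressive order bias rather than missing inversion semantics.

Details

Motivation: To investigate whether autoregressive LLMs learn the logical semantics (such as symmetry and inversion logic) of relational words (such as father/son, friend), and whether reversal-type failures arise from missing relational semantics or from left-to-right order bias.

Result: Training GPT-style models on the synthetic framework shows that relational semantics emerge with sufficient logic-bearing supervision and that generalization aligns with stable intermediate-layer signals; order-matched forward/reverse tests and a diffusion baseline indicate that reversal failures are primarily driven by autoregressive order bias.

Insight: The contribution is a controlled synthetic framework that separates the learning of relational semantics from order bias, revealing a phase transition in semantic emergence and the importance of intermediate-layer signals, and offering a new perspective on LLMs' logical reasoning failures.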

Abstract: Autoregressive LLMs perform well on relational tasks that require linking entities via relational words (e.g., father/son, friend), but it is unclear whether they learn the logical semantics of such relations (e.g., symmetry and inversion logic) and, if so, whether reversal-type failures arise from missing relational semantics or left-to-right order bias. We propose a controlled Knowledge Graph-based synthetic framework that generates text from symmetric/inverse triples, train GPT-style autoregressive models from scratch, and evaluate memorization, logical inference, and in-context generalization to unseen entities to address these questions. We find a sharp phase transition in which relational semantics emerge with sufficient logic-bearing supervision, even in shallow (2-3 layer) models, and that successful generalization aligns with stable intermediate-layer signals. Finally, order-matched forward/reverse tests and a diffusion baseline indicate that reversal failures are primarily driven by autoregressive order bias rather than deficient inversion semantics.


[15] Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning cs.CL | cs.LG

Junseok Kim, Nakyeong Yang, Kyungmin Min, Kyomin Jung

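TL;DR: This paper proposes Reliability-Aware Adaptive Self-Consistency (ReASC), which aims to improve the reliability of LLM reasoning while substantially lowering inference cost. ReASC reframes adaptive sampling from a response-counting strategy to an evidence-sufficiency strategy and uses response-level confidence for information aggregation, achieving the best accuracy-cost trade-off across multiple models and datasets.

Details

Motivation: Self-consistency improves reasoning reliability through multi-sample aggregation but incurs substantial inference cost. Existing adaptive self-consistency methods ease this by adjusting the sampling budget, but they rely on count-based stopping rules that treat all responses equally, which often leads to unnecessary sampling.

Result: Experiments on five models and four datasets show that ReASC consistently achieves the best accuracy-cost trade-off compared with existing baselines, improving inference efficiency across model scales from 3B to 27B parameters. For example, with Gemma-3-4B-it on GSM8K, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy.

Insight: ReASC shifts the core of adaptive sampling from simple response counting to judging evidence sufficiency, using two stages: a single-sample decision stage that resolves instances confidently answerable from one response, and a reliability-aware accumulation stage that aggregates responses by jointly using their frequency and confidence. This is a more principled way to aggregate information and effectively cuts unnecessary sampling.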

Abstract: Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.
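
The two-stage procedure in the abstract above can be sketched as confidence-weighted voting with early stopping. The thresholds `single_conf` and `min_share`, the minimum sample count, and the rule "stop once the leading answer holds a large enough share of the total confidence" are all assumptions for illustration; the paper's actual sufficiency criterion is not given in the abstract.

```python
from collections import defaultdict

def adaptive_vote(sample, max_samples=10, single_conf=0.9,
                  min_share=0.7, min_samples=3):
    """sample() returns (answer, confidence in [0, 1]).

    Returns (chosen answer, number of samples drawn).
    """
    # Stage 1: accept a single response if it is confident enough.
    answer, conf = sample()
    if conf >= single_conf:
        return answer, 1
    # Stage 2: accumulate confidence-weighted votes until one answer dominates.
    weights = defaultdict(float)
    weights[answer] += conf
    used = 1
    while used < max_samples:
        answer, conf = sample()
        weights[answer] += conf
        used += 1
        total = sum(weights.values())
        leader = max(weights, key=weights.get)
        if used >= min_samples and total > 0 and weights[leader] / total >= min_share:
            break
    return max(weights, key=weights.get), used
```

Compared with a count-based rule, a few high-confidence agreeing responses can end sampling sooner, while low-confidence agreement keeps it going.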


[16] Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning cs.CL | cs.AI

Nathanaël Carraz Rakotonirina, Ren Pang, Neha Anna John, Michael Bohlke-Schneider, Momchil Hardalov

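TL;DR: This paper proposes a multi-stage efficient reasoning method to address "overthinking" in LLM chain-of-thought (CoT) reasoning, where intermediate reasoning becomes unnecessarily long and can even hurt performance. The method combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning that uses an adaptive length penalty, and introduces a lightweight reward function that encourages stopping after the first correct answer and self-verifying only when it helps.

Details

Motivation: LLM CoT reasoning tends to overthink: intermediate reasoning grows too long, raising computation cost without improving accuracy and sometimes lowering it. A method is needed that substantially shortens responses while preserving accuracy.

Result: Evaluated on seven diverse reasoning tasks, the method reduces response length by an average of 28% for 8B models and 40% for 32B models, with only minor accuracy drops of 1.6 and 2.5 points, respectively. It scores 76.6 on the area under the Overthinking-Adjusted Accuracy curve (AUC_OAA), 5 points above the base model and 2.5 points above the second-best approach, a better trade-off than existing efficient reasoning methods.

Insight: The contribution is combining supervised fine-tuning with reinforcement learning under an adaptive length penalty, using a lightweight reward function to control reasoning length dynamically and encourage early stopping, which cuts computation while keeping performance loss small. Viewed objectively, the multi-stage training strategy balances reasoning accuracy and efficiency well and offers a simple, effective approach to efficient reasoning.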

Abstract: The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains or sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning – via rejection sampling or reasoning trace reformatting – with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the accuracy-response length trade-off. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve (AUC_OAA) – 5 points above the base model and 2.5 points above the second-best approach.
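
One simple way to express a reward that penalizes tokens generated after the first correct answer, as the abstract above describes, is shown below. The linear penalty, the `penalty_per_token` value, the clipping at zero, and the function name are assumptions for illustration; the paper's adaptive penalty and its handling of beneficial self-verification are not reproduced here.

```python
def length_aware_reward(is_correct: bool, total_tokens: int,
                        first_correct_at, penalty_per_token: float = 0.001) -> float:
    """Reward 1.0 for a correct response, minus a charge for every token
    produced after the position where the correct answer first appeared.

    first_correct_at: token index of the first correct answer, or None.
    """
    if not is_correct or first_correct_at is None:
        return 0.0
    extra_tokens = max(0, total_tokens - first_correct_at)
    return max(0.0, 1.0 - penalty_per_token * extra_tokens)
```

A response that reaches the right answer at token 300 and keeps going to token 500 would score 0.8 under these assumed settings, while the same answer stopped at token 300 would score 1.0.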


[17] Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy cs.CLPDF

Hosein Hasani, Mohammadali Banayeeanzade, Ali Nafisi, Sadegh Mohammadian, Fatemeh Askari

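TL;DR: This paper studies systematic limitations of large language models (LLMs) on large-scale counting tasks and proposes a test-time strategy inspired by System-2 cognitive processes. The strategy decomposes a large counting task into smaller, independent sub-problems that the model can solve reliably. Observational and causal mediation analyses reveal the underlying mechanism, covering how latent counts are computed, stored, transferred, and aggregated. Experiments show that the approach lets LLMs surpass architectural limits and reach high accuracy on large-scale counting tasks.

Details

Motivation: Despite strong performance on complex mathematical problems, LLMs show systematic limitations on counting tasks. These stem from the depth constraints of the transformer architecture: counting is performed across layers, so precision degrades on larger counting problems. The paper aims to address this limitation.

Result: Experiments show that the proposed System-2 strategy enables LLMs to surpass architectural limitations and achieve high accuracy on large-scale counting tasks.

Insight: The paper proposes a generalizable test-time strategy that emulates System-2 cognition through task decomposition, and uses mechanistic interpretability methods such as causal mediation analysis to explain how System-2-like counting works inside LLMs, offering a new route to improving and understanding their reasoning behavior.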

Abstract: Large language models (LLMs), despite strong performance on complex mathematical problems, exhibit systematic limitations in counting tasks. This issue arises from architectural limits of transformers, where counting is performed across layers, leading to degraded precision for larger counting problems due to depth constraints. To address this limitation, we propose a simple test-time strategy inspired by System-2 cognitive processes that decomposes large counting tasks into smaller, independent sub-problems that the model can reliably solve. We evaluate this approach using observational and causal mediation analyses to understand the underlying mechanism of this System-2-like strategy. Our mechanistic analysis identifies key components: latent counts are computed and stored in the final item representations of each part, transferred to intermediate steps via dedicated attention heads, and aggregated in the final stage to produce the total count. Experimental results demonstrate that this strategy enables LLMs to surpass architectural limitations and achieve high accuracy on large-scale counting tasks. This work provides mechanistic insight into System-2 counting in LLMs and presents a generalizable approach for improving and understanding their reasoning behavior.
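
The decomposition described above can be sketched as a simple split, count, and aggregate loop. This is an illustrative sketch only: `count_by_parts`, `part_size`, and the exact-counting stand-in `exact_count` are hypothetical names, and in the paper's setting each per-part count would come from the LLM rather than from Python.

```python
from typing import Callable, List, Sequence, Tuple


def count_by_parts(
    items: Sequence[str],
    target: str,
    part_size: int,
    count_part: Callable[[Sequence[str], str], int],
) -> Tuple[int, List[int]]:
    """Split a long list into small independent parts, count the target in
    each part separately, then aggregate the partial counts."""
    parts = [items[i:i + part_size] for i in range(0, len(items), part_size)]
    partial_counts = [count_part(part, target) for part in parts]
    return sum(partial_counts), partial_counts


def exact_count(part: Sequence[str], target: str) -> int:
    """Stand-in for a model call on one small sub-problem (assumption)."""
    return sum(1 for item in part if item == target)


total, partials = count_by_parts(["apple", "pear"] * 25, "apple", 10, exact_count)
```

Here each part holds at most ten items, so every sub-problem stays within a range that a model could plausibly count reliably, and the final sum is the only aggregation step.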


[18] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation cs.CLPDF

Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng

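TL;DR: This paper proposes Stable-RAG, which mitigates hallucinations in retrieval-augmented generation (RAG) caused by changes in the order of retrieved documents. Using permutation sensitivity estimation, the method runs the generator under multiple retrieval orders, clusters the hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these results to align hallucinated outputs, improving answer consistency and accuracy.

Details

Motivation: Existing robust RAG methods focus on making LLMs robust to low-quality retrieval and on mitigating positional bias in long contexts, but they do not directly address the model's sensitivity to the order of retrieved documents. This sensitivity causes answers to vary substantially, and hallucinations to appear, across permutations.

Result: Experiments on three QA datasets show that, compared with baselines, Stable-RAG significantly improves answer accuracy and reasoning consistency, and generalizes more robustly across datasets, retrievers, and input lengths.

Insight: The novelty lies in systematically identifying and quantifying, for the first time, retrieval-permutation sensitivity in RAG, and in a stabilization method based on multi-permutation inference and decoding from hidden-state cluster centers, which captures the dominant reasoning pattern to align outputs and reduce permutation-induced hallucinations. This offers a new approach to making RAG systems more stable and reliable.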

Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for reducing factual hallucinations in large language models (LLMs), yet little is known about how the order of retrieved documents affects model behavior. We empirically show that under Top-5 retrieval with the gold document included, LLM answers vary substantially across permutations of the retrieved set, even when the gold document is fixed in the first position. This reveals a previously underexplored sensitivity to retrieval permutations. Although robust RAG methods primarily focus on enhancing LLM robustness to low-quality retrieval and mitigating positional bias to distribute attention fairly over long contexts, neither approach directly addresses permutation sensitivity. In this paper, we propose Stable-RAG, which exploits permutation sensitivity estimation to mitigate permutation-induced hallucinations. Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these reasoning results to align hallucinated outputs toward the correct answer, encouraging the model to produce consistent and accurate predictions across document permutations. Experiments on three QA datasets show that Stable-RAG significantly improves answer accuracy, reasoning consistency and robust generalization across datasets, retrievers, and input lengths compared with baselines.
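
Permutation sensitivity, the phenomenon this abstract starts from, can be measured at the answer level with a short loop over document orders. This is a minimal sketch under stated assumptions: `permutation_agreement` and the `generate` callable are hypothetical, and it compares final answers only; it does not implement Stable-RAG's hidden-state clustering or cluster-center decoding.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List, Sequence, Tuple


def permutation_agreement(
    docs: Sequence[str],
    generate: Callable[[List[str]], str],
) -> Tuple[str, float]:
    """Run the generator once per ordering of the retrieved documents and
    return the most common answer with the fraction of orderings that
    produced it (1.0 means the answer is order-invariant)."""
    answers = [generate(list(order)) for order in permutations(docs)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)


# Toy generator that is maximally order-sensitive: it echoes the last document.
_, agreement = permutation_agreement(["a", "b", "c"], lambda d: d[-1])
```

The number of orderings grows factorially (120 for Top-5 retrieval), so a practical variant would sample a subset of permutations rather than enumerate all of them.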


[19] Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners cs.CLPDF

Yihong Liu, Raoyuan Zhao, Hinrich Schütze, Michael A. Hedderich

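TL;DR: This paper systematically studies latent reasoning in large reasoning models across 11 languages. It finds that models can reach the correct answer before generating the full chain of thought, indicating multilingual latent reasoning. However, the ability is stronger in high-resource languages and weaker in low-resource ones, and the internal reasoning pathway tends to be English-centered.

Details

Motivation: To investigate latent reasoning in large reasoning models in multilingual settings, i.e., whether a model can reach the correct answer through non-verbal computation in its hidden states before the explicit chain-of-thought text is complete, and to fill the gap left by prior work that focuses mainly on English.

Result: On mathematical reasoning tasks in 11 languages, a truncation-based strategy is used to measure stepwise latent prediction formation. Multilingual latent reasoning exists but is uneven: strong in resource-rich languages (such as English), weaker in low-resource ones, and less observable on harder benchmarks. Representational analyses show that the evolution of predictions is highly consistent across languages and aligns with English.

Insight: The novelty lies in the first systematic quantification of multilingual latent reasoning. It reveals that reasoning ability depends on language resources and finds that, despite large surface-level disparities, the internal reasoning mechanism may share an English-centered latent pathway, offering a new perspective on knowledge representation and transfer in multilingual models.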

Abstract: Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks, often attributed to their capability to generate explicit chain-of-thought (CoT) explanations. However, recent work shows that LRMs often arrive at the correct answer before completing these textual reasoning steps, indicating the presence of latent reasoning – internal, non-verbal computation encoded in hidden states. While this phenomenon has been explored in English, its multilingual behavior remains largely unknown. In this paper, we conduct a systematic investigation of multilingual latent reasoning in LRMs across 11 languages. Using a truncation-based strategy, we examine how the correct answer emerges as the model is given only partial reasoning traces, allowing us to measure stepwise latent prediction formation. Our results reveal clear evidence of multilingual latent reasoning, though unevenly: strong in resource-rich languages, weaker in low-resource ones, and broadly less observable on harder benchmarks. To understand whether these differences reflect distinct internal mechanisms, we further perform representational analyses. Despite surface-level disparities, we find that the internal evolution of predictions is highly consistent across languages and broadly aligns with English – a pattern suggesting an English-centered latent reasoning pathway.


[20] SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering cs.CL | cs.AIPDF

Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao

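TL;DR: This paper proposes SentGraph, a retrieval-augmented generation framework built on a sentence-level graph, to address the incomplete evidence chains and logical incoherence of conventional chunk-based retrieval in multi-hop question answering. The framework builds a hierarchical sentence graph that explicitly models fine-grained logical relationships between sentences, and retrieves relevant evidence through graph-guided evidence selection and path expansion.

Details

Motivation: Traditional RAG works well for single-hop QA, but in multi-hop QA, which requires combining evidence from multiple documents, chunk-based retrieval often returns irrelevant and logically incoherent context, leading to incomplete evidence chains and reasoning errors.

Result: Extensive experiments on four multi-hop QA benchmarks demonstrate the effectiveness of SentGraph and confirm the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

Insight: The novelty lies in adapting Rhetorical Structure Theory to distinguish nucleus sentences from satellite sentences, and in building topic-level subgraphs with cross-document entity bridges, which enables fine-grained sentence-level modeling of logical relationships and graph-guided path expansion for evidence retrieval.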

Abstract: Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.


[21] MMFormalizer: Multimodal Autoformalization in the Wild cs.CLPDF

Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin

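TL;DR: This paper proposes MMFormalizer, a multimodal autoformalization method that translates real-world natural-language mathematics problems, especially those involving physical scenes, into formal statements. Through adaptive entity grounding, recursive construction, and axiom composition, it infers hidden constraints (such as mass or energy) from visual elements, extending autoformalization from text-only to multimodal settings.

Details

Motivation: Autoformalization in the wild faces a fundamental challenge: the physical world is multimodal, and hidden physical constraints must be inferred from visual elements, which text-only autoformalization methods struggle to handle.

Result: Evaluation uses a new benchmark, PhyX-AF, with 115 samples drawn from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry. Frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy. GPT-5 excels in physical reasoning, while geometry remains the most challenging domain.

Insight: The contribution is a unified multimodal autoformalization framework that bridges perception and formal reasoning through recursive grounding and axiom composition. According to the authors, it is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), relativity, quantum mechanics, and thermodynamics.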

Abstract: Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io


[22] Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis cs.CL | cs.AI | cs.LGPDF

Choonghan Kim, Hyunmin Hwang, Hangeol Chang, Jaemin Kim, Jinse Park

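TL;DR: The paper proposes Dementia-R1, a reinforcement-learning-based framework for longitudinal dementia prognosis. A Cold-Start RL strategy pre-trains the model to predict clinical indices, strengthening its ability to reason about disease progression and addressing the weakness of large language models in reasoning over non-monotonic symptom trajectories.

Details

Motivation: LLMs perform well on clinical text understanding but poorly on longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations of symptom evolution, and direct reinforcement learning is limited by sparse binary rewards.

Result: Dementia-R1 achieves an F1 score of 77.03% on real-world unstructured clinical datasets. On the ADNI benchmark, its 7B model performs on par with GPT-4o and effectively captures fluctuating cognitive trajectories.

Insight: The main innovation is the Cold-Start RL strategy, which pre-trains the model to predict verifiable clinical indices in order to strengthen reasoning about disease progression and overcome the sparse-reward problem. Combining reinforcement learning with clinical index extraction offers a new approach to longitudinal medical prediction.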

Abstract: While Large Language Models (LLMs) have shown strong performance on clinical text understanding, they struggle with longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations for symptom evolution, while direct Reinforcement Learning (RL) is hindered by sparse binary rewards. To address this challenge, we introduce Dementia-R1, an RL-based framework for longitudinal dementia prognosis from unstructured clinical notes. Our approach adopts a Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories, enhancing the capability to reason about disease progression before determining the final clinical status. Extensive experiments demonstrate that Dementia-R1 achieves an F1 score of 77.03% on real-world unstructured clinical datasets. Notably, on the ADNI benchmark, our 7B model rivals GPT-4o, effectively capturing fluctuating cognitive trajectories. Code is available at https://anonymous.4open.science/r/dementiar1-CDB5


[23] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models cs.CL | cs.HCPDF

Lecheng Gong, Weimin Fang, Ting Yang, Dongjie Tao, Chunxiao Guo

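TL;DR: This paper presents MedDialogRubrics, a comprehensive benchmark and evaluation framework for assessing the information-gathering and diagnostic reasoning abilities of large language models in multi-turn medical consultations. The benchmark contains 5,200 synthetically constructed patient cases and more than 60,000 fine-grained evaluation rubrics generated by LLMs and refined by clinical experts. A multi-agent system synthesizes realistic patient records, avoiding the privacy concerns of real electronic health records, and the paper comprehensively evaluates current state-of-the-art models.

Details

Motivation: Existing benchmarks and evaluation frameworks for the information-gathering and diagnostic reasoning abilities of medical LLMs have not been rigorously evaluated and fall short. A new, comprehensive benchmark is needed to fill this gap.

Result: A comprehensive evaluation of state-of-the-art models across multiple assessment dimensions shows that current models face substantial challenges. Improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base model.

Insight: The innovations are: 1) a multi-agent system that synthesizes realistic patient cases, with a dynamic guidance mechanism that detects and corrects hallucinations to keep cases internally coherent and clinically plausible; 2) a structured rubric-generation pipeline, LLM-based and expert-annotated, that combines Evidence-Based Medicine guidelines with rejection sampling to derive a prioritized set of rubric items ("must-ask" items) for each case; 3) a large-scale, fine-grained benchmark that gives medical conversational AI a new evaluation tool and direction.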

Abstract: Medical conversational AI plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks for assessing the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs) have not been rigorously evaluated. To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLMs. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM-based and expert-annotated rubric-generation pipeline that retrieves Evidence-Based Medicine (EBM) guidelines and uses rejection sampling to derive a prioritized set of rubric items ("must-ask" items) for each case. We perform a comprehensive evaluation of state-of-the-art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base model.


[24] Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning cs.CLPDF

Sindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq, Yani Ioannou, Shaina Raza

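TL;DR: This paper proposes F-DPO (Factuality-aware Direct Preference Optimization), a method for reducing hallucinations in large language models (LLMs). It is a simple extension of Direct Preference Optimization (DPO) that uses only binary factuality labels: a label-flipping transformation corrects misordered preference pairs, and a factuality-aware margin emphasizes correctness differences, improving factuality and lowering hallucination rates.

Details

Motivation: Preference alignment methods such as RLHF and DPO improve instruction following, but they can reinforce hallucinations when they reward fluency and confidence over factual correctness. A method that optimizes factuality directly is therefore needed.

Result: Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, the hallucination rate drops fivefold (from 0.424 to 0.084) and the factuality score rises by 50% (from 5.26 to 7.90). On TruthfulQA, Qwen2.5-14B gains 17% in MC1 accuracy (0.500 to 0.585) and 49% in MC2 accuracy (0.357 to 0.531).

Insight: Using only binary factuality labels, simple label flipping and margin adjustment correct the preference pairs. No auxiliary reward model, token-level annotations, or multi-stage training is required, making the method simple and broadly applicable.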

Abstract: Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates by five times (from 0.424 to 0.084) while improving factuality scores by 50 percent (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B achieves plus 17 percent MC1 accuracy (0.500 to 0.585) and plus 49 percent MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
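
The two F-DPO ingredients named in this abstract, label flipping and a factuality-aware margin, can be sketched on top of the standard per-pair DPO loss. This is a hedged sketch: the function name, the additive form of the margin, and the `beta` and `margin` values are assumptions, since the abstract does not give the exact formula. The inputs `logratio_*` stand for log pi_theta(y|x) - log pi_ref(y|x) of each response.

```python
import math


def f_dpo_loss(
    logratio_chosen: float,
    logratio_rejected: float,
    fact_chosen: int,    # binary factuality label: 1 = factual, 0 = not
    fact_rejected: int,
    beta: float = 0.1,   # DPO temperature (assumed value)
    margin: float = 1.0,  # hypothetical margin strength
) -> float:
    """Per-pair loss: standard DPO plus label flipping and a margin that
    applies only when the two responses differ in factuality."""
    # (i) Label flipping: the chosen response must never be less factual.
    if fact_chosen < fact_rejected:
        logratio_chosen, logratio_rejected = logratio_rejected, logratio_chosen
        fact_chosen, fact_rejected = fact_rejected, fact_chosen
    # (ii) Factuality gap is 0 (same label) or 1 (clear correctness difference).
    gap = fact_chosen - fact_rejected
    z = beta * (logratio_chosen - logratio_rejected) - margin * gap
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    return math.log1p(math.exp(-z))
```

When both responses share the same factuality label the gap is zero and the expression reduces to the standard DPO loss, matching the property stated in the abstract. A production version would use a numerically stable softplus and operate on batched tensors.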


[25] Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage cs.CL | cs.AI | cs.LGPDF

Junhao Hu, Fangze Li, Mingtao Xu, Feifan Meng, Shiju Zhao

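TL;DR: This paper finds that when sparse-attention algorithms are applied in the decode stage of large language models (LLMs), information loss can make generated sequences significantly longer and thereby increase end-to-end complexity, a phenomenon the authors call "Less is Less" (Lil). To mitigate it, they propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding, substantially reducing token consumption.

Details

Motivation: LLM inference efficiency is in high demand, and the decode stage dominates latency. Existing sparse-attention algorithms aim to reduce decode complexity, but the authors find that the resulting information loss can lengthen sequences and raise end-to-end cost instead.

Result: On reasoning-intensive benchmarks, the proposed early-stopping algorithm reduces token consumption by up to 90% with an accuracy drop of less than 2%.

Insight: The paper reveals the "Lil" phenomenon, in which sparse attention can backfire in the decode stage, and proposes an early-stopping detection mechanism based on the trade-off between information loss and gain, offering a new approach to efficient sparse decoding.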

Abstract: Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term "Less is Less" (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.


[26] Detecting Hallucinations in Retrieval-Augmented Generation via Semantic-level Internal Reasoning Graph cs.CLPDF

Jianpeng Hu, Yanzeng Li, Jialun Zhong, Wenfa Qi, Lei Zou

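TL;DR: This paper proposes a detection method based on a semantic-level internal reasoning graph for identifying faithfulness hallucinations in retrieval-augmented generation (RAG) systems. It extends the layer-wise relevance propagation algorithm to the semantic level and builds an internal reasoning graph from attribution vectors to represent dependencies more accurately. It also designs a general framework based on a small pre-trained language model that uses the dependencies in LLM reasoning for training and detection, and that can dynamically adjust the pass rate of correct samples through a threshold.

Details

Motivation: LLM-based RAG systems effectively reduce factuality hallucinations, but faithfulness hallucinations persist. Existing detection methods either neglect the model's internal reasoning process or handle the relevant features coarsely, making it hard for discriminators to learn.

Result: Experiments show that the method achieves better overall performance than state-of-the-art baselines on the RAGTruth and Dolly-15k benchmarks.

Insight: The novelty lies in extending layer-wise relevance propagation from the token level to the semantic level, yielding a more faithful representation of semantic dependencies, and in a lightweight general framework whose threshold can be adjusted dynamically to tune detection, giving hallucination detection a finer-grained view of internal reasoning.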

Abstract: Retrieval-augmented generation (RAG) systems based on large language models (LLMs) have made significant progress. They can effectively reduce factuality hallucinations, but faithfulness hallucinations still exist. Previous methods for detecting faithfulness hallucinations either neglect to capture the models' internal reasoning processes or handle those features coarsely, making it difficult for discriminators to learn. This paper proposes a semantic-level internal reasoning graph-based method for detecting faithfulness hallucinations. Specifically, we first extend the layer-wise relevance propagation algorithm from the token level to the semantic level, constructing an internal reasoning graph based on attribution vectors. This provides a more faithful semantic-level representation of dependency. Furthermore, we design a general framework based on a small pre-trained language model to utilize the dependencies in the LLM's reasoning for training and hallucination detection, which can dynamically adjust the pass rate of correct samples through a threshold. Experimental results demonstrate that our method achieves better overall performance compared to state-of-the-art baselines on RAGTruth and Dolly-15k.


[27] Do LLMs Encode Functional Importance of Reasoning Tokens? cs.CL | cs.AI | cs.LGPDF

Janvijay Singh, Dilek Hakkani-Tür

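TL;DR: This paper proposes "greedy pruning", a diagnostic method for probing whether large language models internally encode the functional importance of tokens in a reasoning chain. The method iteratively removes the reasoning tokens whose removal least affects model likelihood, yielding length-controlled reasoning chains, and its effectiveness is validated in a knowledge distillation framework.

Details

Motivation: Long reasoning chains generated by LLMs are computationally expensive and make it hard to isolate functionally relevant reasoning. The paper asks whether models internally encode token-level functional importance, a question on which existing methods offer limited insight.

Result: In a distillation framework, student models trained on pruned reasoning chains outperform a frontier-model-supervised compression baseline at matched reasoning lengths. The analysis also shows that attention scores can predict greedy pruning ranks.

Insight: The novelty is greedy pruning as a diagnostic method for compressing reasoning chains and probing internal representations. The key finding is that models do encode a nontrivial functional importance structure over reasoning tokens, and that attention is related to it.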

Abstract: Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevant reasoning. Prior work on compact reasoning shortens such chains through probabilistic sampling, heuristics, or supervision from frontier models, but offers limited insight into whether models internally encode token-level functional importance for answer generation. We address this gap diagnostically and propose greedy pruning, a likelihood-preserving deletion procedure that iteratively removes reasoning tokens whose removal minimally degrades model likelihood under a specified objective, yielding length-controlled reasoning chains. We evaluate pruned reasoning in a distillation framework and show that students trained on pruned chains outperform a frontier-model-supervised compression baseline at matched reasoning lengths. Finally, our analysis reveals systematic pruning patterns and shows that attention scores can predict greedy pruning ranks, further suggesting that models encode a nontrivial functional importance structure over reasoning tokens.
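
The greedy deletion loop described in this abstract can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: `greedy_prune` and `toy_score` are hypothetical names, and the `score` callable stands in for the model likelihood under the paper's objective, which would require a real LLM.

```python
from typing import Callable, List, Sequence


def greedy_prune(
    tokens: Sequence[str],
    score: Callable[[List[str]], float],
    target_len: int,
) -> List[str]:
    """Repeatedly delete the single token whose removal leaves the highest
    score (i.e. degrades it least) until only target_len tokens remain."""
    kept = list(tokens)
    while len(kept) > target_len:
        # Try deleting each position and keep the best-scoring deletion.
        best_i = max(
            range(len(kept)),
            key=lambda i: score(kept[:i] + kept[i + 1:]),
        )
        del kept[best_i]
    return kept


# Toy stand-in for model likelihood: a fixed importance weight per token.
IMPORTANCE = {"2": 1.0, "plus": 0.8, "4": 1.0, "is": 0.3, "well": 0.1}


def toy_score(chain: List[str]) -> float:
    return sum(IMPORTANCE.get(tok, 0.0) for tok in chain)


pruned = greedy_prune(["well", "2", "plus", "2", "is", "4"], toy_score, 4)
```

Each step costs one score evaluation per remaining token, so with a real model the procedure needs on the order of n squared likelihood calls for a chain of n tokens. The order in which tokens are deleted gives the pruning ranks that the abstract compares against attention scores.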


[28] ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation cs.CL | cs.AI | cs.LGPDF

Peiran Li, Jan Fillies, Adrian Paschke

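TL;DR: This paper proposes ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs) for controllable, class-specific augmentation of toxic language data, with the goal of improving the robustness of toxicity classification.

Details

Motivation: To address the difficulty of controllable, class-specific toxic language augmentation under limited supervision and distributional skew, as well as the mode collapse and semantic drift common in GAN-based augmentation.

Result: On four hate speech benchmarks, ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods.

Insight: The innovations include a two-step directional training strategy and the use of LLM-generated neutral texts as semantic ballast: neutral exemplars are selected dynamically to provide balanced guidance, and toxic samples are explicitly optimized to diverge from them, reinforcing class-specific contrastive signals.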

Abstract: Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.


[29] Limited Linguistic Diversity in Embodied AI Datasets cs.CL | cs.AI | cs.ROPDF

Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore

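TL;DR: This paper presents a systematic audit of several widely used Vision-Language-Action (VLA) datasets, quantifying the lexical variety, duplication, semantic similarity, and syntactic complexity of their instruction language. It finds that current datasets rely heavily on repetitive, template-like instructions and offer limited linguistic diversity.

Details

Motivation: The diversity of the language signal in VLA models is poorly documented. The paper aims to characterize the actual properties and limitations of instruction language in training and evaluation data.

Result: The analysis shows that many datasets contain highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. The findings are descriptive documentation; no specific benchmark results or state-of-the-art comparisons are reported.

Insight: The novelty lies in the first systematic quantification of linguistic diversity in VLA datasets, which lays the groundwork for more detailed dataset reporting, more principled dataset selection, and targeted augmentation strategies that broaden language coverage.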

Abstract: Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
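
Two of the dimensions listed above, lexical variety and duplication, can be approximated with very simple statistics. The sketch below is an assumption about what such an audit might compute, not the paper's metrics: it uses the type-token ratio and the share of exact duplicates after lowercasing and whitespace normalization, and `lexical_stats` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def lexical_stats(instructions: List[str]) -> Dict[str, float]:
    """Compute a type-token ratio and an exact-duplicate rate for a
    non-empty list of instruction strings."""
    # Normalize case and whitespace so trivially different copies match.
    normalized = [" ".join(text.lower().split()) for text in instructions]
    tokens = [tok for text in normalized for tok in text.split()]
    unique_instructions = Counter(normalized)
    return {
        # Distinct words divided by total words (higher = more varied).
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Share of instructions that repeat an earlier instruction.
        "duplicate_rate": 1 - len(unique_instructions) / len(normalized),
    }


stats = lexical_stats(["pick up the cup", "Pick up  the cup", "open the drawer"])
```

In this toy corpus one of three instructions is a duplicate and 6 of the 11 word tokens are distinct. Semantic similarity and syntactic complexity, the other two dimensions, would need embeddings and a parser and are out of scope for this sketch.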


[30] Self-Verification is All You Need To Pass The Japanese Bar Examination cs.CL | cs.AIPDF

Andrew Shin

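TL;DR: This paper proposes a single-model approach based on self-verification that, for the first time, enables a large language model to pass the Japanese bar examination while preserving the original question structure and scoring rules. The model is trained on a dataset that faithfully follows the exam's format and scoring scale, and it exceeds the official passing score.

Details

Motivation: LLMs are unreliable on highly specialized and structured examinations such as the Japanese bar examination, which demands complex legal reasoning and strict adherence to answer formats involving joint evaluation of multiple propositions.

Result: Under the actual exam scoring scale, the model exceeds the official passing score, which the author describes as the first demonstration of an LLM passing the Japanese bar examination without altering the original question structure or scoring rules. It also outperforms alternative strategies such as multi-agent inference and decomposition-based supervision.

Insight: The work highlights the importance of format-faithful supervision and consistency verification, and suggests that a carefully designed single-model approach can outperform more complex systems in high-stakes professional reasoning tasks.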

Abstract: Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true–false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam-level competence. In this paper, we present a self-verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format-faithful supervision and consistency verification, and suggest that carefully designed single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks. Our dataset and codes are publicly available.


[31] Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective cs.CLPDF

Beiduo Chen, Tiancheng Hu, Caiqi Zhang, Robert Litschko, Anna Korhonen

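TL;DR: Through systematic disentanglement experiments, this study examines how well reasoning-tuned large language models capture human label variation on distribution-based tasks. It finds that chain-of-thought mainly acts as a decision-maker for the top option, improving accuracy, rather than as a fine-grained distribution calibrator.

Details

Motivation: To examine the underexplored ability of chain-of-thought reasoning to capture probabilistic ambiguity (human label variation) rather than resolve it.

Result: On distribution-based tasks, CoT content dictates final accuracy (99% variance contribution), whereas model priors govern distributional ranking (over 80%).

Insight: The study reveals a decoupled mechanism between decision-making and distribution calibration: the influence of CoT on accuracy grows step by step, but distributional structure is largely determined by the model's intrinsic priors, which offers a new perspective for designing models that handle ambiguous tasks more effectively.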

Abstract: Reasoning-tuned LLMs utilizing long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation, which requires capturing probabilistic ambiguity rather than resolving it, remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of reasoning text from intrinsic model priors. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content (99% variance contribution), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while CoT's influence on accuracy grows monotonically during the reasoning process, distributional structure is largely determined by the LLM's intrinsic priors. These findings suggest that long CoT serves as a decisive LLM decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks.


[32] WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning cs.CLPDF

Yu Xinmiao, Zhang Liwen, Feng Xiaocheng, Jiang Yong, Qin Bing

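TL;DR: This paper proposes Anchor-GRPO, a two-stage reinforcement learning framework that addresses unstable planning by LLM-based agents in long-horizon web reasoning tasks. The method decouples planning from execution: it first optimizes first-step planning, then keeps subsequent execution aligned with the initial plan, improving task success and tool-use efficiency.

Details

Motivation: In long-horizon web reasoning tasks, existing RL-based agent optimization methods fail to account for the decisive influence of first-step planning on downstream behavior (the "plan anchor" phenomenon), so planning becomes a bottleneck.

Result: On four benchmarks (BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch), Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. For example, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. The method also scales well, with accuracy rising as model size and context length increase.

Insight: The core contribution is identifying and exploiting the "plan anchor" phenomenon with a two-stage RL framework (Anchor-GRPO) that decouples planning from execution. Stage 1 optimizes first-step planning using fine-grained rubrics derived from self-play experience and human calibration. Stage 2 aligns execution with the plan through sparse rewards, stabilizing long-horizon reasoning. This offers a new approach to optimizing agent planning in complex, multi-step tasks.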

Abstract: Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this because they distribute rewards uniformly across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, achieving higher accuracy as model size and context length increase.


[33] MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory cs.CLPDF

Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng

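TL;DR: MemRL is a runtime reinforcement learning framework built on episodic memory that lets LLM-driven agents keep self-evolving by retrieving and reusing past experience, without weight updates. It separates the stable reasoning of a frozen LLM from a plastic memory and uses a two-phase retrieval mechanism (semantic relevance, then learned Q-values) to select experiences, significantly outperforming existing methods on several benchmarks.

Details

Motivation: LLMs are limited in self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise, so they cannot emulate the human ability to learn new skills through episodic simulation.

Result: On HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench, MemRL significantly outperforms state-of-the-art baselines and achieves continuous runtime improvement without weight updates.

Insight: The innovations include separating stable reasoning from plastic memory to resolve the stability-plasticity dilemma, and a two-phase retrieval mechanism (semantic filtering followed by utility selection based on learned Q-values). Memory utilities are refined continuously through environmental feedback, which separates high-value strategies from noise and enables self-evolution driven by non-parametric reinforcement learning.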

Abstract: The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation: retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
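
The Two-Phase Retrieval idea described above (filter by semantic relevance, then select by learned Q-value) can be sketched with plain Python lists as embeddings. Everything here is illustrative and assumed: the `Memory` class, the cosine filter, the `k_semantic` and `k_final` parameters, and the incremental update `q <- q + lr * (reward - q)` are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    text: str
    embedding: List[float]
    q: float = 0.0  # learned utility of reusing this experience


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_phase_retrieve(query: List[float], memories: List[Memory],
                       k_semantic: int = 3, k_final: int = 1) -> List[Memory]:
    # Phase 1: keep the k_semantic memories most similar to the query.
    candidates = sorted(memories, key=lambda m: cosine(query, m.embedding),
                        reverse=True)[:k_semantic]
    # Phase 2: among those, prefer the highest learned utility.
    return sorted(candidates, key=lambda m: m.q, reverse=True)[:k_final]


def update_q(memory: Memory, reward: float, lr: float = 0.5) -> None:
    """Move the utility toward the observed reward (assumed update rule)."""
    memory.q += lr * (reward - memory.q)


store = [
    Memory("use binary search", [1.0, 0.0], q=0.2),
    Memory("use a hash map", [0.9, 0.1], q=0.8),
    Memory("unrelated cooking tip", [0.0, 1.0], q=1.0),
]
picked = two_phase_retrieve([1.0, 0.0], store, k_semantic=2, k_final=1)
```

The unrelated entry has the highest utility but is dropped in phase 1, while phase 2 prefers the semantically close entry with the better track record. No model weights change; only the stored `q` values move as feedback arrives.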


[34] X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework cs.CLPDF

Mohammad Zia Ur Rehman, Sai Kartheek Reddy Kasu, Shashivardhan Reddy Koppula, Sai Rithwik Reddy Chirra, Shwetank Shekhar Singh

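TL;DR: This paper presents X-MuTeST, a multilingual benchmark for explainable hate speech detection, together with a novel LLM-consulted explanation framework. Covering English, Hindi, and Telugu, it combines the high-level semantic reasoning of large language models with traditional attention-enhancing techniques to improve both detection accuracy and explainability.

Details

Motivation: Hate speech detection on social media faces challenges in both accuracy and explainability, and under-resourced Indic languages such as Hindi and Telugu remain underexplored.

Result: Experiments show that using human-annotated rationales during training improves both classification performance and explainability; further refining model attention with the proposed explainability method yields additional gains. Evaluation uses plausibility metrics (Token-F1, IOU-F1) and faithfulness metrics (Comprehensiveness, Sufficiency).

Insight: The contribution is a hybrid framework that combines LLM explanations with X-MuTeST explanations based on n-gram probability differences, plus a multilingual dataset with fine-grained (word-level) rationale annotations. Together they give under-resourced languages a new benchmark and an explainability-enhancing method for hate speech detection.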

Abstract: Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose a novel explainability-guided training framework, X-MuTeST (eXplainable Multilingual haTe Speech deTection), for hate speech detection that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word to justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union between LLM explanations and X-MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token-F1 and IOU-F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available on https://github.com/ziarehman30/X-MuTeST
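
As a rough illustration of how a final explanation (the union of two token sets) might be scored against human rationales, the sketch below computes a token-level F1 and a plain intersection-over-union. This is a simplified, assumed formulation: the paper's Token-F1 and IOU-F1 may be defined differently (for example at span level with a match threshold), and the token indices are made up.

```python
def token_f1(pred, gold):
    """F1 between predicted and gold rationale token positions."""
    pred, gold = set(pred), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def iou(pred, gold):
    """Intersection over union of two token-position sets."""
    pred, gold = set(pred), set(gold)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


# Hypothetical token positions flagged by each explainer.
llm_tokens = {1, 2}
ngram_tokens = {2, 5}
final_explanation = llm_tokens | ngram_tokens  # union, as in the abstract
human_rationale = {1, 2, 3}

f1_score = token_f1(final_explanation, human_rationale)
iou_score = iou(final_explanation, human_rationale)
```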


[35] UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward cs.CL | cs.AIPDF

Yile Liu, Yixian Liu, Zongwei Li, Yufei Huang, Xinhua Feng

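TL;DR: This paper proposes UltraLogic, a framework that uses code-based solving to decouple a problem's logical core from its natural-language expression, automating the generation of large-scale, high-quality, difficulty-calibrated data for general reasoning. It also introduces a Bipolar Float Reward mechanism to mitigate binary reward sparsity and the Non-negative Reward Trap, strengthening LLM reasoning.

Details

Motivation: Addresses the bottleneck LLMs face in complex general-purpose reasoning that requires multi-step logic, planning, and verification, and the lack of large-scale, high-quality, difficulty-calibrated data for general reasoning.

Result: Experiments show that task diversity is the primary driver of reasoning improvement, and that the Bipolar Float Reward combined with a difficulty matching strategy significantly improves training efficiency, guiding models toward global logical optima.

Insight: The contributions are automated high-quality data synthesis through code-based decoupling, and a Bipolar Float Reward with graded penalties that finely distinguishes perfect answers from answers with logical flaws, which trains models for complex reasoning more effectively.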

Abstract: While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
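
The abstract does not give the reward formula, so the following is only one possible reading of a bipolar, graded reward: a perfect response earns +1, and any flawed response earns a negative reward whose size shrinks as the partial score rises. The function names, the `score - 1.0` mapping, and the [0, 1] score scale are assumptions, shown next to a binary reward for contrast.

```python
def binary_reward(score):
    """Sparse baseline: only a perfect score is rewarded."""
    return 1.0 if score >= 1.0 else 0.0


def bipolar_float_reward(score):
    """Assumed graded reward: +1 for perfect, otherwise a penalty in [-1, 0)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score == 1.0:
        return 1.0
    return score - 1.0  # closer to perfect means a smaller penalty


scores = [0.0, 0.75, 1.0]
binary = [binary_reward(s) for s in scores]
bipolar = [bipolar_float_reward(s) for s in scores]
```

Under this mapping, the binary reward cannot tell a nearly correct response from a completely wrong one, while the bipolar reward orders them and still keeps every flawed response below zero.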


[36] MalruleLib: Large-Scale Executable Misconception Reasoning with Step Traces for Modeling Student Thinking in Mathematics cs.CLPDF

Xinghe Chen, Naiming Liu, Shashank Sonkar

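TL;DR: This paper presents MalruleLib, a framework that turns common misconceptions in mathematics learning into executable malrules and generates dual-path traces containing both correct and erroneous reasoning steps, for modeling students' mathematical thinking at scale.

Details

Motivation: Student mistakes in mathematics are often systematic and repeatable. The work models students' erroneous thinking with executable malrules so that educational AI can give precise diagnosis and feedback.

Result: Across nine language models (4B-120B parameters), accuracy is 66% on direct problem solving but falls to 40% on cross-template misconception prediction. Experiments with MalruleLib show cross-template degradations of 10-21%, while providing student step traces improves prediction accuracy by 3-15%.

Insight: The contribution is formalizing misconceptions from the learning sciences as an executable rule library and generating parameterized problem templates and dual-path step traces, which enables large-scale, controlled simulation and evaluation of student misconceptions and provides scalable infrastructure for educational AI.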

Abstract: Student mistakes in mathematics are often systematic: a learner applies a coherent but wrong procedure and repeats it across contexts. We introduce MalruleLib, a learning-science-grounded framework that translates documented misconceptions into executable procedures, drawing on 67 learning-science and mathematics education sources, and generates step-by-step traces of malrule-consistent student work. We formalize a core student-modeling problem as Malrule Reasoning Accuracy (MRA): infer a misconception from one worked mistake and predict the student’s next answer under cross-template rephrasing. Across nine language models (4B-120B), accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction. MalruleLib encodes 101 malrules over 498 parameterized problem templates and produces paired dual-path traces for both correct reasoning and malrule-consistent student reasoning. Because malrules are executable and templates are parameterizable, MalruleLib can generate over one million instances, enabling scalable supervision and controlled evaluation. Using MalruleLib, we observe cross-template degradations of 10-21%, while providing student step traces improves prediction by 3-15%. We release MalruleLib as infrastructure for educational AI that models student procedures across contexts, enabling diagnosis and feedback that targets the underlying misconception.
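
To make the idea of an executable malrule with dual-path traces concrete, here is a small sketch for fraction addition. The specific malrule (adding numerators and denominators separately), the function names, and the trace format are illustrative assumptions and are not taken from MalruleLib itself.

```python
from fractions import Fraction
from math import gcd


def correct_path(n1, d1, n2, d2):
    """Correct procedure: rewrite over a common denominator, then add."""
    common = d1 * d2 // gcd(d1, d2)
    s1, s2 = n1 * (common // d1), n2 * (common // d2)
    trace = [
        f"common denominator: {common}",
        f"rewrite: {s1}/{common} + {s2}/{common}",
        f"add numerators: {s1 + s2}/{common}",
    ]
    return Fraction(s1 + s2, common), trace


def malrule_path(n1, d1, n2, d2):
    """Assumed malrule: add numerators and denominators independently."""
    num, den = n1 + n2, d1 + d2
    trace = [
        f"add numerators: {n1} + {n2} = {num}",
        f"add denominators: {d1} + {d2} = {den}",
    ]
    return Fraction(num, den), trace


# One parameterized template, two instances: the same wrong procedure repeats.
right, right_trace = correct_path(1, 2, 1, 3)
wrong, wrong_trace = malrule_path(1, 2, 1, 3)
wrong_again, _ = malrule_path(1, 4, 2, 4)
```

Because both paths are ordinary functions over template parameters, new instances can be generated by changing the arguments, which is the property the abstract relies on for scalable supervision.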


[37] STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning cs.CLPDF

Juntong Ni, Shiyu Wang, Ming Jin, Qi He, Wei Jin

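TL;DR: This paper proposes STReasoner, a framework that strengthens the spatio-temporal reasoning of LLMs on time series data through spatial-aware reinforcement learning, and builds ST-Bench, a benchmark covering four core tasks: etiological reasoning, entity identification, correlation reasoning, and in-context forecasting.

Details

Motivation: Most existing work prioritizes predictive accuracy over the reasoning process, leaving spatio-temporal reasoning underdeveloped and unable to support high-stakes decision-making systems such as traffic networks, power grids, and disease propagation.

Result: STReasoner achieves average accuracy gains of 17% to 135% on ST-Bench at only 0.004X the cost of proprietary models, and generalizes robustly to real-world data.

Insight: Contributions include a network SDE-based multi-agent data synthesis pipeline for building the benchmark, and S-GRPO, a reinforcement learning algorithm that promotes spatially grounded logic by rewarding the performance gains attributable to spatial information.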

Abstract: Spatio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in-context forecasting, developed via a network SDE-based multi-agent data synthesis pipeline. We then propose STReasoner, which empowers LLMs to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S-GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.
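
The abstract says S-GRPO rewards gains attributable to spatial information but does not give the formula. The sketch below shows one assumed shaping: the score obtained with spatial input plus a bonus for the improvement over the same rollout scored without it, followed by the group-relative normalization commonly used in GRPO-style training. The `bonus` weight and both function names are hypothetical.

```python
from statistics import mean, pstdev


def spatial_gain_reward(score_with_spatial, score_without_spatial, bonus=0.5):
    """Assumed reward: task score plus a bonus for gains due to spatial input."""
    gain = max(0.0, score_with_spatial - score_without_spatial)
    return score_with_spatial + bonus * gain


def group_relative_advantages(rewards):
    """Normalize rewards within a sampled group (zero mean, unit std)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Four hypothetical rollouts: (score with graph input, score without it).
rollouts = [(1.0, 0.0), (1.0, 1.0), (0.0, 0.0), (0.0, 1.0)]
rewards = [spatial_gain_reward(w, wo) for w, wo in rollouts]
advantages = group_relative_advantages(rewards)
```

The first rollout is correct only when spatial information is available, so it receives the largest reward; the second is correct either way and gets no bonus.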


cs.CV [Back]

[38] MIAR: Modality Interaction and Alignment Representation Fuison for Multimodal Emotion cs.CV | cs.AIPDF

Jichao Zhu, Jun Yu

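TL;DR: This paper proposes MIAR (Modality Interaction and Alignment Representation), a new multimodal emotion recognition method that integrates and aligns the language, vision, and audio modalities through feature interaction and contrastive learning, addressing distributional differences between modalities and their unequal contributions. It achieves state-of-the-art performance on the CMU-MOSI and CMU-MOSEI benchmarks.

Details

Motivation: Existing multimodal emotion recognition methods, when fusing modalities, do not adequately handle the significant distributional differences among modalities, ignore each modality's differing contribution to the task, and lack robust generalization across features from different text models.

Result: Experiments on the CMU-MOSI and CMU-MOSEI benchmark datasets show that MIAR outperforms current state-of-the-art multimodal emotion recognition methods.

Insight: The contribution is generating, through feature interaction, feature tokens that represent each modality's global representation, and aligning modalities with contrastive learning and normalization strategies, which integrates cross-modal contextual information more effectively and improves generalization.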

Abstract: Multimodal Emotion Recognition (MER) aims to perceive human emotions through three modes: language, vision, and audio. Previous methods primarily focused on modal fusion without adequately addressing significant distributional differences among modalities or considering their varying contributions to the task. They also lacked robust generalization capabilities across diverse textual model features, thus limiting performance in multimodal scenarios. Therefore, we propose a novel approach called Modality Interaction and Alignment Representation (MIAR). This network integrates contextual features across different modalities through feature interaction, generating feature tokens. These four tokens represent global representations of how each modality extracts information from others. MIAR aligns different modalities using contrastive learning and normalization strategies. We conduct experiments on two benchmarks, the CMU-MOSI and CMU-MOSEI datasets, and the experimental results demonstrate that MIAR outperforms state-of-the-art MER methods.


[39] Multimodal Sentiment Analysis based on Multi-channel and Symmetric Mutual Promotion Feature Fusion cs.CV | cs.AIPDF

Wangyuan Zhu, Jun Yu

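TL;DR: This paper proposes a multimodal sentiment analysis method based on multi-channel features and symmetric mutual promotion feature fusion. It extracts multi-channel features to strengthen unimodal representations and designs a symmetric mutual promotion cross-modal fusion mechanism that promotes the exchange of useful information between modalities while accounting for both feature differences and complementarity. Experiments on benchmark datasets support the method's effectiveness and superiority.

Details

Motivation: Addresses two problems in multimodal sentiment analysis: unimodal features are limited and insufficiently rich, and most existing work focuses on inter-modal feature consistency while neglecting feature differences, which leads to inadequate fusion.

Result: Experiments on two benchmark datasets demonstrate the effectiveness and superiority of the proposed method.

Insight: The contribution is using multi-channel features to enhance unimodal representations together with a mutual promotion fusion mechanism that combines symmetric cross-modal attention with self-attention. It promotes information exchange between modalities while modeling context and accounting for feature differences, giving more thorough multimodal feature fusion.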

Abstract: Multimodal sentiment analysis is a key technology in the fields of human-computer interaction and affective computing. Accurately recognizing human emotional states is crucial for facilitating smooth communication between humans and machines. Despite some progress in multimodal sentiment analysis research, numerous challenges remain. The first challenge is the limited and insufficiently rich features extracted from single modality data. Secondly, most studies focus only on the consistency of inter-modal feature information, neglecting the differences between features, resulting in inadequate feature information fusion. In this paper, we first extract multi-channel features to obtain more comprehensive feature information. We employ dual-channel features in both the visual and auditory modalities to enhance intra-modal feature representation. Secondly, we propose a symmetric mutual promotion (SMP) inter-modal feature fusion method. This method combines symmetric cross-modal attention mechanisms and self-attention mechanisms, where the cross-modal attention mechanism captures useful information from other modalities, and the self-attention mechanism models contextual information. This approach promotes the exchange of useful information between modalities, thereby strengthening inter-modal interactions. Furthermore, we integrate intra-modal features and inter-modal fused features, fully leveraging the complementarity of inter-modal feature information while considering feature information differences. Experiments conducted on two benchmark datasets demonstrate the effectiveness and superiority of our proposed method.
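
The combination of symmetric cross-modal attention and self-attention can be sketched with plain scaled dot-product attention. This is a simplified illustration under strong assumptions: there are no learned projections, heads, residual connections, or normalization, and the ordering (cross-modal attention first, self-attention second) is a guess at the structure rather than the paper's architecture.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return out


def symmetric_mutual_promotion(a, b):
    # Cross-modal attention in both directions: each modality queries the other.
    a_from_b = attention(a, b, b)
    b_from_a = attention(b, a, a)
    # Self-attention models context within each enriched sequence.
    a_ctx = attention(a_from_b, a_from_b, a_from_b)
    b_ctx = attention(b_from_a, b_from_a, b_from_a)
    return a_ctx, b_ctx


text = [[1.0, 0.0], [0.0, 1.0]]                # 2 hypothetical text tokens
audio = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]   # 3 hypothetical audio frames
text_fused, audio_fused = symmetric_mutual_promotion(text, audio)
```

Each output sequence keeps the length of its own modality, so the fused features can still be combined with the original intra-modal features afterwards.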


[40] Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning cs.CV | cs.AIPDF

Wenting Lu, Didi Zhu, Tao Shen, Donglin Zhu, Ayong Ye

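TL;DR: This paper proposes CoCoT (Collaborative Cross-modal Chain-of-Thought), a framework that addresses two key limitations of existing cross-modal chain-of-thought methods in visual reasoning: over-reliance on a single coarse-grained image region, and semantic fragmentation between successive reasoning steps. Through dynamic multi-region grounding and relation-aware reasoning, it integrates visual and linguistic cues collaboratively, and the authors construct the CoCoT-70K dataset of 74,691 high-quality samples. Experiments show that CoCoT significantly improves complex visual reasoning on six challenging benchmarks.

Details

Motivation: Existing cross-modal chain-of-thought methods have two key problems: over-reliance on a single coarse-grained image region and semantic fragmentation between successive reasoning steps, which leave visual and linguistic cues insufficiently integrated in multimodal reasoning.

Result: Across six challenging benchmarks, CoCoT significantly improves complex visual reasoning, raising average accuracy by 15.4% on LLaVA-1.5 and by 4.0% on Qwen2-VL.

Insight: The main contributions are (1) dynamic multi-region grounding, which adaptively detects the image regions most relevant to the question, and (2) relation-aware reasoning, which iteratively aligns visual cues to form a coherent, logical chain of thought. This offers a new direction for building more robust and interpretable cross-modal reasoning systems.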

Abstract: Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.


[41] NitroGen: An Open Foundation Model for Generalist Gaming Agents cs.CV | cs.AI | cs.LGPDF

Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky

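TL;DR: NitroGen is a vision-action foundation model for generalist gaming agents. It is trained with large-scale behavior cloning on player actions extracted from 40,000 hours of gameplay video across more than 1,000 games, and its cross-game generalization is evaluated in a multi-game benchmark environment.

Details

Motivation: Addresses the weak generalization of generalist gaming agents across diverse game environments by building a large-scale video-action dataset and a unified model to improve cross-domain adaptability.

Result: On unseen games, NitroGen achieves up to 52% relative improvement in task success rate over models trained from scratch, and it performs well across diverse domains including 3D action games, 2D platformers, and procedurally generated worlds.

Insight: Contributions include automatically constructing an internet-scale video-action dataset from public gameplay videos, designing a benchmark for evaluating cross-game generalization, and large-scale behavior cloning with a unified vision-action model, providing an open dataset and model for research on generalist embodied agents.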

Abstract: We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.


[42] TAP-ViTs: Task-Adaptive Pruning for On-Device Deployment of Vision Transformers cs.CV | cs.AI | cs.LGPDF

Zhibo Wang, Zuoyuan Zhang, Xiaoyi Pang, Qile Zhang, Xuanyi Hao

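TL;DR: This paper proposes TAP-ViTs, a task-adaptive pruning framework that generates device-specific pruned Vision Transformer models for resource-constrained mobile and edge devices without access to any raw local data. It infers device-level task characteristics through a Gaussian Mixture Model-based metric dataset construction mechanism and applies a pruning strategy based on dual-granularity importance evaluation, achieving fine-grained, task-aware pruning.

Details

Motivation: Existing ViT pruning methods either produce a single model, ignoring device heterogeneity, or rely on fine-tuning with device-local data, which is often infeasible under resource and privacy constraints. A method is therefore needed for task-customized ViT pruning in privacy-preserving mobile computing settings.

Result: Extensive experiments across multiple ViT backbones and datasets show that TAP-ViTs consistently outperforms state-of-the-art pruning methods at comparable compression ratios.

Insight: The contributions are: 1) a privacy-preserving, GMM-based metric dataset construction mechanism in which each device uploads only the GMM parameters that approximate its data distribution; 2) a pruning strategy based on dual-granularity importance evaluation that jointly assesses composite neuron importance and adaptive layer importance, enabling fine-grained pruning tailored to each device's computational budget.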

Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a wide range of vision tasks, yet their substantial computational and memory demands hinder efficient deployment on resource-constrained mobile and edge devices. Pruning has emerged as a promising direction for reducing ViT complexity. However, existing approaches either (i) produce a single pruned model shared across all devices, ignoring device heterogeneity, or (ii) rely on fine-tuning with device-local data, which is often infeasible due to limited on-device resources and strict privacy constraints. As a result, current methods fall short of enabling task-customized ViT pruning in privacy-preserving mobile computing settings. This paper introduces TAP-ViTs, a novel task-adaptive pruning framework that generates device-specific pruned ViT models without requiring access to any raw local data. Specifically, to infer device-level task characteristics under privacy constraints, we propose a Gaussian Mixture Model (GMM)-based metric dataset construction mechanism. Each device fits a lightweight GMM to approximate its private data distribution and uploads only the GMM parameters. Using these parameters, the cloud selects distribution-consistent samples from public data to construct a task-representative metric dataset for each device. Based on this proxy dataset, we further develop a dual-granularity importance evaluation-based pruning strategy that jointly measures composite neuron importance and adaptive layer importance, enabling fine-grained, task-aware pruning tailored to each device’s computational budget. Extensive experiments across multiple ViT backbones and datasets demonstrate that TAP-ViTs consistently outperforms state-of-the-art pruning methods under comparable compression ratios.
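
The cloud-side step, choosing public samples that are consistent with a device's uploaded GMM, might look like the sketch below. It assumes diagonal-covariance components, low-dimensional feature vectors, and a simple top-k selection by log-likelihood; the paper's actual selection criterion and feature space are not stated in the abstract, and all names here are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of x under a diagonal-covariance Gaussian mixture."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comps.append(log_p)
    m = max(comps)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(c - m) for c in comps))


def select_metric_dataset(public_samples, gmm, k):
    """Keep the k public samples most likely under the device's GMM."""
    ranked = sorted(
        public_samples, key=lambda x: gmm_log_likelihood(x, *gmm), reverse=True
    )
    return ranked[:k]


# Parameters a device might upload: (weights, means, variances). No raw data.
device_gmm = ([0.5, 0.5], [[0.0, 0.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
public = [[0.1, -0.2], [5.2, 4.9], [2.5, 2.5], [10.0, 10.0]]
metric_dataset = select_metric_dataset(public, device_gmm, k=2)
```

Only the mixture parameters cross the network in this sketch; the selected proxy samples come entirely from the public pool.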


[43] Understanding Pure Textual Reasoning for Blind Image Quality Assessment cs.CV | cs.AIPDF

Yuan Li, Shin’ya Nishida

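TL;DR: This paper studies the role of pure textual reasoning in Blind Image Quality Assessment (BIQA) from an information-flow perspective. Comparing existing BIQA models with three paradigms that learn the image-text-score relationship, it finds that prediction performance drops significantly when only textual information is used, while the Self-Consistency paradigm effectively narrows the gap between image- and text-conditioned predictions.

Details

Motivation: To examine how much textual information contributes to BIQA and to what extent text can represent score-related image content, clarifying the role of textual reasoning in existing methods.

Result: Experiments show that existing models degrade significantly when predicting from text alone. The Chain-of-Thought paradigm brings little improvement in BIQA performance; the Self-Consistency paradigm narrows the PLCC/SRCC gap between image- and text-conditioned predictions to 0.02/0.03; the Autoencoder paradigm is less effective but reveals a direction for further optimization.

Insight: The Self-Consistency paradigm effectively bridges the gap between image and text information in quality prediction, offering a direction for improving textual reasoning in BIQA and high-level vision tasks. The limitations of pure textual reasoning also underline the need for multimodal fusion.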

Abstract: Textual reasoning has recently been widely adopted in Blind Image Quality Assessment (BIQA). However, it remains unclear how textual information contributes to quality prediction and to what extent text can represent the score-related image contents. This work addresses these questions from an information-flow perspective by comparing existing BIQA models with three paradigms designed to learn the image-text-score relationship: Chain-of-Thought, Self-Consistency, and Autoencoder. Our experiments show that the score prediction performance of the existing model significantly drops when only textual information is used for prediction. Whereas the Chain-of-Thought paradigm introduces little improvement in BIQA performance, the Self-Consistency paradigm significantly reduces the gap between image- and text-conditioned predictions, narrowing the PLCC/SRCC difference to 0.02/0.03. The Autoencoder-like paradigm is less effective in closing the image-text gap, yet it reveals a direction for further optimization. These findings provide insights into how to improve the textual reasoning for BIQA and high-level vision tasks.
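
PLCC and SRCC, the two correlation measures quoted above, are the Pearson linear correlation and the Spearman rank correlation between predicted and ground-truth quality scores. A dependency-free sketch follows; the score lists are made-up numbers chosen only to show how the two measures differ.

```python
import math
from statistics import mean


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


predicted = [1.0, 2.0, 3.0, 4.0]   # hypothetical model scores
human = [1.0, 4.0, 9.0, 16.0]      # hypothetical mean opinion scores
linear, monotonic = plcc(predicted, human), srcc(predicted, human)
```

Here the relationship is perfectly monotonic but not linear, so SRCC is 1 while PLCC is slightly lower, which is why BIQA papers usually report both.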


[44] Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative cs.CV | cs.AI | eess.IVPDF

Li Wang, Xi Chen, XiangWen Deng, HuaHui Yi, ZeKun Jiang

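TL;DR: This paper evaluates multimodal large language models (MLLMs) on knee osteoarthritis (OA) radiograph classification. For this domain-specific medical image classification task, which demands high certainty, a trained vision encoder alone achieves better classification accuracy than the full MLLM pipeline, fine-tuning the LLM brings no significant gain, and data balance matters more than data scale.

Details

Motivation: Although MLLMs show promise in medical visual question answering and report generation, it is unclear whether these generation and explanation abilities transfer reliably to disease-specific classification such as knee OA classification. Knee OA is widespread globally yet underrepresented in existing medical MLLM benchmarks.

Result: On knee OA radiograph classification, a trained vision encoder alone outperforms full MLLM pipelines in classification accuracy. Fine-tuning the LLM brings no meaningful improvement over prompt-based guidance. LoRA fine-tuning on a small, class-balanced dataset (500 images) gives better results than training on a larger but class-imbalanced dataset (5,778 images).

Insight: The contribution is a systematic ablation study that quantifies how each MLLM component (vision encoder, connector, LLM) contributes to diagnostic accuracy. The core finding is that for high-certainty medical image diagnostic classification, LLMs are better suited as interpreters and report generators than as primary classifiers. Clinically applicable systems should therefore prioritize vision encoder optimization and careful dataset curation; data quality and balance may matter more than raw scale.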

Abstract: Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component’s contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy, and fine-tuning the LLM provided no meaningful improvement over prompt-based guidance. LoRA fine-tuning on a small, class-balanced dataset (500 images) gave better results than training on a much larger but class-imbalanced set (5,778 images), indicating that data balance and quality can matter more than raw scale for this task. These findings suggest that for domain-specific medical classification, LLMs are more effective as interpreters and report generators rather than as primary classifiers. Therefore, the MLLM architecture appears less suitable for medical image diagnostic classification tasks that demand high certainty. We recommend prioritizing vision encoder optimization and careful dataset curation when developing clinically applicable systems.
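
A class-balanced subset of the kind compared above can be drawn by sampling the same number of images from every class. The sketch below is an assumed procedure, not the authors' pipeline: the label values, file names, and per-class count are placeholders.

```python
import random
from collections import Counter, defaultdict


def class_balanced_subset(samples, per_class, seed=0):
    """Sample exactly per_class items from each label, reproducibly."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < per_class:
            raise ValueError(f"label {label!r} has only {len(items)} samples")
        subset.extend((item, label) for item in rng.sample(items, per_class))
    return subset


# Hypothetical imbalanced pool: 50 / 10 / 5 images across three grades.
pool = (
    [(f"img_a{i}.png", 0) for i in range(50)]
    + [(f"img_b{i}.png", 1) for i in range(10)]
    + [(f"img_c{i}.png", 2) for i in range(5)]
)
balanced = class_balanced_subset(pool, per_class=5)
label_counts = Counter(label for _, label in balanced)
```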


[45] A Spatio-Temporal Deep Learning Approach For High-Resolution Gridded Monsoon Prediction cs.CV | cs.LGPDF

Parashjyoti Borah, Sanghamitra Sarkar, Ranjan Phukan

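TL;DR: This paper proposes a novel deep learning framework that reframes gridded Indian Summer Monsoon prediction as a spatio-temporal computer vision task. It treats pre-monsoon multi-variable atmospheric and oceanic fields as a sequence of multi-channel images and uses a convolutional neural network to learn the complex mapping from 85 years of ERA5 reanalysis data and IMD rainfall data, producing high-resolution gridded rainfall forecasts, including distinct forecasts for each of the four monsoon months and for the total seasonal average.

Details

Motivation: Traditional long-range forecasting mainly predicts a single spatially averaged seasonal value and lacks the spatial detail essential for regional resource management. This work addresses that gap by providing spatially detailed, high-resolution forecasts for regional-level resource management.

Result: The method produces distinct forecasts for each of the four monsoon months and for the total seasonal average, demonstrating its utility for both intra-seasonal and seasonal outlooks.

Insight: The core contribution is recasting gridded monsoon prediction as a spatio-temporal computer vision task, treating multi-variable meteorological fields as a video-like sequence of multi-channel images and modeling them with a CNN architecture. This offers a new deep learning paradigm for high-resolution climate prediction, and the idea of converting a complex spatio-temporal sequence prediction problem into a computer vision problem is transferable.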

Abstract: The Indian Summer Monsoon (ISM) is a critical climate phenomenon, fundamentally impacting the agriculture, economy, and water security of over a billion people. Traditional long-range forecasting, whether statistical or dynamical, has predominantly focused on predicting a single, spatially-averaged seasonal value, lacking the spatial detail essential for regional-level resource management. To address this gap, we introduce a novel deep learning framework that reframes gridded monsoon prediction as a spatio-temporal computer vision task. We treat multi-variable, pre-monsoon atmospheric and oceanic fields as a sequence of multi-channel images, effectively creating a video-like input tensor. Using 85 years of ERA5 reanalysis data for predictors and IMD rainfall data for targets, we employ a Convolutional Neural Network (CNN)-based architecture to learn the complex mapping from the five-month pre-monsoon period (January-May) to a high-resolution gridded rainfall pattern for the subsequent monsoon season. Our framework successfully produces distinct forecasts for each of the four monsoon months (June-September) as well as the total seasonal average, demonstrating its utility for both intra-seasonal and seasonal outlooks.


[46] PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding cs.CVPDF

Souhail Hadgi, Bingchen Gong, Ramana Sundararaman, Emery Pierson, Lei Li

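TL;DR: This paper proposes PatchAlign3D, an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. With two-stage pre-training (distilling 2D visual features into 3D patches, then aligning patch embeddings with part-level text embeddings through a multi-positive contrastive objective), it achieves zero-shot 3D part segmentation with fast single-pass inference and no test-time multi-view rendering.

Details

Motivation: Current 3D foundation models perform well on global tasks such as retrieval and classification but transfer poorly to local part-level reasoning. Existing approaches based on multi-view rendering and text queries are expensive at inference, depend heavily on LLM prompt engineering, and fail to exploit the inherent 3D geometry of shapes.

Result: The method achieves zero-shot 3D part segmentation and significantly outperforms previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks.

Insight: The contribution is an encoder model that learns language-aligned patch-level features directly from point clouds. Its two-stage pre-training strategy (feature distillation and contrastive alignment) combines 2D visual priors with 3D geometry, avoiding expensive multi-view rendering and complex LLM prompt engineering and enabling efficient single-pass inference.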

Abstract: Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks. Project website: https://souhail-hadgi.github.io/patchalign3dsite/
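
The abstract names a multi-positive contrastive objective without giving its form. One common formulation, used here purely as an assumption, averages the InfoNCE log-probability over all positives of an anchor. The similarity values, temperature, and function name are illustrative.

```python
import math


def multi_positive_contrastive_loss(sims, positive_idx, temperature=0.1):
    """Average negative log-softmax probability over an anchor's positives.

    sims: similarities between one patch embedding and every text embedding.
    positive_idx: indices of the text embeddings that match this patch.
    """
    logits = [s / temperature for s in sims]
    m = max(logits)  # log-sum-exp for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(logits[i] - log_denom for i in positive_idx) / len(positive_idx)


# One patch compared with four hypothetical part-level captions.
sims = [0.9, 0.8, 0.1, 0.0]
aligned = multi_positive_contrastive_loss(sims, positive_idx=[0, 1])
misaligned = multi_positive_contrastive_loss(sims, positive_idx=[2, 3])
```

The loss is small when the captions that truly describe the patch are also the most similar ones, and large when the matching captions score low.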


[47] CT Scans As Video: Efficient Intracranial Hemorrhage Detection Using Multi-Object Tracking cs.CVPDF

Amirreza Parvahan, Mohammad Hoseyni, Javad Khoramdel, Amirhossein Nikoofard

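TL;DR: This paper proposes a lightweight computer vision framework that treats 3D CT scan data as a video sequence to detect intracranial hemorrhage (ICH) efficiently. It combines YOLO-family detectors with the ByteTrack multi-object tracking algorithm and introduces a hybrid inference strategy and a spatiotemporal consistency filter to improve detection precision.

Details

Motivation: Running 3D convolutional neural networks (3D CNNs) for medical image analysis on edge devices demands substantial memory and compute. The work aims to provide a scalable solution for real-time patient prioritization in resource-constrained settings such as mobile stroke units.

Result: On independent test data from the Hemorica dataset, the framework raises detection Precision from 0.703 to 0.779 over the baseline 2D detector while maintaining high sensitivity.

Insight: The contribution is reformulating volumetric CT data as a sequential video stream, which approximates 3D contextual reasoning at a much lower computational cost. A multi-object tracker (ByteTrack) enforces anatomical consistency, and the hybrid inference and filtering strategies mitigate the initialization lag of video trackers, separating true pathology from transient prediction noise.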

Abstract: Automated analysis of volumetric medical imaging on edge devices is severely constrained by the high memory and computational demands of 3D Convolutional Neural Networks (CNNs). This paper develops a lightweight computer vision framework that reconciles the efficiency of 2D detection with the necessity of 3D context by reformulating volumetric Computed Tomography (CT) data as sequential video streams. This video-viewpoint paradigm is applied to the time-sensitive task of Intracranial Hemorrhage (ICH) detection using the Hemorica dataset. To ensure operational efficiency, we benchmarked multiple generations of the YOLO architecture (v8, v10, v11 and v12) in their Nano configurations, selecting the version with the highest mAP@50 to serve as the slice-level backbone. A ByteTrack algorithm is then introduced to enforce anatomical consistency across the $z$-axis. To address the initialization lag inherent in video trackers, a hybrid inference strategy and a spatiotemporal consistency filter are proposed to distinguish true pathology from transient prediction noise. Experimental results on independent test data demonstrate that the proposed framework serves as a rigorous temporal validator, increasing detection Precision from 0.703 to 0.779 compared to the baseline 2D detector, while maintaining high sensitivity. By approximating 3D contextual reasoning at a fraction of the computational cost, this method provides a scalable solution for real-time patient prioritization in resource-constrained environments, such as mobile stroke units and IoT-enabled remote clinics.
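
A spatiotemporal consistency filter of the kind described could, under simple assumptions, keep only tracked detections that persist across several adjacent slices. The sketch below works on tracker output that is already available as dictionaries; the field names, the `min_slices` threshold, and the consecutive-run criterion are hypothetical and do not reproduce the paper's filter or the ByteTrack API.

```python
def longest_consecutive_run(slice_indices):
    """Length of the longest run of adjacent slice indices."""
    best = run = 0
    prev = None
    for z in sorted(set(slice_indices)):
        run = run + 1 if prev is not None and z == prev + 1 else 1
        best = max(best, run)
        prev = z
    return best


def filter_tracks(detections, min_slices=3):
    """Drop detections whose track never spans min_slices adjacent slices."""
    by_track = {}
    for det in detections:
        by_track.setdefault(det["track_id"], []).append(det["slice"])
    keep = {
        track for track, zs in by_track.items()
        if longest_consecutive_run(zs) >= min_slices
    }
    return [d for d in detections if d["track_id"] in keep]


detections = (
    [{"track_id": 1, "slice": z} for z in (10, 11, 12, 13)]  # persistent lesion
    + [{"track_id": 2, "slice": 20}]                         # single-slice blip
    + [{"track_id": 3, "slice": z} for z in (5, 7, 9)]       # never adjacent
)
validated = filter_tracks(detections, min_slices=3)
```

Removing short-lived tracks trades a little recall on very small findings for fewer false positives, which matches the precision gain the abstract reports.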


[48] MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark cs.CVPDF

Shaden Shaar, Bradon Thymes, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan

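TL;DR: This paper introduces MovieRecapsQA, a new open-ended multimodal video question-answering benchmark built from movie recap videos. It contains about 8.2K question-answer pairs aligned with movie subtitles and supplies textual facts for reference-free evaluation, supporting fine-grained analysis by video length and question type. The authors evaluate seven state-of-the-art multimodal LLMs and find that visual questions are the most challenging, that models tend to rely on textual input, and that all models struggle to extract accurate factual information from video.

Details

Motivation: Existing VideoQA benchmarks struggle to capture the multimodal reasoning that real movie understanding requires, and most are not open-ended because free-form answers are hard to evaluate.

Result: Seven state-of-the-art MLLMs were evaluated on MovieRecapsQA. The results show that visual-only questions are the most challenging; models default to textual input whenever it is available; extracting accurate factual information from video remains difficult for all models; and proprietary and open-source models perform comparably on video-dependent questions.

Insight: The contribution is using movie recap videos (synchronized visual and textual summaries) to build what the authors describe as the first open-ended VideoQA benchmark that supplies explicit textual context for the input, supporting reference-free evaluation and fine-grained analysis. The construction method yields a verifiable dataset that is closer to real-world scenarios for evaluating multimodal reasoning.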

Abstract: Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce a novel open-ended multi-modal VideoQA benchmark, MovieRecapsQA, created using movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate $\approx 8.2$K question-answer (QA) pairs (aligned with movie-subtitles) and provide the necessary “facts” needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis. We evaluate the performance of seven state-of-the-art MLLMs using our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.


[49] Shallow- and Deep-fake Image Manipulation Localization Using Vision Mamba and Guided Graph Neural Network cs.CVPDF

Junbin Zhang, Hamid Reza Tohidypour, Yixiao Wang, Panos Nasiopoulos

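TL;DR: This paper proposes a deep learning method that combines a Vision Mamba network with a Guided Graph Neural Network (G-GNN) to localize manipulated regions in images produced either by traditional image editing tools (shallowfakes) or by advanced AI techniques (deepfakes). The method aims to separate authentic pixels from manipulated ones precisely and achieves higher inference accuracy than existing methods in the evaluation.

Details

Motivation: Most existing studies address manipulation localization only for shallowfake images or only for deepfake videos; a unified method that handles manipulation localization for both shallowfake and deepfake images is lacking.

Result: The evaluation shows that the proposed method achieves higher inference accuracy than other state-of-the-art (SOTA) methods.

Insight: The main contributions are using the Vision Mamba network to extract feature maps that clearly describe the boundaries between tampered and untouched regions, and a novel Guided Graph Neural Network (G-GNN) module that further amplifies the difference between manipulated and authentic pixels. Combining Vision Mamba's global modeling with G-GNN's structured relational reasoning is a promising direction for handling complex manipulation patterns.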

Abstract: Image manipulation localization is a critical research task, given that forged images may have a significant societal impact in various aspects. Such image manipulations can be produced using traditional image editing tools (known as “shallowfakes”) or advanced artificial intelligence techniques (“deepfakes”). While numerous studies have focused on image manipulation localization on either shallowfake images or deepfake videos, few approaches address both cases. In this paper, we explore the feasibility of using a deep learning network to localize manipulations in both shallow- and deep-fake images, and propose a solution for this purpose. To precisely differentiate between authentic and manipulated pixels, we leverage the Vision Mamba network to extract feature maps that clearly describe the boundaries between tampered and untouched regions. To further enhance this separation, we propose a novel Guided Graph Neural Network (G-GNN) module that amplifies the distinction between manipulated and authentic pixels. Our evaluation results show that our proposed method achieved higher inference accuracy compared to other state-of-the-art methods.


[50] DreamLoop: Controllable Cinemagraph Generation from a Single Photograph cs.CV | cs.AIPDF

Aniruddha Mahapatra, Long Mai, Cusuh Ham, Feng Liu

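TL;DR: DreamLoop is a controllable video synthesis framework dedicated to generating cinemagraphs from a single photograph, without requiring any cinemagraph training data. It adapts a general video diffusion model with two training objectives, temporal bridging and motion conditioning, to enable flexible cinemagraph generation. At inference, using the input image as both the first- and last-frame condition enforces a seamless loop, static tracks keep the background still, and a user-specified motion path controls the trajectory and timing of the target object's animation.

Details

Motivation: Existing image-animation techniques are limited to simple, low-frequency motion and work only in narrow domains with repetitive textures such as water and smoke. Large-scale video diffusion models are not tailored to cinemagraph constraints and lack the specialized data needed to generate seamless, controllable loops. The paper therefore addresses the challenge of generating cinemagraphs from a single photograph in a controllable manner.

Result: The paper shows that DreamLoop produces high-quality, complex cinemagraphs that align with user intent and outperforms existing approaches. The abstract gives no specific benchmarks or quantitative results, but it claims flexible, intuitive control in general scenes and describes DreamLoop as the first method of its kind.

Insight: Contributions include adapting a general video diffusion model through temporal bridging and motion conditioning objectives, with no need for dedicated cinemagraph data, and, at inference, enforcing a seamless loop by conditioning on the input image as both first and last frame while combining static tracks with user-specified motion paths for controllable animation. The method ties a video diffusion model to cinemagraph-specific constraints (looping and a static background), giving a data-efficient and user-friendly solution.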

Abstract: Cinemagraphs, which combine static photographs with selective, looping motion, offer unique artistic appeal. Generating them from a single photograph in a controllable manner is particularly challenging. Existing image-animation techniques are restricted to simple, low-frequency motions and operate only in narrow domains with repetitive textures like water and smoke. In contrast, large-scale video diffusion models are not tailored for cinemagraph constraints and lack the specialized data required to generate seamless, controlled loops. We present DreamLoop, a controllable video synthesis framework dedicated to generating cinemagraphs from a single photo without requiring any cinemagraph training data. Our key idea is to adapt a general video diffusion model by training it on two objectives: temporal bridging and motion conditioning. This strategy enables flexible cinemagraph generation. During inference, by using the input image as both the first- and last-frame condition, we enforce a seamless loop. By conditioning on static tracks, we maintain a static background. Finally, by providing a user-specified motion path for a target object, our method provides intuitive control over the animation’s trajectory and timing. To our knowledge, DreamLoop is the first method to enable cinemagraph generation for general scenes with flexible and intuitive controls. We demonstrate that our method produces high-quality, complex cinemagraphs that align with user intent, outperforming existing approaches.
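
The three inference-time conditions in the abstract (the photo as first and last frame, static background tracks, a user motion path) can be pictured with a small sketch. This is not the authors' code: the function name `build_loop_conditions`, the dictionary layout, and the piecewise-linear resampling of the motion path are all assumptions made for illustration.

```python
def build_loop_conditions(num_frames, static_points, motion_path):
    """Assemble hypothetical conditioning signals for a looping clip."""
    if num_frames < 2 or len(motion_path) < 2:
        raise ValueError("need at least two frames and two path points")
    # The input photo conditions both ends, so the clip closes into a loop.
    image_frames = [0, num_frames - 1]
    # Static tracks: each background point keeps one position in every frame.
    static_tracks = [[p] * num_frames for p in static_points]
    # Motion track: resample the user path to one position per frame
    # (linear interpolation is an assumption, not taken from the paper).
    segments = len(motion_path) - 1
    motion_track = []
    for t in range(num_frames):
        pos = t / (num_frames - 1) * segments
        i = min(int(pos), segments - 1)
        frac = pos - i
        (x0, y0), (x1, y1) = motion_path[i], motion_path[i + 1]
        motion_track.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return {
        "image_frames": image_frames,
        "static_tracks": static_tracks,
        "motion_track": motion_track,
    }
```

In this sketch, a motion path that ends where it starts keeps the animated object consistent with the identical first and last frames.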


[51] CAMO: Category-Agnostic 3D Motion Transfer from Monocular 2D Videos cs.CVPDF

Taeyeon Kim, Youngju Na, Jumin Lee, Minhyuk Sung, Sung-Eui Yoon

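TL;DR: CAMO is a category-agnostic 3D motion transfer framework that transfers motion to diverse target meshes directly from monocular 2D videos, without predefined parametric templates or explicit 3D supervision. Its core is a morphology-parameterized articulated 3D Gaussian splatting model combined with dense semantic correspondences, which jointly adapts shape and pose through optimization and effectively alleviates shape-pose ambiguity.

Details

Motivation: Transferring motion from 2D videos to 3D assets is difficult because of pose ambiguity and diverse object shapes, so traditional methods often require category-specific parametric templates. CAMO aims for general, template-free transfer.

Result: Experiments show that CAMO outperforms existing methods in motion accuracy, efficiency, and visual coherence, and significantly advances motion transfer across diverse object categories and casual video scenarios.

Insight: The novelty is the joint optimization of a morphology-parameterized articulated 3D Gaussian splatting model with dense semantic correspondences, which alleviates shape-pose ambiguity in a category-agnostic way and yields visually faithful, efficient motion transfer.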

Abstract: Motion transfer from 2D videos to 3D assets is a challenging problem, due to inherent pose ambiguities and diverse object shapes, often requiring category-specific parametric templates. We propose CAMO, a category-agnostic framework that transfers motion to diverse target meshes directly from monocular 2D videos without relying on predefined templates or explicit 3D supervision. The core of CAMO is a morphology-parameterized articulated 3D Gaussian splatting model combined with dense semantic correspondences to jointly adapt shape and pose through optimization. This approach effectively alleviates shape-pose ambiguities, enabling visually faithful motion transfer for diverse categories. Experimental results demonstrate superior motion accuracy, efficiency, and visual coherence compared to existing methods, significantly advancing motion transfer in varied object categories and casual video scenarios.


[52] HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps cs.CVPDF

Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang

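TL;DR: This paper proposes HOLO, a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. It projects ground-view features into the bird's-eye-view (BEV) domain and semantically aligns them with map features to build input pairs that satisfy a homography constraint, then uses the homography relationship to guide feature fusion and to restrict pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy.

Details

Motivation: Existing regression-based visual localization methods often overlook inherent geometric priors, which leads to poor training efficiency and limited localization accuracy. The paper introduces a homography constraint to exploit these priors and improve both efficiency and accuracy of visual localization on SD maps.

Result: Extensive experiments on the nuScenes dataset show that the method significantly outperforms existing state-of-the-art visual localization methods.

Insight: The paper claims to be the first to unify BEV semantic reasoning with homography learning for image-to-map localization. Because it models homography transformations explicitly, the framework naturally supports cross-resolution inputs, which makes the model more flexible. Viewed objectively, guiding feature fusion and pose regression with geometric constraints is an effective way to turn prior knowledge into better performance.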

Abstract: Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.
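
To make the link between a 3-DoF pose and a homography concrete, here is a minimal sketch. It relies on one known fact: a planar rigid motion (translation plus yaw) is a special case of a 3x3 homography. The function names, the bounds in `clamp_pose`, and the use of clamping as the "feasible region" are assumptions for illustration, not the paper's network or API.

```python
import math

def pose_to_homography(tx, ty, yaw):
    """3x3 matrix for a planar rigid motion on the BEV plane."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def apply_homography(H, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def clamp_pose(tx, ty, yaw, max_shift, max_yaw):
    """Hypothetical feasible region: a box on translation and yaw."""
    clip = lambda value, bound: max(-bound, min(bound, value))
    return (clip(tx, max_shift), clip(ty, max_shift), clip(yaw, max_yaw))
```

For example, `apply_homography(pose_to_homography(1.0, 2.0, math.pi / 2), (1.0, 0.0))` rotates the point by 90 degrees and then shifts it, landing near `(1.0, 3.0)`.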


[53] Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench cs.CVPDF

Zanting Ye, Xiaolong Niu, Xuanbin Wu, Xu Han, Shengyuan Liu

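TL;DR: This paper exposes a functional perception gap in multimodal large language models (MLLMs) for functional imaging, specifically PET: the models struggle to decode tracer biodistribution independently of morphological priors. The authors build PET-Bench, the first large-scale functional imaging benchmark, and find that standard Chain-of-Thought (CoT) prompting triggers clinically fluent but factually wrong diagnostic hallucinations. To address this, they propose Atomic Visual Alignment (AVA), a fine-tuning strategy that bridges the perception gap and turns CoT into a robust reasoning tool.

Details

Motivation: Current MLLMs perform well on anatomical modalities, but their ability in functional imaging such as PET is largely unexplored. The paper sets out to identify and quantify a fundamental functional perception gap: existing vision encoders cannot decode functional tracer biodistribution independently of morphological priors.

Result: An extensive evaluation of 19 SOTA MLLMs on PET-Bench, a large-scale benchmark with 52,308 hierarchical QA pairs, reveals the CoT hallucination trap. The proposed AVA bridges the perception gap, turns CoT from a source of hallucination into a robust inference tool, and improves diagnostic accuracy by up to 14.83%.

Insight: The paper is the first to systematically identify and quantify the functional perception gap of MLLMs in functional imaging, and it builds PET-Bench, the first large-scale functional imaging benchmark, for hierarchical evaluation. AVA, which enforces mastery of low-level functional perception before high-level diagnostic reasoning, is a simple and effective fix. The work shows that standard CoT prompting can backfire in functional imaging and that targeted visual alignment is needed to keep reasoning visually grounded.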

Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.


[54] ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration cs.CVPDF

Xu Zhang, Huan Zhang, Guoli Wang, Qian Zhang, Lefei Zhang

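TL;DR: This paper proposes ClearAIR, an All-in-One image restoration framework inspired by human visual perception. It follows a hierarchical, coarse-to-fine strategy: it first performs an overall evaluation with an image quality assessment model based on a multimodal large language model, then restores local regions through a region awareness and task recognition pipeline, and finally recovers details with a self-supervised internal clue reuse mechanism. Experiments show superior performance on diverse synthetic and real-world datasets.

Details

Motivation: Existing All-in-One image restoration methods rely too heavily on degradation-specific representations, which leads to oversmoothing and artifacts.

Result: The method achieves superior performance across diverse synthetic and real-world datasets.

Insight: The contributions are a hierarchical restoration strategy inspired by human visual perception; an MLLM-based image quality assessment that uses cross-modal understanding to characterize composite degradations more accurately; local restoration that combines semantically guided region awareness with a degradation-aware module; and a self-supervised internal clue reuse mechanism that improves detail recovery.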

Abstract: All-in-One Image Restoration (AiOIR) has advanced significantly, offering promising solutions for complex real-world degradations. However, most existing approaches rely heavily on degradation-specific representations, often resulting in oversmoothing and artifacts. To address this, we propose ClearAIR, a novel AiOIR framework inspired by Human Visual Perception (HVP) and designed with a hierarchical, coarse-to-fine restoration strategy. First, leveraging the global priority of early HVP, we employ a Multimodal Large Language Model (MLLM)-based Image Quality Assessment (IQA) model for overall evaluation. Unlike conventional IQA, our method integrates cross-modal understanding to more accurately characterize complex, composite degradations. Building upon this overall assessment, we then introduce a region awareness and task recognition pipeline. A semantic guidance unit, leveraging semantic cross-attention, first produces coarse semantic prompts. Guided by this regional context, a degradation-aware module implicitly captures region-specific degradation characteristics, enabling more precise local restoration. Finally, to recover fine details, we propose an internal clue reuse mechanism. It operates in a self-supervised manner to mine and leverage the intrinsic information of the image itself, substantially enhancing detail restoration. Experimental results show that ClearAIR achieves superior performance across diverse synthetic and real-world datasets.


[55] AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs cs.CVPDF

Boyu Chang, Qi Wang, Xi Guo, Zhixiong Nan, Yazhou Yao

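TL;DR: This paper proposes AbductiveMLLM, which strengthens the visual abductive reasoning of multimodal large language models (MLLMs) by mimicking the interplay between verbal and pictorial abduction in human cognition. It has two synergistic components: the REASONER explores hypotheses in the verbal domain and keeps the visually consistent ones as priors, while the IMAGINER uses a diffusion model to generate visual scenes that match the explanation and enrich the context. The model is trained end to end and reaches SOTA performance on standard VAR benchmarks.

Details

Motivation: Existing MLLMs have general-purpose multimodal reasoning ability but still lag behind humans on visual abductive reasoning (VAR). The paper draws on the dual-mode interaction in human cognition to improve their abductive inference.

Result: On standard VAR benchmarks, AbductiveMLLM achieves state-of-the-art performance and consistently outperforms traditional solutions and advanced MLLMs.

Insight: The novelty is an abduction mechanism in which the verbal and pictorial modes cooperate: the REASONER filters hypotheses through cross-modal causal alignment, and the IMAGINER performs visual imagination with a text-to-image diffusion model. Together they improve the causal coherence and contextual grounding of MLLMs. This dual-mode design, inspired by human cognition, suggests a new way to improve complex reasoning in MLLMs.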

Abstract: Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs develop strong general-purpose multimodal reasoning capabilities, they fall short in abductive inference, as compared to human beings. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition, and propose to strengthen abduction of MLLMs by mimicking such dual-mode behavior. Concretely, we introduce AbductiveMLLM, comprising two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain. It first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are introduced into the MLLM as targeted priors, steering its reasoning toward causally coherent explanations. The IMAGINER, on the other hand, further guides MLLMs by emulating human-like pictorial thinking. It conditions a text-to-image diffusion model on both the input video and the REASONER’s output embeddings to “imagine” plausible visual scenes that correspond to the verbal explanation, thereby enriching MLLMs’ contextual grounding. The two components are trained jointly in an end-to-end manner. Experiments on standard VAR benchmarks show that AbductiveMLLM achieves state-of-the-art performance, consistently outperforming traditional solutions and advanced MLLMs.


[56] EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework cs.CVPDF

Junjue Wang, Yanfei Zhong, Zihang Chen, Zhuo Zheng, Ailong Ma

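TL;DR: This paper proposes EarthVL, a progressive Earth vision-language understanding and generation framework that consists of a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). It targets the lack of object-relational reasoning in Earth vision in order to improve comprehensive understanding of urban scenes.

Details

Motivation: Earth vision has reached milestones in object recognition but has barely explored object-relational reasoning, which limits comprehensive scene understanding. The paper advances this area by combining vision and language tasks.

Result: On three benchmarks (semantic segmentation, multiple-choice VQA, and open-ended VQA), EarthVLNet shows superior performance. The experiments confirm that segmentation features improve VQA performance, and they show that multiple-choice tasks are more sensitive to the vision encoder, while open-ended tasks need both advanced vision encoders and advanced language decoders for the best results.

Insight: The contributions are an object-centric progressive network that uses semantic segmentation to guide relational reasoning and knowledge summarization, and a numerical difference loss that dynamically handles the statistics of different objects. Together they provide a benchmark that connects "image-mask-text" for geographical applications.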

Abstract: Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object-relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision-language understanding and generation framework is proposed, including a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs involving both multiple-choice and open-ended visual question answering (VQA) tasks. In an object-centric way, EarthVLNet is proposed to progressively achieve semantic segmentation, relational reasoning, and comprehensive understanding. The first stage involves land-cover segmentation to generate object semantics for VQA guidance. Guided by pixel-wise semantics, the object-awareness-based large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. As for optimization, the numerical difference loss is proposed to dynamically add difference penalties, addressing the various objects’ statistics. Three benchmarks, including semantic segmentation, multiple-choice VQA, and open-ended VQA, demonstrate the superiority of EarthVLNet, yielding three future directions: 1) segmentation features consistently enhance VQA performance even in cross-dataset scenarios; 2) multiple-choice tasks show greater sensitivity to the vision encoder than to the language decoder; and 3) open-ended tasks necessitate advanced vision encoders and language decoders for optimal performance. We believe this dataset and method will provide a beneficial benchmark that connects “image-mask-text”, advancing geographical applications for Earth vision.


[57] DreamStyle: A Unified Framework for Video Stylization cs.CVPDF

Mengtian Li, Jinshu Chen, Songtao Zhao, Wanquan Feng, Pengqi Tu

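TL;DR: This paper proposes DreamStyle, a unified video stylization framework that supports three kinds of conditioning (text-guided, style-image-guided, and first-frame-guided) and obtains high-quality paired video data through a carefully designed data curation pipeline. It builds on a vanilla image-to-video model and is trained with a Low-Rank Adaptation that uses token-specific up matrices to reduce confusion among different condition tokens. Qualitative and quantitative evaluations show that DreamStyle performs well on all three tasks and surpasses existing methods in style consistency and video quality.

Details

Motivation: Existing video stylization methods are usually confined to a single type of style condition (text, style image, or stylized first frame), which limits their scope of application, and the lack of high-quality datasets leads to style inconsistency and temporal flicker.

Result: Qualitative and quantitative evaluations show that DreamStyle is competent in text-guided, style-image-guided, and first-frame-guided video stylization, and that it outperforms competitors in style consistency and video quality.

Insight: The main contributions are a unified framework that accepts multiple style conditions and a pipeline for acquiring high-quality data. The technical novelty is LoRA training with token-specific up matrices, which reduces confusion among different condition tokens and offers a reusable idea for training multi-condition generative models.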

Abstract: Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, style image, and stylized first frame. Each condition has a characteristic advantage: text is more flexible, style image provides a more accurate visual anchor, and stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, their lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization, supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline to acquire high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using a Low-Rank Adaptation (LoRA) with token-specific up matrices that reduce the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks, and outperforms the competitors in style consistency and video quality.
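
One way to read "LoRA with token-specific up matrices" is a low-rank update whose down projection is shared while the up projection is chosen by the token's condition type. The sketch below illustrates that reading only. The shared down matrix, the type names `"text"`, `"style_image"`, and `"first_frame"`, and all shapes and values are assumptions, not details from the paper.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_delta(token, token_type, down, ups):
    """Low-rank update: ups[token_type] @ (down @ token)."""
    hidden = matvec(down, token)            # project to the low rank
    return matvec(ups[token_type], hidden)  # type-specific projection back

# Toy setup: model width 3, rank 2, one up matrix per condition type.
down = [[1, 0, 0],
        [0, 1, 0]]
ups = {
    "text":        [[1, 0], [0, 1], [0, 0]],
    "style_image": [[0, 0], [0, 0], [1, 1]],
    "first_frame": [[2, 0], [0, 0], [0, 0]],
}

token = [1, 2, 3]
text_update = lora_delta(token, "text", down, ups)
style_update = lora_delta(token, "style_image", down, ups)
```

Because each condition type writes through its own up matrix, the same token content yields different updates depending on its type, which is the separation the abstract credits with reducing confusion among condition tokens.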


[58] StableDPT: Temporal Stable Monocular Video Depth Estimation cs.CVPDF

Ivan Sobko, Hayko Riemenschneider, Markus Gross, Christopher Schroers

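TL;DR: This paper proposes StableDPT, a method that addresses the temporal instability and flickering artifacts that appear when monocular depth estimation models are applied to video sequences. It adapts any state-of-the-art image-based depth estimation model to video by integrating a temporal module that can be trained quickly on a single GPU.

Details

Motivation: Existing monocular depth estimation models produce temporally inconsistent, flickering results on video sequences. The goal is to improve the temporal stability of video depth estimation.

Result: Evaluations on multiple benchmark datasets show that the method significantly improves temporal consistency while keeping competitive state-of-the-art performance, and that it processes real-world scenarios 2x faster.

Insight: The contributions are: 1) temporal layers in the DPT head, based on an efficient cross-attention mechanism, that integrate information from keyframes sampled across the entire video to capture global context and inter-frame relationships; 2) a novel inference strategy that handles videos of arbitrary length and avoids the scale misalignment and redundant computation caused by the overlapping windows used in other methods.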

Abstract: Applying single-image Monocular Depth Estimation (MDE) models to video sequences introduces significant temporal instability and flickering artifacts. We propose a novel approach that adapts any state-of-the-art image-based (depth) estimation model for video processing by integrating a new temporal module, trainable on a single GPU in a few days. Our architecture StableDPT builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. The core of our contribution lies in the temporal layers within the head, which use an efficient cross-attention mechanism to integrate information from keyframes sampled across the entire video sequence. This allows the model to capture global context and inter-frame relationships, leading to more accurate and temporally stable depth predictions. Furthermore, we propose a novel inference strategy for processing videos of arbitrary length, avoiding the scale misalignment and redundant computations associated with overlapping windows used in other methods. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and, on top of that, 2x faster processing in real-world scenarios.
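
The idea of letting each frame attend to keyframes drawn from the whole video can be sketched in a few lines. This is an illustration under stated assumptions, not the StableDPT architecture: uniform keyframe spacing, a single attention head, and the function names are all hypothetical; only the scaled dot-product attention formula itself is standard.

```python
import math

def sample_keyframes(num_frames, num_keyframes):
    """Pick indices spread over the whole video (uniform spacing assumed)."""
    if num_keyframes >= num_frames:
        return list(range(num_frames))
    if num_keyframes == 1:
        return [0]
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                      # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

Here the query would come from the current frame's features and the keys and values from the sampled keyframes, so every frame sees the same global context regardless of video length.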


[59] SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models cs.CVPDF

Ruiyang Zhang, Dongzhan Zhou, Zhedong Zheng

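TL;DR: This paper proposes SketchThinker-R1, which aims to make sketch-style reasoning in large multimodal models more efficient. The method has three stages: a Sketch-Mode Cold Start, training of the SketchJudge reward model, and Sketch-Thinking Reinforcement Learning. It substantially reduces the computational overhead of reasoning while preserving answer accuracy.

Details

Motivation: The step-by-step reasoning widely used in large multimodal models carries high computational overhead, such as higher token cost and longer response time. Inspired by the efficient sketch-style reasoning of humans, the paper aims to improve inference efficiency.

Result: Experiments on four benchmarks show that SketchThinker-R1 cuts reasoning token cost by more than 64% without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on the key cues for solving the problem.

Insight: The novelty is bringing the cognitive efficiency of human sketch-style reasoning into large multimodal models: a dedicated reward model and a reinforcement learning framework incentivize concise, goal-directed reasoning, which yields efficient inference without sacrificing performance.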

Abstract: Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, i.e., higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which explicitly evaluates the thinking process of the model and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that our SketchThinker-R1 achieves over 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.


[60] TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors cs.CV | cs.AI | cs.LGPDF

Wei-Yuan Cheng, Kai-Po Chang, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

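TL;DR: This paper proposes TA-Prompting, which enhances video large language models (VideoLLMs) with Temporal Anchors to improve dense video captioning. The method localizes event boundaries in untrimmed videos more precisely and uses an event coherent sampling strategy to produce captions that are consistent with the video content and temporally coherent.

Details

Motivation: Existing VideoLLMs struggle to identify event boundaries accurately in untrimmed videos, so the generated captions lack precise temporal grounding. A method is needed that strengthens the model's understanding and localization of temporal events.

Result: Experiments on multiple benchmark datasets show that TA-Prompting outperforms state-of-the-art VideoLLMs on dense video captioning, moment retrieval, and temporal question answering.

Insight: The novelty is the use of learnable temporal anchors for precise event localization, combined with an event coherent sampling strategy that keeps the captions consistent and relevant. This offers a new approach to temporal-aware understanding in VideoLLMs.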

Abstract: Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, in order to properly determine the output caption sequence from an arbitrary number of events presented within a video, we introduce an event coherent sampling strategy to select event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that our TA-Prompting compares favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks including moment retrieval and temporal QA.


[61] Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning cs.CVPDF

Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou

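TL;DR: This paper proposes Zoom-IQA, an image quality assessment (IQA) method based on a vision language model (VLM). It improves the reliability of assessment by emulating key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Training has two stages: supervised fine-tuning on the authors' GR-IQA dataset, which teaches the model to ground its assessments in key regions, followed by reinforcement learning for dynamic policy exploration, stabilized by a KL-Coverage regularizer and a Progressive Re-sampling Strategy. Experiments show gains in robustness, explainability, and generalization, and downstream tasks such as image restoration confirm the method's effectiveness.

Details

Motivation: Existing VLM-based IQA methods have limited ability to integrate visual and textual cues, which makes their reasoning unreliable. The paper aims to make the model generate more reliable quality descriptions and scores.

Result: Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization.

Insight: The novelty is the explicit emulation of key cognitive behaviors (uncertainty awareness, region reasoning, and iterative refinement) and a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning, in particular the KL-Coverage regularizer that stabilizes training and the Progressive Re-sampling Strategy that mitigates annotation bias.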

Abstract: Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or provide low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA, enabling joint generation of quality descriptions and scores. However, we notice that existing VLM-based IQA methods tend to exhibit unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions; and 2) reinforcement learning (RL) for dynamic policy exploration, primarily stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, and supported by a Progressive Re-sampling Strategy to mitigate annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.


[62] DCG ReID: Disentangling Collaboration and Guidance Fusion Representations for Multi-modal Vehicle Re-Identification cs.CV | cs.AIPDF

Aihua Zheng, Ya Gao, Shihao Li, Chenglong Li, Jin Tang

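TL;DR: This paper proposes DCG-ReID for multi-modal vehicle re-identification. A Dynamic Confidence-based Disentangling Weighting mechanism separates data with balanced quality distributions from data with unbalanced ones, and a Collaboration Fusion Module and a Guidance Fusion Module are designed for the two cases to improve intra-class consistency and inter-modal complementarity.

Details

Motivation: Existing methods handle all multi-modal data within a single fusion model and overlook the different needs of data with balanced and unbalanced quality distributions, which makes it hard to decouple the conflict between intra-class consistency and inter-modal heterogeneity.

Result: Extensive experiments on three multi-modal ReID benchmarks (WMVeID863, MSVR310, and RGBNT100) validate the effectiveness of the method.

Insight: The novelty is the Dynamic Confidence-based Disentangling Weighting mechanism together with two fusion strategies tailored to different quality-distribution scenarios. They reweight modal contributions dynamically and mine features differently per scenario, which improves multi-modal joint decision performance.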

Abstract: Multi-modal vehicle Re-Identification (ReID) aims to leverage complementary information from RGB, Near Infrared (NIR), and Thermal Infrared (TIR) modalities to retrieve the same vehicle. The challenges of multi-modal vehicle ReID arise from the uncertainty of modality quality distribution induced by inherent discrepancies across modalities, resulting in distinct conflicting fusion requirements for data with balanced and unbalanced quality distributions. Existing methods handle all multi-modal data within a single fusion model, overlooking the different needs of the two data types and making it difficult to decouple the conflict between intra-class consistency and inter-modal heterogeneity. To this end, we propose Disentangle Collaboration and Guidance Fusion Representations for Multi-modal Vehicle ReID (DCG-ReID). Specifically, to disentangle heterogeneous quality-distributed modal data without mutual interference, we first design the Dynamic Confidence-based Disentangling Weighting (DCDW) mechanism: dynamically reweighting three-modal contributions via interaction-derived modal confidence to build a disentangled fusion framework. Building on DCDW, we develop two scenario-specific fusion strategies: (1) for balanced quality distributions, Collaboration Fusion Module (CFM) mines pairwise consensus features to capture shared discriminative information and boost intra-class consistency; (2) for unbalanced distributions, Guidance Fusion Module (GFM) implements differential amplification of modal discriminative disparities to reinforce dominant modality advantages, guide auxiliary modalities to mine complementary discriminative info, and mitigate inter-modal divergence to boost multi-modal joint decision performance. Extensive experiments on three multi-modal ReID benchmarks (WMVeID863, MSVR310, RGBNT100) validate the effectiveness of our method. Code will be released upon acceptance.
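
The phrase "dynamically reweighting three-modal contributions via interaction-derived modal confidence" can be pictured as a confidence-weighted fusion. The sketch below takes the per-modality confidences as given inputs and turns them into weights with a softmax. Both the softmax and the weighted-sum fusion are assumptions for illustration; the paper's DCDW mechanism and its fusion modules are not specified in the abstract.

```python
import math

def modal_weights(confidences):
    """Softmax over per-modality confidence scores (softmax is assumed)."""
    peak = max(confidences.values())
    exps = {name: math.exp(c - peak) for name, c in confidences.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def fuse(features, confidences):
    """Confidence-weighted sum of same-length feature vectors."""
    weights = modal_weights(confidences)
    dim = len(next(iter(features.values())))
    return [sum(weights[m] * features[m][d] for m in features)
            for d in range(dim)]

# Toy features for the three modalities named in the abstract.
features = {"rgb": [3.0, 0.0], "nir": [6.0, 0.0], "tir": [9.0, 3.0]}
balanced = fuse(features, {"rgb": 1.0, "nir": 1.0, "tir": 1.0})
rgb_dominant = fuse(features, {"rgb": 5.0, "nir": 1.0, "tir": 1.0})
```

With equal confidences the fusion reduces to a plain average, while a dominant modality pulls the fused feature toward its own vector, which mirrors the balanced and unbalanced cases the abstract distinguishes.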


[63] PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding cs.CV | cs.AIPDF

Iñaki Erregue, Kamal Nasrollahi, Sergio Escalera

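TL;DR: PrismVAU is a lightweight, real-time inference system for multimodal Video Anomaly Understanding (VAU). It uses a single off-the-shelf multimodal large language model (MLLM) for anomaly scoring, explanation, and prompt optimization, with no fine-tuning, frame-level annotations, or external modules.

Details

Motivation: Existing VAU methods depend on fine-tuned MLLMs or external modules such as video captioners, which brings high annotation cost, complex training pipelines, and heavy inference overhead.

Result: Extensive experiments on standard VAD benchmarks show that PrismVAU delivers competitive detection performance and interpretable anomaly explanations.

Insight: The contributions are: 1) a weakly supervised Automatic Prompt Engineering (APE) framework that optimizes the textual anchors and prompts; 2) a two-stage complementary architecture (coarse anomaly scoring followed by an MLLM-based refinement module) for efficient contextual understanding; 3) a system built entirely on an off-the-shelf MLLM, with no instruction tuning or dense processing, which makes it more practical to deploy.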

Abstract: Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations without relying on instruction tuning, frame-level annotations, external modules, or dense processing, making it an efficient and practical solution for real-world applications.
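
Stage (1), frame-level scoring "via similarity to textual anchors", can be sketched as follows. The sketch assumes that frame and anchor embeddings already live in a shared space (how PrismVAU obtains them is not stated in the abstract), and the split into normal and abnormal anchors, the best-match rule, and the temperature are hypothetical choices.

```python
import math

def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def anomaly_score(frame, normal_anchors, abnormal_anchors, temperature=0.1):
    """Score in (0, 1): how much closer the frame is to abnormal anchors."""
    best_normal = max(cosine(frame, a) for a in normal_anchors)
    best_abnormal = max(cosine(frame, a) for a in abnormal_anchors)
    # Two-way softmax over the best matches, written as a logistic.
    return 1.0 / (1.0 + math.exp(-(best_abnormal - best_normal) / temperature))
```

A frame equidistant from both anchor sets scores 0.5; frames above a chosen threshold would then be handed to the MLLM-based refinement stage.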


[64] Towards Faithful Reasoning in Comics for Small MLLMs cs.CV | cs.AIPDF

Chengcheng Feng, Haojie Yin, Yucheng Jin, Kaizhu Huang

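TL;DR: For comic-based visual question answering (CVQA), this paper analyzes why standard Chain-of-Thought (CoT) prompting degrades the performance of small multimodal large language models (MLLMs) and proposes a new comic reasoning framework. The framework combines modular CoT generation, GRPO-based reinforcement fine-tuning, and a structured reward to produce more faithful and transferable reasoning chains. Experiments show that it surpasses existing methods on several humor-centric and abstract visual reasoning benchmarks, with the 3B model reaching SOTA performance.

Details

Motivation: Standard CoT prompting degrades performance on CVQA, especially for small MLLMs. CVQA relies on symbolic abstraction, narrative logic, and humor, unlike conventional VQA, and standard CoT suffers there from state entanglement, spurious transitions, and exploration inefficiency.

Result: Across five challenging benchmarks, covering comic VQA, meme understanding, and editorial cartoon interpretation, the proposed 3B model outperforms state-of-the-art methods. Plug-in experiments yield an additional average improvement of 12.1% across different MLLMs.

Insight: The contributions are: 1) identifying and analyzing the specific reasons standard CoT fails on CVQA (state entanglement and related issues); 2) a comic reasoning framework designed for small MLLMs that combines modular CoT generation, GRPO reinforcement fine-tuning, and a structured reward to improve faithfulness and transferability; 3) extending the approach to a broader class of humor-centric and abstract visual reasoning tasks, which confirms its generalization.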

Abstract: Comic-based visual question answering (CVQA) poses distinct challenges to multimodal large language models (MLLMs) due to its reliance on symbolic abstraction, narrative logic, and humor, which differ from conventional VQA tasks. Although Chain-of-Thought (CoT) prompting is widely used to enhance MLLM reasoning, surprisingly, its direct application to CVQA often degrades performance, especially in small-scale models. Our theoretical and empirical analyses reveal that standard CoT in CVQA suffers from state entanglement, spurious transitions, and exploration inefficiency, with small models particularly vulnerable in resource-constrained settings. To address these issues, we propose a novel comic reasoning framework, designed to produce more faithful and transferable reasoning chains in small MLLMs. Specifically, our framework combines modular CoT generation with GRPO-based reinforcement fine-tuning and a novel structured reward. Beyond comic VQA, we further evaluate our approach on a broader class of humor-centric and abstract visual reasoning tasks, including meme understanding and editorial cartoon interpretation. Across five challenging benchmarks, our 3B model outperforms state-of-the-art methods, and plug-in experiments yield an additional average improvement of $\mathbf{12.1\%}$ across different MLLMs.
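
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, by normalizing rewards within the group. The sketch below shows only that generic normalization. The paper's structured reward is not described in the abstract, so rewards are plain numbers here, and the choice of population standard deviation and the `eps` term are assumptions (implementations differ on both).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled reasoning chains for one question, two judged correct.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat the group average receive positive advantages and the rest negative ones; when all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.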


[65] ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios cs.CV | cs.MAPDF

Yihan Wei, Shenghai Yuan, Tianchen Deng, Boyang Lou, Enwen Hu

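TL;DR: This paper proposes ReCCur (Recursive Corner-Case Curation), a low-compute framework that turns noisy web imagery into auditable fine-grained labels through a multi-agent recursive pipeline, with the goal of more robust vision-language understanding in open and edge scenarios.

Details

Motivation: Corner cases (rare or extreme scenarios) are hard to collect and label at scale: web data are noisy, labels are brittle, and edge deployments rule out large-scale retraining.

Result: On realistic corner-case scenarios such as flooded-car inspection, ReCCur runs on consumer-grade GPUs, steadily improves data purity and separability, and needs only minimal human supervision, providing a practical basis for downstream training and evaluation under resource constraints.

Insight: The contributions are multi-modal consistency filtering; mixture-of-experts knowledge distillation with dual-confidence activation and uncertainty sampling; and region-evidence VLM adversarial labeling, which pairs a proposer with a validator to produce explainable labels. Together they form a recursive, auditable labeling pipeline suited to edge scenarios.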

Abstract: Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner-Case Curation), a low-compute framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM), crawls the web, and enforces tri-modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture-of-experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual-confidence activation and uncertainty sampling, converging to a high-precision set. Finally, region-evidence VLM adversarial labeling pairs a proposer (multi-granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.
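
The second stage, kNN voting across complementary encoders, can be sketched as below. The abstract does not define "dual-confidence activation", so the two thresholds used here (a per-encoder vote share and a cross-encoder agreement rate) are an assumed interpretation, and the encoder names `enc_a` and `enc_b`, the function names, and the toy embeddings are hypothetical.

```python
from collections import Counter

def knn_label(query, bank, k):
    """Majority label among the k nearest bank items, with its vote share."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(bank, key=lambda item: sq_dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)

def ensemble_decision(queries, banks, k=3, min_vote=0.6, min_agree=1.0):
    """Accept a label only if encoders agree and each vote is confident."""
    results = [knn_label(queries[name], banks[name], k) for name in banks]
    tally = Counter(label for label, _ in results)
    top_label, supporters = tally.most_common(1)[0]
    agreement = supporters / len(results)
    vote = min(share for label, share in results if label == top_label)
    accepted = vote >= min_vote and agreement >= min_agree
    return top_label, accepted

bank = [([0, 0], "flooded"), ([0, 1], "flooded"),
        ([5, 5], "dry"), ([5, 6], "dry")]
banks = {"enc_a": bank, "enc_b": bank}
agree = ensemble_decision({"enc_a": [0, 0.5], "enc_b": [0, 0.4]}, banks)
split = ensemble_decision({"enc_a": [0, 0.5], "enc_b": [5, 5.5]}, banks)
```

In this reading, samples that are not accepted would be the natural candidates for uncertainty sampling and the light human spot checks the abstract mentions.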


[66] SA-ResGS: Self-Augmented Residual 3D Gaussian Splatting for Next Best View Selection cs.CVPDF

Kim Jun-Seong, Tae-Hyun Oh, Eduardo Pérez-Pellitero, Youngkyoon Jang

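TL;DR: This paper proposes SA-ResGS, a framework that stabilizes uncertainty quantification and strengthens uncertainty-aware supervision in next-best-view (NBV) selection for active scene reconstruction. It generates self-augmented point clouds via triangulation to estimate scene coverage, and introduces the first residual learning strategy tailored to 3D Gaussian Splatting to address the under-supervision caused by sparse, wide-baseline views.

Details

Motivation: In next-best-view selection for active scene reconstruction, sparse and wide-baseline views lead to unstable uncertainty quantification, insufficient supervision, and unstable training of the Gaussian model.

Result: In active view selection experiments, SA-ResGS outperforms state-of-the-art baselines in both reconstruction quality and view selection robustness.

Insight: The contributions include a physically grounded view selection strategy that promotes efficient, uniform scene coverage; an uncertainty-aware residual supervision scheme that amplifies the learning signal for weakly contributing Gaussians; and an implicit unbiasing of uncertainty quantification that follows from constrained view selection and residual supervision. Together these mitigate the conflicting effects of wide-baseline exploration and sparse-view ambiguity in NBV planning.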

Abstract: We propose Self-Augmented Residual 3D Gaussian Splatting (SA-ResGS), a novel framework to stabilize uncertainty quantification and enhance uncertainty-aware supervision in next-best-view (NBV) selection for active scene reconstruction. SA-ResGS improves both the reliability of uncertainty estimates and their effectiveness for supervision by generating Self-Augmented point clouds (SA-Points) via triangulation between a training view and a rasterized extrapolated view, enabling efficient scene coverage estimation. While improving scene coverage through physically guided view selection, SA-ResGS also addresses the challenge of under-supervised Gaussians, exacerbated by sparse and wide-baseline views, by introducing the first residual learning strategy tailored for 3D Gaussian Splatting. This targeted supervision enhances gradient flow in high-uncertainty Gaussians by combining uncertainty-driven filtering with dropout- and hard-negative-mining-inspired sampling. Our contributions are threefold: (1) a physically grounded view selection strategy that promotes efficient and uniform scene coverage; (2) an uncertainty-aware residual supervision scheme that amplifies learning signals for weakly contributing Gaussians, improving training stability and uncertainty estimation across scenes with diverse camera distributions; (3) an implicit unbiasing of uncertainty quantification as a consequence of constrained view selection and residual supervision, which together mitigate conflicting effects of wide-baseline exploration and sparse-view ambiguity in NBV planning. Experiments on active view selection demonstrate that SA-ResGS outperforms state-of-the-art baselines in both reconstruction quality and view selection robustness.


[67] Motion Blur Robust Wheat Pest Damage Detection with Dynamic Fuzzy Feature Fusion cs.CV | cs.AIPDF

Han Zhang, Yanwei Wang, Fang Li, Hongjun Wang

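TL;DR: This paper proposes the Dynamic Fuzzy Robust Convolutional Pyramid (DFRCP), a plug-in module that improves YOLOv11 object detection under motion blur. It fuses multi-scale features and introduces Dynamic Robust Switch units that adaptively inject fuzzy features to strengthen global perception, and it includes an efficient CUDA parallel kernel for fast processing, making it suitable for edge deployment.

Details

Motivation: Motion blur caused by camera shake degrades object detection. Existing methods either treat blur as noise and lose discriminative structure, or restore the full image at the cost of added latency, which makes them hard to deploy on resource-constrained devices.

Result: On a private wheat pest damage dataset of about 3,500 images, trained with two blur augmentation regimes, YOLOv11 with DFRCP achieves about 10.4% higher accuracy on blurred test sets than the YOLOv11 baseline, with only a modest training-time overhead.

Insight: The contributions include a dynamic fuzzy feature fusion mechanism, Dynamic Robust Switch units that adaptively inject fuzzy features, and an efficient CUDA parallel rotation-and-interpolation kernel, which together raise processing speed while preserving detection accuracy for edge computing scenarios.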

Abstract: Motion blur caused by camera shake produces ghosting artifacts that substantially degrade edge side object detection. Existing approaches either suppress blur as noise and lose discriminative structure, or apply full image restoration that increases latency and limits deployment on resource constrained devices. We propose DFRCP, a Dynamic Fuzzy Robust Convolutional Pyramid, as a plug in upgrade to YOLOv11 for blur robust detection. DFRCP enhances the YOLOv11 feature pyramid by combining large scale and medium scale features while preserving native representations, and by introducing Dynamic Robust Switch units that adaptively inject fuzzy features to strengthen global perception under jitter. Fuzzy features are synthesized by rotating and nonlinearly interpolating multiscale features, then merged through a transparency convolution that learns a content adaptive trade off between original and fuzzy cues. We further develop a CUDA parallel rotation and interpolation kernel that avoids boundary overflow and delivers more than 400 times speedup, making the design practical for edge deployment. We train with paired supervision on a private wheat pest damage dataset of about 3,500 images, augmented threefold using two blur regimes, uniform image wide motion blur and bounding box confined rotational blur. On blurred test sets, YOLOv11 with DFRCP achieves about 10.4 percent higher accuracy than the YOLOv11 baseline with only a modest training time overhead, reducing the need for manual filtering after data collection.


[68] On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning cs.CV | cs.AI | cs.CCPDF

Siyi Lyu, Quan Liu, Feng Yan

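TL;DR: This paper examines the intrinsic limits of Vision Transformers (ViTs) on non-solvable spatial reasoning tasks such as mental rotation. The authors argue that the limitation stems from the circuit complexity inherent to the ViT architecture rather than from data scale. They formalize spatial understanding as learning a group homomorphism and show that, for non-solvable groups (e.g., the 3D rotation group SO(3)), maintaining such a structure-preserving embedding is computationally lower-bounded by the Word Problem, which is NC^1-complete. By contrast, constant-depth ViTs are strictly bounded by TC^0. Under the conjecture that TC^0 ≠ NC^1, the authors establish a complexity boundary: constant-depth ViTs fundamentally lack the logical depth needed to capture non-solvable spatial structures efficiently.

Details

Motivation: ViTs excel at semantic recognition yet fail systematically on spatial reasoning tasks. The authors attribute this limitation to the intrinsic circuit complexity of the architecture rather than to insufficient data scale, the usual explanation.

Result: Latent-space probing confirms the complexity gap: ViT representations undergo structural collapse on non-solvable tasks as compositional depth increases.

Insight: The contribution is to formalize spatial understanding as a group homomorphism learning problem and to derive, from computational complexity theory, a theoretical boundary (TC^0 vs. NC^1) on what ViTs can do on non-solvable spatial reasoning tasks, offering a new theoretical framework for understanding the inherent capabilities and limits of Transformer architectures.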

Abstract: Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, we propose that this limitation arises from the intrinsic circuit complexity of the architecture. We formalize spatial understanding as learning a Group Homomorphism: mapping image sequences to a latent space that preserves the algebraic structure of the underlying transformation group. We demonstrate that for non-solvable groups (e.g., the 3D rotation group $\mathrm{SO}(3)$), maintaining such a structure-preserving embedding is computationally lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, we prove that constant-depth ViTs with polynomial precision are strictly bounded by $\mathsf{TC^0}$. Under the conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, we establish a complexity boundary: constant-depth ViTs fundamentally lack the logical depth to efficiently capture non-solvable spatial structures. We validate this complexity gap via latent-space probing, demonstrating that ViT representations suffer a structural collapse on non-solvable tasks as compositional depth increases.
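
For readers unfamiliar with the terminology, the structure-preserving condition described in the abstract can be written out as below. The symbols $\phi$, $G$, and $Z$ are notation assumed here for illustration (the abstract does not fix any), with $G$ the transformation group and $Z$ the latent space:

$$
\phi : G \to \mathrm{GL}(Z), \qquad \phi(g_1 g_2) = \phi(g_1)\,\phi(g_2) \quad \text{for all } g_1, g_2 \in G .
$$

Under this reading, a sequence of $n$ transformations must be embedded so that $\phi(g_1)\,\phi(g_2)\cdots\phi(g_n) = \phi(g_1 g_2 \cdots g_n)$, where $n$ plays the role of the compositional depth. The Word Problem mentioned in the abstract asks, given $g_1, \dots, g_n \in G$, whether

$$
g_1 g_2 \cdots g_n = e ,
$$

with $e$ the identity element of $G$.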


[69] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation cs.CV | cs.AIPDF

Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding

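TL;DR: This paper proposes IBISAgent, an agentic multimodal large language model (MLLM) that reformulates biomedical image segmentation as a vision-centric, multi-step decision-making process. It generates interleaved reasoning and text-based click actions to invoke segmentation tools, producing high-quality masks without architectural modifications, and uses a two-stage training framework to improve robustness on complex medical referring and reasoning segmentation tasks.

Details

Motivation: Existing medical MLLM approaches face two major challenges in reaching fine-grained, pixel-level understanding. First, they introduce implicit segmentation tokens and require fine-tuning the MLLM and external pixel decoders simultaneously, which raises the risk of catastrophic forgetting and limits generalization. Second, most rely on single-pass reasoning and cannot iteratively refine segmentation results, leading to suboptimal performance.

Result: Extensive experiments show that IBISAgent consistently outperforms both closed-source and open-source state-of-the-art methods across multiple datasets.

Insight: The contribution is to recast segmentation as a multi-step agentic decision process that invokes tools through interleaved reasoning and text-based click actions, combined with two-stage training (cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained rewards). This fosters pixel-level visual reasoning and avoids the generalization limits that come with modifying the model architecture.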

Abstract: Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model’s robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.


[70] Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA cs.CVPDF

Tong Wu, Thanet Markchom

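TL;DR: This paper proposes a multi-agent LLM framework designed for visual question answering (VQA) on cartoon imagery. The framework comprises three agents (visual, language, and critic) that collaborate to integrate visual cues and narrative context for structured reasoning, and it is systematically evaluated on two cartoon VQA datasets, Pororo and Simpsons.

Details

Motivation: Standard LLMs struggle with cartoon VQA, which requires interpreting exaggerated visual abstraction and narrative-driven context that models trained on natural images do not handle adequately.

Result: Experiments on the Pororo and Simpsons cartoon VQA datasets provide a detailed analysis of how each agent contributes to the final prediction, giving a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.

Insight: The contribution is a multi-agent LLM framework built specifically for cartoon VQA, in which visual, language, and critic agents divide the work to strengthen structured reasoning. Viewed objectively, this modular multi-agent approach offers a new angle for understanding LLM behaviour in complex multimodal tasks.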

Abstract: Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents: visual agent, language agent and critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.


[71] Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs cs.CV | cs.AIPDF

Chenchen Lin, Sanbao Su, Rachel Luo, Yuxiao Chen, Yan Wang

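TL;DR: This paper proposes TGIF (Text-Guided Inter-layer Fusion), a lightweight module for mitigating hallucination in multimodal large language models (MLLMs). It treats the layers of the vision encoder as depth-wise "experts" and predicts, from the text prompt, how to fuse their visual features, making fuller use of the visual hierarchy and strengthening visual grounding.

Details

Motivation: Existing MLLMs typically use only a single late-layer feature from a frozen vision encoder and under-utilize its rich hierarchy of visual cues, so they are prone to hallucinations that are detached from image evidence and lean too heavily on language priors. Most mitigation strategies act on the text side, and existing multi-layer fusion methods are static and cannot adapt to the query.

Result: Integrated into LLaVA-1.5-7B, TGIF yields consistent improvements on hallucination, OCR, and VQA benchmarks while preserving or improving performance on ScienceQA, GQA, and MMBench.

Insight: The core contribution is a query-conditioned, hierarchy-aware method for dynamically fusing visual features (TGIF). It follows the principle of direct external fusion, requires no updates to the vision encoder, and adds minimal overhead, offering an effective route to stronger visual grounding and less hallucination by exploiting the encoder's internal hierarchy more finely.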

Abstract: Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder’s rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise “experts” and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion, requires no vision-encoder updates, and adds minimal overhead. Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks, while preserving or improving performance on ScienceQA, GQA, and MMBench. These results suggest that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.
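
The idea of a prompt-dependent mixture over encoder layers can be sketched in a few lines. This is a minimal illustration assuming a linear router followed by a softmax; the abstract does not specify TGIF's actual router or fusion operator, and `fuse_layers`, `router_weights`, and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_layers(layer_features, text_embedding, router_weights):
    """Blend per-layer visual features with gates predicted from the text prompt.

    layer_features: one feature vector per encoder layer (the depth-wise "experts").
    router_weights: one weight vector per layer, same size as text_embedding
                    (an assumed linear router, not the paper's design).
    """
    logits = [dot(w, text_embedding) for w in router_weights]
    gates = softmax(logits)
    dim = len(layer_features[0])
    fused = [sum(g * feat[i] for g, feat in zip(gates, layer_features))
             for i in range(dim)]
    return fused, gates

layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # shallow, middle, deep
router_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
fused, gates = fuse_layers(layer_features, [2.0, 0.0], router_weights)
```

With this toy prompt embedding the router favours the first layer, so `gates[0]` is the largest weight; a different prompt embedding would shift the mixture, which is the contrast with static multi-layer fusion that the abstract draws.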


[72] Unified Thinker: A General Reasoning Modular Core for Image Generation cs.CV | cs.AIPDF

Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao

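TL;DR: This paper proposes Unified Thinker, a task-agnostic reasoning architecture for general image generation. It decouples the reasoning component (Thinker) from the image Generator and uses a two-stage training paradigm (first building a structured planning interface, then applying reinforcement learning to align with pixel-level feedback) to close the reasoning-execution gap that generative models show on logic-intensive instructions.

Details

Motivation: Despite progress in image synthesis, current generative models still show a clear reasoning-execution gap when following logic-intensive instructions, while closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, exposing a large gap to current open-source models. The authors argue that closing it requires executable reasoning: decomposing high-level intents into verifiable plans that directly steer the generation process.

Result: Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.

Insight: The main contributions are a modular, pluggable general-purpose reasoning core that decouples reasoning from generation so each can be upgraded independently, and a training paradigm that combines a structured planning interface with reinforcement learning on pixel-level feedback to optimize visual correctness rather than textual plausibility. This offers a scalable framework for improving the logical reasoning of generative models.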

Abstract: Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning–execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.


[73] DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation cs.CVPDF

Jiajun jiao, Haowei Zhu, Puyuan Yang, Jianghui Wang, Ji Liu

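TL;DR: This paper proposes an LLM-driven framework for automatically generating and evaluating acceleration code for diffusion models. It has two parts: DiffBench, a benchmark that provides a comprehensive three-stage automated evaluation pipeline for diffusion acceleration code, and DiffAgent, an agent that, for an arbitrary diffusion model, iteratively generates optimal acceleration strategies and code through a closed-loop workflow of planning, debugging, and code generation components, with a genetic algorithm extracting performance feedback from the execution environment.

Details

Motivation: Diffusion models have achieved remarkable success in image and video generation, but their inherently multi-step inference imposes substantial computational overhead that hinders real-world deployment. Accelerating them is essential, yet how to combine multiple acceleration techniques remains a significant challenge.

Result: Extensive experiments show that DiffBench offers a thorough evaluation of generated code and that DiffAgent significantly outperforms existing LLMs at producing effective diffusion acceleration strategies.

Insight: The contribution is an end-to-end LLM-driven framework that unites automated evaluation (DiffBench) with automated strategy generation (DiffAgent). DiffAgent uses a closed-loop iterative workflow of planning, debugging, and code generation, and adds a genetic algorithm that uses feedback from the execution environment to guide code optimization, combining LLM code generation with performance-based evolutionary search.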

Abstract: Diffusion models have achieved remarkable success in image and video generation. However, their inherently multiple step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large language models (LLMs) for automated acceleration code generation and evaluation. First, we present DiffBench, a comprehensive benchmark that implements a three stage automated evaluation pipeline across diverse diffusion architectures, optimization combinations and deployment scenarios. Second, we propose DiffAgent, an agent that generates optimal acceleration strategies and codes for arbitrary diffusion models. DiffAgent employs a closed-loop workflow in which a planning component and a debugging component iteratively refine the output of a code generation component, while a genetic algorithm extracts performance feedback from the execution environment to guide subsequent code refinements. We provide a detailed explanation of the DiffBench construction and the design principles underlying DiffAgent. Extensive experiments show that DiffBench offers a thorough evaluation of generated codes and that DiffAgent significantly outperforms existing LLMs in producing effective diffusion acceleration strategies.


[74] AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation cs.CV | cs.AI | cs.LGPDF

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

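TL;DR: This paper proposes AnatomiX, an anatomy-aware multimodal large language model for chest X-ray interpretation. It follows a two-stage approach: it first identifies anatomical structures and extracts their features, then uses a large language model to perform downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding.

Details

Motivation: Existing multimodal medical LLMs struggle with spatial reasoning and anatomical understanding in chest X-ray interpretation, and existing grounding techniques fail to establish true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain.

Result: Extensive experiments across multiple benchmarks show that AnatomiX achieves superior anatomical reasoning, improving performance by more than 25% over existing approaches on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning.

Insight: The contribution is an explicit two-stage, anatomy-aware design inspired by the radiological workflow, which decouples anatomical structure identification from language-model reasoning. This yields more accurate anatomical correspondence and spatial understanding, improving the reliability and performance of medical image interpretation.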

Abstract: Multimodal medical large language models have shown impressive progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://github.com/aneesurhashmi/anatomix


[75] UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision cs.CV | cs.AIPDF

Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang

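TL;DR: This paper proposes UniCorn, a framework that addresses the mismatch between comprehension and generation in unified multimodal models, termed Conduction Aphasia. It partitions a single model into three roles (Proposer, Solver, and Judge), generates high-quality interactions through self-play, and uses cognitive pattern reconstruction to distill latent understanding into explicit generative signals, enabling self-improvement without external data or teacher supervision.

Details

Motivation: Unified multimodal models have achieved remarkable success in cross-modal comprehension, but a clear gap remains in their ability to use that internal knowledge for high-quality generation. This mismatch between comprehension and generation is formalized as Conduction Aphasia.

Result: Across six general image generation benchmarks, UniCorn achieves comprehensive and substantial improvements over the base model. It reaches SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and the newly proposed UniCycle benchmark, with further gains of +5.0 on WISE and +6.5 on OneIG.

Insight: The contributions include: 1) partitioning a single model into three collaborative roles for self-play that generates high-quality supervision; 2) cognitive pattern reconstruction, which distills latent understanding into explicit generative signals; and 3) UniCycle, a cycle-consistency benchmark based on a text-to-image-to-text reconstruction loop that validates the restoration of multimodal coherence. The method demonstrates the scalability of fully self-supervised refinement for multimodal intelligence.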

Abstract: While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.


[76] LTX-2: Efficient Joint Audio-Visual Foundation Model cs.CVPDF

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko

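TL;DR: LTX-2 is an open-source foundation model for joint audio-video generation that produces high-quality, temporally synchronized audiovisual content in a unified manner. It uses an asymmetric dual-stream transformer, trained and run efficiently through bidirectional cross-attention layers and cross-modality conditioning, and adds a multilingual text encoder and modality-aware classifier-free guidance to improve generation quality and controllability.

Details

Motivation: Existing text-to-video diffusion models generate only silent video, missing the semantic, emotional, and atmospheric cues that audio provides. The goal is a model that generates high-quality, synchronized audiovisual content in a unified way.

Result: In evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, and delivers results comparable to proprietary models at a fraction of their computational cost and inference time.

Insight: The main contributions include: 1) an asymmetric dual-stream transformer that allocates more parameters to video for efficient joint modeling; 2) bidirectional audio-video cross-attention with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning; 3) modality-aware classifier-free guidance that improves audiovisual alignment and controllability; and 4) the ability to generate rich, coherent audio tracks containing speech, background sound, and foley elements.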

Abstract: Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent – missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene – complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.


[77] A Versatile Multimodal Agent for Multimedia Content Generation cs.CVPDF

Daoan Zhang, Wenlin Yao, Xiaoyang Wang, Yebowen Hu, Jiebo Luo

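TL;DR: This paper proposes MultiMedia-Agent, a multimodal agent for automating complex multimedia content generation. The agent system includes a data generation pipeline, a tool library for content creation, and metrics for evaluating preference alignment, and it draws on skill acquisition theory to model training data curation and agent training. Generation plans are optimized with a two-stage correlation strategy (self-correlation and model preference correlation), and the agent is trained in three stages (base/success plan fine-tuning and preference optimization). Experiments indicate that the approach is effective and produces better multimedia content than novel models.

Details

Motivation: Most current AIGC models can serve only as individual components within specific application scenarios; they cannot complete tasks end-to-end in real-world applications and struggle to integrate multimodal outputs such as images, video, audio, and text. Agent-based systems make it possible to tackle complex content generation tasks.

Result: Comparison results show that the proposed approach is effective and that MultiMedia-Agent generates better multimedia content than novel models.

Insight: The contributions include applying skill acquisition theory to model training data curation and agent training; a two-stage correlation strategy (self-correlation and model preference correlation) for plan optimization; and a three-stage training approach (base/success plan fine-tuning and preference optimization). Viewed objectively, the work builds an end-to-end agent framework that integrates data generation, tool invocation, and evaluation to meet the complex multimodal demands of real-world multimedia creation.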

Abstract: With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most models can only serve as individual components within specific application scenarios and are not capable of completing tasks end-to-end in real-world applications. In real-world applications, editing experts often work with a wide variety of images and video inputs, producing multimodal outputs – a video typically includes audio, text, and other elements. This level of integration across multiple modalities is something current models are unable to achieve effectively. However, the rise of agent-based systems has made it possible to use AI tools to tackle complex content generation tasks. To deal with these complex scenarios, in this paper, we propose a MultiMedia-Agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce the skill acquisition theory to model the training data curation and agent training. We designed a two-stage correlation strategy for plan optimization, including self-correlation and model preference correlation. Additionally, we utilized the generated plans to train the MultiMedia-Agent via a three-stage approach including base/success plan finetune and preference optimization. The comparison results demonstrate that our approaches are effective and the MultiMedia-Agent can generate better multimedia content compared to novel models.


[78] Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training cs.CVPDF

Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai

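TL;DR: Muses is a training-free, feed-forward method for generating fantasy 3D creatures. Using the 3D skeleton as its foundational representation, it splits creation into three steps (design, composition, and generation): it first constructs a creatively composed 3D skeleton, then performs voxel-based assembly in a structured latent space, and finally generates a style-consistent texture under skeletal guidance, overcoming the limits of existing methods in part-level manipulation and out-of-domain generation.

Details

Motivation: Existing methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent fantasy 3D creatures because of intricate part-level manipulation and limited out-of-domain generation ability.

Result: Extensive experiments show that Muses achieves state-of-the-art visual fidelity and alignment with textual descriptions, and shows potential for flexible 3D object editing.

Insight: The contribution is to use the 3D skeleton as the foundational representation of biological form and to formalize 3D content creation as a structure-aware pipeline: skeleton construction through graph-constrained reasoning, voxel-based assembly in a structured latent space, and image-guided appearance modeling under skeletal conditions, achieving high-quality generation without training.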

Abstract: We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses’ state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and potential on flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.


eess.IV [Back]

[79] Expert-Guided Explainable Few-Shot Learning with Active Sample Selection for Medical Image Analysis eess.IV | cs.AI | cs.CVPDF

Longwei Wang, Ifrat Ikhtear Uddin, KC Santosh

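TL;DR: This paper proposes a dual framework of expert-guided explainable few-shot learning and active sample selection (EGxFSL and xGAL) to address the scarcity of labeled data and the lack of model interpretability in medical image analysis. EGxFSL integrates radiologist-defined regions of interest as spatial supervision through a Grad-CAM-based Dice loss, jointly optimized with prototypical classification for interpretable few-shot learning. xGAL introduces an iterative sample acquisition strategy that prioritizes predictive uncertainty and attention misalignment, forming a closed loop in which explainability guides both training and sample selection.

Details

Motivation: Medical image analysis faces two critical challenges, scarce labeled data and poor model interpretability, both of which hinder clinical AI deployment. Few-shot learning (FSL) addresses data limitations but its predictions lack transparency; active learning (AL) optimizes data acquisition but overlooks the interpretability of the acquired samples. An integrated approach that tackles both problems is therefore needed.

Result: On the BraTS (MRI), VinDr-CXR (chest X-ray), and SIIM-COVID-19 (chest X-ray) datasets, the method reaches accuracies of 92%, 76%, and 62%, respectively, consistently outperforming non-guided baselines on all datasets. Under severe data constraints, xGAL reaches 76% accuracy with only 680 samples, versus 57% for random sampling. Grad-CAM visualizations show that guided models focus on diagnostically relevant regions, and validation on breast ultrasound confirms cross-modality applicability.

Insight: The contribution is to integrate expert knowledge (radiologist-defined ROIs) into few-shot learning as spatial supervision, and to use an active learning strategy (xGAL) in which explainability (attention misalignment) and predictive uncertainty jointly serve as sample selection criteria, producing a closed-loop system where explainability guides both training and sample selection. Viewed objectively, combining expert priors with interpretability measures inside a data-efficient learning pipeline is a promising route past the twin "data and trust" bottleneck of medical AI.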

Abstract: Medical image analysis faces two critical challenges: scarcity of labeled data and lack of model interpretability, both hindering clinical AI deployment. Few-shot learning (FSL) addresses data limitations but lacks transparency in predictions. Active learning (AL) methods optimize data acquisition but overlook interpretability of acquired samples. We propose a dual-framework solution: Expert-Guided Explainable Few-Shot Learning (EGxFSL) and Explainability-Guided AL (xGAL). EGxFSL integrates radiologist-defined regions-of-interest as spatial supervision via Grad-CAM-based Dice loss, jointly optimized with prototypical classification for interpretable few-shot learning. xGAL introduces iterative sample acquisition prioritizing both predictive uncertainty and attention misalignment, creating a closed-loop framework where explainability guides training and sample selection synergistically. On the BraTS (MRI), VinDr-CXR (chest X-ray), and SIIM-COVID-19 (chest X-ray) datasets, we achieve accuracies of 92%, 76%, and 62%, respectively, consistently outperforming non-guided baselines across all datasets. Under severe data constraints, xGAL achieves 76% accuracy with only 680 samples versus 57% for random sampling. Grad-CAM visualizations demonstrate guided models focus on diagnostically relevant regions, with generalization validated on breast ultrasound confirming cross-modality applicability.
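
A Dice loss between an attention map and an expert ROI mask, added to a classification loss, can be sketched as follows. This is a minimal illustration of the general soft-Dice formulation, not the paper's exact loss: the weighting factor `lam`, the smoothing term `eps`, and the function names are assumptions, and the attention map is taken as a plain 2D list already scaled to [0, 1] rather than computed by Grad-CAM.

```python
def soft_dice_loss(attention, roi_mask, eps=1e-6):
    """1 - Dice overlap between a [0, 1] attention map and a binary ROI mask."""
    flat_a = [v for row in attention for v in row]
    flat_m = [v for row in roi_mask for v in row]
    intersection = sum(a * m for a, m in zip(flat_a, flat_m))
    denominator = sum(flat_a) + sum(flat_m)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def joint_loss(classification_loss, attention, roi_mask, lam=0.5):
    """Classification term plus a weighted attention-alignment term.

    lam is a hypothetical trade-off weight; the abstract only says the two
    objectives are jointly optimized.
    """
    return classification_loss + lam * soft_dice_loss(attention, roi_mask)

roi = [[1, 0], [0, 1]]
aligned = soft_dice_loss([[1.0, 0.0], [0.0, 1.0]], roi)      # attention on the ROI
misaligned = soft_dice_loss([[0.0, 1.0], [1.0, 0.0]], roi)   # attention off the ROI
```

The aligned map gives a loss near 0 and the misaligned map a loss near 1. The same quantity could plausibly serve as an attention-misalignment score when ranking samples for acquisition, although the abstract does not state how xGAL measures misalignment.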


[80] Annealed Langevin Posterior Sampling (ALPS): A Rapid Algorithm for Image Restoration with Multiscale Energy Models eess.IV | cs.AI | cs.CVPDF

Jyothi Rikhab Chand, Mathews Jacob

TL;DR: 本文提出了一种名为退火朗之万后验采样(ALPS)的快速算法,用于解决成像中的逆问题。该方法通过一种快速的蒸馏策略,将预训练扩散模型的优势转移到多尺度能量模型(EBMs)中,从而克服了传统EBMs计算成本高和训练不稳定的历史缺点。ALPS算法利用EBMs的可组合性,支持最大后验(MAP)、最小均方误差(MMSE)估计和不确定性量化,并在图像修复和MRI重建任务中,在准确性和效率上匹配或超越了基于扩散模型的基线方法。

Details

Motivation: 解决成像逆问题需要支持高效推理、不确定性量化和原则性概率推理的模型。能量模型(EBMs)具有可解释的能量景观和组合结构,非常适合此任务,但历史上存在计算成本高和训练不稳定的问题。本文旨在克服EBMs的这些缺点,并利用其优势为逆问题提供一个可扩展且原则性的解决方案。

Result: 在图像修复和MRI重建任务上的实验表明,该方法在准确性和效率上匹配或超越了基于扩散模型的基线方法,同时支持MAP恢复。

Insight: 创新点在于提出了一种快速的蒸馏策略,将预训练扩散模型的优势转移到多尺度EBMs中,从而保留了基于势能框架固有的可解释性和可组合性。在此基础上,提出的ALPS算法对静态后验分布进行退火,这些分布定义明确且可组合,避免了扩散模型对隐变量使用复杂引导策略的需要。这为成像逆问题提供了一个可扩展且原则性的解决方案,具有在科学和临床环境中实际部署的潜力。

Abstract: Solving inverse problems in imaging requires models that support efficient inference, uncertainty quantification, and principled probabilistic reasoning. Energy-Based Models (EBMs), with their interpretable energy landscapes and compositional structure, are well-suited for this task but have historically suffered from high computational costs and training instability. To overcome the historical shortcomings of EBMs, we introduce a fast distillation strategy to transfer the strengths of pre-trained diffusion models into multi-scale EBMs. These distilled EBMs enable efficient sampling and preserve the interpretability and compositionality inherent to potential-based frameworks. Leveraging EBM compositionality, we propose the Annealed Langevin Posterior Sampling (ALPS) algorithm for Maximum-A-Posteriori (MAP), Minimum Mean Square Error (MMSE), and uncertainty estimates for inverse problems in imaging. Unlike diffusion models that use complex guidance strategies for latent variables, we perform annealing on static posterior distributions that are well-defined and composable. Experiments on image inpainting and MRI reconstruction demonstrate that our method matches or surpasses diffusion-based baselines in both accuracy and efficiency, while also supporting MAP recovery. Overall, our framework offers a scalable and principled solution for inverse problems in imaging, with potential for practical deployment in scientific and clinical settings. ALPS code is available at https://github.com/JyoChand/ALPS.
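
Annealed Langevin sampling from a composed posterior (prior energy plus data-fidelity energy) can be illustrated on a one-dimensional toy problem. This sketch is not the released ALPS code: an analytic Gaussian prior smoothed at scale `sigma` stands in for the distilled multi-scale EBM, the measurement model is `y = x + noise`, and the schedule, step size, and function names are assumptions chosen so that the result can be checked against the closed-form posterior.

```python
import math
import random

def grad_energy(x, y, sigma, noise_var):
    """Gradient of the composed posterior energy at smoothing scale sigma."""
    # Smoothed prior N(0, 1 + sigma^2): a stand-in for a multi-scale EBM.
    prior_grad = x / (1.0 + sigma ** 2)
    # Gaussian likelihood for the toy measurement y = x + noise.
    data_grad = (x - y) / noise_var
    return prior_grad + data_grad

def annealed_langevin(y, sigmas, noise_var=0.25, steps=50, step_size=0.05, rng=None):
    """Run unadjusted Langevin dynamics on each posterior in a coarse-to-fine schedule."""
    rng = rng or random.Random()
    x = rng.gauss(0.0, 1.0)
    for sigma in sigmas:
        for _ in range(steps):
            noise = rng.gauss(0.0, 1.0)
            x = (x - step_size * grad_energy(x, y, sigma, noise_var)
                 + math.sqrt(2.0 * step_size) * noise)
    return x

rng = random.Random(0)
samples = [annealed_langevin(2.0, [3.0, 1.0, 0.3, 0.0], rng=rng) for _ in range(1000)]
mmse = sum(samples) / len(samples)                                # posterior-mean estimate
spread = sum((s - mmse) ** 2 for s in samples) / len(samples)     # uncertainty estimate
```

For this toy model the exact posterior mean at the final scale is `y / (1 + noise_var) = 1.6`, so the sample mean should land close to that value. Dropping the noise term in the update turns each inner loop into gradient descent on the same energy, which is one way a MAP estimate could be obtained from the same composed posterior.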


econ.EM [Back]

[81] Detecting and Mitigating Treatment Leakage in Text-Based Causal Inference: Distillation and Sensitivity Analysis econ.EM | cs.CL | econ.GN | stat.MLPDF

Adel Daoud, Richard Johansson, Connor T. Jerzak

TL;DR: 本文针对文本因果推断中存在的治疗泄漏问题,提出了形式化定义、四种文本蒸馏方法以及模拟与实证验证。研究发现适度蒸馏能在减少偏误和保留混杂信息之间取得最佳平衡。

Details

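Motivation: When text serves as a proxy for confounders, it may also contain signals predictive of treatment status, which causes treatment leakage bias. Existing methods lack systematic ways to identify and mitigate this problem.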

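Result: Simulations with synthetic text and an empirical application on International Monetary Fund structural adjustment programs and child mortality show that moderate distillation effectively balances bias reduction against confounder retention, whereas overly stringent approaches degrade estimate precision.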

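Insight: The contributions include a formal definition of treatment leakage, four text distillation methods (such as similarity-based passage removal and distant supervision classification), and the first systematic treatment of identifying and mitigating treatment leakage in text-based causal inference, providing new tools for text-based confounder adjustment.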

Abstract: Text-based causal inference increasingly employs textual data as proxies for unobserved confounders, yet this approach introduces a previously undertheorized source of bias: treatment leakage. Treatment leakage occurs when text intended to capture confounding information also contains signals predictive of treatment status, thereby inducing post-treatment bias in causal estimates. Critically, this problem can arise even when documents precede treatment assignment, as authors may employ future-referencing language that anticipates subsequent interventions. Despite growing recognition of this issue, no systematic methods exist for identifying and mitigating treatment leakage in text-as-confounder applications. This paper addresses this gap through three contributions. First, we provide formal statistical and set-theoretic definitions of treatment leakage that clarify when and why bias occurs. Second, we propose four text distillation methods – similarity-based passage removal, distant supervision classification, salient feature removal, and iterative nullspace projection – designed to eliminate treatment-predictive content while preserving confounder information. Third, we validate these methods through simulations using synthetic text and an empirical application examining International Monetary Fund structural adjustment programs and child mortality. Our findings indicate that moderate distillation optimally balances bias reduction against confounder retention, whereas overly stringent approaches degrade estimate precision.
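Of the four distillation methods, iterative nullspace projection rests on a simple linear-algebra step, sketched below. This is only the single projection step, under the assumption that a treatment-predictive direction has already been obtained (for example, as the weight vector of a linear treatment classifier over text embeddings). The helper names and the vectors used are hypothetical, and the direction must be nonzero.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(vector: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `vector` that lies along `direction`.

    The result is orthogonal to `direction`, so a linear model using
    that direction can no longer separate the projected vectors.
    """
    scale = dot(vector, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vector, direction)]
```

In an iterative scheme, one would retrain the treatment classifier on the projected embeddings and repeat until treatment is no longer linearly predictable. How many rounds preserve enough confounder information is the trade-off the abstract describes.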


cs.IR

[82] FUSE: Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation cs.IR | cs.AI | cs.CL | cs.LG

Tushar Vatsa, Vibha Belavadi, Priya Shanmugasundaram, Suhas Suresha, Dewang Sultania

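TL;DR: This paper presents FUSE, a framework for multimodal search and recommendation that replaces most raw-image prompting with a compact Grounded Design Representation (GDR) and uses seven context budgeting strategies to optimize system performance.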

Details

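Motivation: Retrieval quality in multimodal creative assistants degrades when any of several stages fails: understanding user intent, choosing content types, recalling candidates, or ranking results. Processing raw images is also costly, which the work aims to reduce.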

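Result: Across 788 evaluation queries, the Context Compression strategy performed best at every pipeline stage: 93.3% intent accuracy, 86.8% routing success (with fallbacks), 99.4% recall, and 88.5% NDCG@5.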

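Insight: The contributions include using GDR as a compact representation to cut computational overhead, and a pipeline attribution layer that monitors system performance. Viewed objectively, strategic context compression outperforms both comprehensive and minimal context handling, giving an efficient design option for multimodal systems.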

Abstract: Multimodal creative assistants decompose user goals and route tasks to subagents for layout, styling, retrieval, and generation. Retrieval quality is pivotal, yet failures can arise at several stages: understanding user intent, choosing content types, finding candidates (recall), or ranking results. Meanwhile, sending and processing images is costly, making naive multimodal approaches impractical. We present FUSE: Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation. FUSE replaces most raw-image prompting with a compact Grounded Design Representation (GDR): a selection-aware JSON of canvas elements (image, text, shape, icon, video, logo), structure, styles, salient colors, and user selection provided by the Planner team. FUSE implements seven context budgeting strategies: comprehensive baseline prompting, context compression, chain-of-thought reasoning, mini-shot optimization, retrieval-augmented context, two-stage processing, and zero-shot minimalism. Finally, a pipeline attribution layer monitors system performance by converting subagent signals into simple checks: intent alignment, content-type/routing sanity, recall health (e.g., zero-hit and top-match strength), and ranking displacement analysis. We evaluate the seven context budgeting variants across 788 evaluation queries from diverse users and design templates (see Figure 3). Our systematic evaluation reveals that Context Compression achieves optimal performance across all pipeline stages, with 93.3% intent accuracy, 86.8% routing success (with fallbacks), 99.4% recall, and 88.5% NDCG@5. This approach demonstrates that strategic context summarization outperforms both comprehensive and minimal contextualization strategies.
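For readers unfamiliar with the ranking metric reported above, the following is a minimal sketch of NDCG@k. The abstract does not say which gain function or ideal ordering the authors used. This version assumes linear gain, a log2 position discount, and an ideal ranking formed by sorting the same relevance list.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([3, 2, 1], 5)` is 1.0 because the list is already in ideal order, while placing the only relevant item second, as in `ndcg_at_k([0, 1], 5)`, lowers the score.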


cs.NI

[83] Multi-Modal Data-Enhanced Foundation Models for Prediction and Control in Wireless Networks: A Survey cs.NI | cs.AI | cs.CL | cs.CV

Han Zhang, Mohammad Farzanullah, Mohammad Ghassemi, Akram Bin Sediq, Ali Afana

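TL;DR: This survey covers multi-modal data-enhanced foundation models for prediction and control in wireless networks. It discusses how foundation models integrate multi-modal data to make wireless network management more intelligent, and reviews relevant datasets, methodologies, challenges, and future directions.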

Details

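Motivation: Foundation models are regarded as a breakthrough technology reshaping the future of AI. Integrating them into wireless networks aims to produce general-purpose AI agents that can handle diverse network management requests and complex multi-modal tasks, making network management more intelligent and automated.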

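Result: As a survey, the paper reports no experimental data of its own. It systematically reviews how foundation models are applied to prediction and control tasks in wireless networks, the available datasets, and development methodologies, providing a comprehensive reference framework for related research.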

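Insight: The novelty lies in closely combining multi-modal foundation models with wireless network management, emphasizing the key role of contextual information understanding in prediction and control tasks, and outlining dataset and methodology paths for developing wireless-specific foundation models, which offers new ideas for cross-modal intelligent network management.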

Abstract: Foundation models (FMs) are recognized as a transformative breakthrough that has started to reshape the future of artificial intelligence (AI) across both academia and industry. The integration of FMs into wireless networks is expected to enable the development of general-purpose AI agents capable of handling diverse network management requests and highly complex wireless-related tasks involving multi-modal data. Inspired by these ideas, this work discusses the utilization of FMs, especially multi-modal FMs in wireless networks. We focus on two important types of tasks in wireless network management: prediction tasks and control tasks. In particular, we first discuss FMs-enabled multi-modal contextual information understanding in wireless networks. Then, we explain how FMs can be applied to prediction and control tasks, respectively. Following this, we introduce the development of wireless-specific FMs from two perspectives: available datasets for development and the methodologies used. Finally, we conclude with a discussion of the challenges and future directions for FM-enhanced wireless networks.


cs.AI

[84] Time-Scaling Is What Agents Need Now cs.AI | cs.CL

Zhi Liu, Guangzhi Wang

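TL;DR: This paper proposes "Time-Scaling" as a critical frontier for improving agents' deep reasoning and problem solving. It argues that early AI paradigms (neural networks, reinforcement learning, symbolic AI) are converging, under Transformer-based large models and world models, into cognitive agents with closed-loop "perception-decision-action" capabilities. Humans solve complex problems under limited cognitive resources through temporalized sequential reasoning, whereas current LLM prompting techniques such as Chain-of-Thought and Tree-of-Thought are limited in search completeness and efficiency. The paper therefore calls for systematically extending and optimizing an agent's ability to unfold reasoning over time, using architectures that exploit extended temporal pathways to enable deeper problem-space exploration, dynamic strategy adjustment, and enhanced metacognitive control without proportional increases in static model parameters.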

Details

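Motivation: Existing large language models, including those prompted with Chain-of-Thought or Tree-of-Thought, are limited in search completeness and efficiency for deep semantic reasoning, and agents solving complex problems under limited cognitive resources need temporalized sequential reasoning.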

Result: The abstract reports no quantitative experimental or benchmark results.

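Insight: The paper's claimed core contribution is the concept of "Time-Scaling", positioned as a critical frontier and foundational principle for enhancing deep reasoning and problem solving. Viewed objectively, its novelty is a systematic account of the time dimension as a core element of agent architecture design: extending temporal pathways, rather than simply adding model parameters, to mimic human sequential reasoning under cognitive constraints and so achieve deeper problem exploration and dynamic strategy adjustment. This gives a new theoretical direction for designing next-generation agents.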

Abstract: Early artificial intelligence paradigms exhibited separated cognitive functions: Neural Networks focused on “perception-representation,” Reinforcement Learning on “decision-making-behavior,” and Symbolic AI on “knowledge-reasoning.” With Transformer-based large models and world models, these paradigms are converging into cognitive agents with closed-loop “perception-decision-action” capabilities. Humans solve complex problems under limited cognitive resources through temporalized sequential reasoning. Language relies on problem space search for deep semantic reasoning. While early large language models (LLMs) could generate fluent text, they lacked robust semantic reasoning capabilities. Prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) extended reasoning paths by making intermediate steps explicit. Recent models like DeepSeek-R1 enhanced performance through explicit reasoning trajectories. However, these methods have limitations in search completeness and efficiency. This highlights the need for “Time-Scaling”: the systematic extension and optimization of an agent’s ability to unfold reasoning over time. Time-Scaling refers to architectural design utilizing extended temporal pathways, enabling deeper problem space exploration, dynamic strategy adjustment, and enhanced metacognitive control, paralleling human sequential reasoning under cognitive constraints. It represents a critical frontier for enhancing deep reasoning and problem-solving without proportional increases in static model parameters. Advancing intelligent agent capabilities requires placing Time-Scaling principles at the forefront, positioning explicit temporal reasoning management as foundational.


[85] ReTreVal: Reasoning Tree with Validation – A Hybrid Framework for Enhanced LLM Multi-Step Reasoning cs.AI | cs.CL

Abhishek HS, Pavan C Shekar, Arpit Jain, Ashwanth Krishnan

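TL;DR: This paper proposes ReTreVal (Reasoning Tree with Validation), a hybrid framework that combines Tree-of-Thoughts exploration, self-refinement, LLM-based critique scoring, and reflexion memory to strengthen the multi-step reasoning of large language models in complex domains such as mathematics and creative writing. It achieves bounded and validated reasoning by building a structured reasoning tree, applying node-level iterative refinement with dual validation, and learning across problems through memory.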

Details

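Motivation: Existing methods such as ReAct, Reflexion, and Self-Refine lack structured exploration of alternative solution paths and persistent learning across problems in multi-step reasoning.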

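Result: In an evaluation on 500 mathematical problems and creative writing tasks with Qwen 2.5 7B, ReTreVal consistently outperformed ReAct, Reflexion, and Self-Refine through its combination of structured exploration, critique-driven refinement, and cross-problem memory.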

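Insight: The novelty lies in combining the structured exploration of Tree-of-Thoughts with node-level refinement and validation driven by LLM critique, and in transferring knowledge across problems through a reflexion memory buffer. Its adaptive depth, dual validation mechanism, and critique-based pruning improve reasoning quality and robustness while controlling computational cost.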

Abstract: Multi-step reasoning remains a key challenge for Large Language Models (LLMs), particularly in complex domains such as mathematics and creative writing. While recent approaches including ReAct, Reflexion, and Self-Refine improve reasoning through iterative refinement and reflection, they often lack structured exploration of alternative solution paths and persistent learning across problems. We propose ReTreVal (Reasoning Tree with Validation), a hybrid framework that integrates Tree-of-Thoughts exploration, self-refinement, LLM-based critique scoring, and reflexion memory to enable bounded and validated multi-step reasoning. ReTreVal constructs a structured reasoning tree with adaptive depth based on problem complexity, where each node undergoes iterative self-critique and refinement guided by explicit LLM-generated feedback. A dual validation mechanism evaluates reasoning quality, coherence, and correctness at each node while persistently storing insights from successful reasoning paths and failure patterns in a reflexion memory buffer, enabling cross-problem learning. Critique-based pruning retains only the top-k highest-scoring nodes at each level, controlling computational cost while preserving high-quality solution paths. We evaluate ReTreVal against ReAct, Reflexion, and Self-Refine across 500 mathematical problems and creative writing tasks using Qwen 2.5 7B as the underlying LLM, and demonstrate that ReTreVal consistently outperforms existing methods through its combination of structured exploration, critique-driven refinement, and cross-problem memory, making it particularly effective for tasks requiring exploratory reasoning, rigorous verification, and knowledge transfer.
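The critique-based pruning step described in the abstract (keep only the top-k highest-scoring nodes at each level) can be illustrated with a small level-wise search. This sketch covers pruning only and omits self-refinement, dual validation, and the reflexion memory. The `expand` and `score` callables are hypothetical stand-ins for LLM-generated child thoughts and LLM critique scores.

```python
import heapq
from typing import Callable, List


def pruned_tree_search(root: str,
                       expand: Callable[[str], List[str]],
                       score: Callable[[str], float],
                       top_k: int,
                       max_depth: int) -> str:
    """Expand a reasoning tree level by level, keeping top_k nodes per level.

    Returns the highest-scoring node seen at any level.
    """
    frontier = [root]
    best = root
    for _ in range(max_depth):
        # Generate every child of every surviving node at this level.
        children = [child for node in frontier for child in expand(node)]
        if not children:
            break
        # Pruning: only the top_k best-scored children survive.
        frontier = heapq.nlargest(top_k, children, key=score)
        best = max([best] + frontier, key=score)
    return best
```

Because each level holds at most `top_k` nodes, the number of `expand` calls grows linearly with depth rather than exponentially, which is how pruning bounds computational cost.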


[86] Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning cs.AI | cs.CL | cs.LO

Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang

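TL;DR: This paper systematically analyzes how large language models perform on logical reasoning tasks and identifies a phenomenon called "Logical Phase Transitions": once logical complexity exceeds a critical depth, reasoning performance collapses abruptly rather than degrading smoothly. Building on this, the authors propose Neuro-Symbolic Curriculum Tuning, which adaptively aligns natural-language and logical-symbol representations and reshapes training dynamics around the phase-transition boundary to progressively strengthen reasoning at greater logical depths.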

Details

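Motivation: Symbolic logical reasoning is a key capability for reliable and verifiable LLM decision-making in high-stakes domains such as mathematical reasoning and legal judgment, yet it remains underexplored. The paper investigates the abrupt collapse of LLM performance in logical reasoning and how to address it.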

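Result: Experiments on five benchmarks show that the method effectively mitigates logical reasoning collapse at high complexity, with average accuracy gains of +1.26 under naive prompting and +3.95 under Chain-of-Thought prompting, while improving generalization to unseen logical compositions.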

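Insight: The novelty lies in being the first to reveal "Logical Phase Transitions" in LLM logical reasoning and in proposing Neuro-Symbolic Curriculum Tuning, which builds a shared representation and uses targeted training to make models more robust to complex logic, offering a new approach to improving LLM symbolic reasoning.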

Abstract: Symbolic logical reasoning is a critical yet underexplored capability of large language models (LLMs), providing reliable and verifiable decision-making in high-stakes domains such as mathematical reasoning and legal judgment. In this study, we present a systematic analysis of logical reasoning under controlled increases in logical complexity, and reveal a previously unrecognized phenomenon, which we term Logical Phase Transitions: rather than degrading smoothly, logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions such as water freezing beyond a critical temperature threshold. Building on this insight, we propose Neuro-Symbolic Curriculum Tuning, a principled framework that adaptively aligns natural language with logical symbols to establish a shared representation, and reshapes training dynamics around phase-transition boundaries to progressively strengthen reasoning at increasing logical depths. Experiments on five benchmarks show that our approach effectively mitigates logical reasoning collapse at high complexity, yielding average accuracy gains of +1.26 in naive prompting and +3.95 in CoT, while improving generalization to unseen logical compositions. Code and data are available at https://github.com/AI4SS/Logical-Phase-Transitions.


cs.SD

[87] Omni2Sound: Towards Unified Video-Text-to-Audio Generation cs.SD | cs.CV | cs.MM

Yusheng Dai, Zehua Chen, Yuxuan Jiang, Baolong Gao, Qiuhong Ke

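TL;DR: This paper proposes Omni2Sound, a unified diffusion model supporting video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation. To address the scarcity of high-quality multimodally aligned data, the authors first build a large-scale dataset, SoundAtlas. To handle cross-task and intra-task competition during training, they design a multi-stage progressive training strategy. With a standard DiT backbone, the model achieves state-of-the-art performance on all three tasks on the unified VGGSound-Omni benchmark.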

Details

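Motivation: Training a unified model that integrates V2A, T2A, and VT2A generation offers significant application flexibility but faces two foundational challenges. First, high-quality audio captions with tight audio-visual-text (A-V-T) alignment are scarce, which leads to semantic conflict between multimodal conditions. Second, there is cross-task competition (a V2A versus T2A performance trade-off) and intra-task competition (modality bias in the VT2A task).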

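Result: On the newly constructed VGGSound-Omni unified benchmark, which includes challenging off-screen audio generation tracks, Omni2Sound with a standard DiT backbone achieves state-of-the-art (SOTA) performance on V2A, T2A, and VT2A, and generalizes strongly across benchmarks with heterogeneous input conditions.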

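Insight: The main contributions are: 1) SoundAtlas, a high-quality, large-scale dataset with tight audio-visual-text alignment, built with a novel agentic pipeline that integrates Vision-to-Language Compression, a Junior-Senior Agent Handoff, and rigorous Post-hoc Filtering; and 2) a three-stage multi-task progressive training strategy that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, so that a single model maintains both audio-visual alignment and faithful off-screen audio generation.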

Abstract: Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight A-V-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5 times cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.


cs.LG

[88] WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks cs.LG | cs.CV

Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead

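TL;DR: This paper presents WebGym, the largest open-source environment to date for training realistic visual web agents. It contains nearly 300,000 tasks across diverse real-world websites and difficulty levels, with rubric-based evaluation. A high-throughput asynchronous rollout sampling system designed for web agents yields a 4-5x sampling speedup. Fine-tuning the base vision-language model Qwen-3-VL-8B-Instruct on WebGym raises its success rate on tasks from unseen websites from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking.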

Details

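Motivation: Existing training environments for web agents are small, artificial, or insufficiently diverse, which is inadequate for robust policy learning. Real websites are non-stationary and diverse, so training agents requires a large-scale, realistic task set.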

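Result: On an out-of-distribution test set made up of tasks on websites never seen during training, the fine-tuned agent's success rate rises from 26.2% to 42.9%, significantly outperforming agents based on GPT-4o (27.1%) and GPT-5-Thinking (29.8%).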

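Insight: The main contributions are a large-scale, diverse environment of realistic web tasks (WebGym) and a high-throughput asynchronous rollout sampling system that speeds up reinforcement learning training. Viewed objectively, its emphasis on evaluating generalization to entirely unseen websites gives web agent research a stricter benchmark and a scalable training framework.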

Abstract: We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent’s own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we first speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.
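The abstract does not describe how the asynchronous rollout system is built. As a generic illustration of why asynchrony helps when rollouts are dominated by page latency, here is a small `asyncio` sketch. Everything in it is hypothetical: `run_rollout` fakes an episode with a sleep and a made-up reward, and the concurrency limit is arbitrary.

```python
import asyncio
from typing import Dict, Iterable, List


async def run_rollout(task_id: int, latency: float) -> Dict[str, float]:
    """Stand-in for one agent episode; the sleep mimics waiting on a web page."""
    await asyncio.sleep(latency)
    reward = 1.0 if task_id % 2 == 0 else 0.0  # fake rubric score
    return {"task": task_id, "reward": reward}


async def collect_rollouts(task_ids: Iterable[int],
                           max_concurrent: int = 8,
                           latency: float = 0.01) -> List[Dict[str, float]]:
    """Run rollouts concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(task_id: int) -> Dict[str, float]:
        async with semaphore:
            return await run_rollout(task_id, latency)

    # gather returns results in the order the tasks were submitted.
    return await asyncio.gather(*(bounded(t) for t in task_ids))
```

A call such as `asyncio.run(collect_rollouts(range(20)))` overlaps the waiting time of up to eight episodes, whereas a sequential loop would pay each latency in turn.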


[89] Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models cs.LG | cs.CL

Seunghwan Jang, SooJean Han

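TL;DR: This paper proposes Stratified Hazard Sampling (SHS), an inference method for discrete generative models based on continuous-time or discrete-time Markov chains (CTMC/DTMC), such as uniform-noise discrete diffusion and discrete flow matching. It models each position's edit events as driven by cumulative hazard or cumulative jump mass and schedules them by stratification. This preserves the expected number of edits while reducing the variance of the edit count to the theoretical minimum, addressing the under-editing and over-editing caused by conventional step-based independent Bernoulli sampling.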

Details

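Motivation: Conventional step-based inference, such as independent Bernoulli sampling, makes each position's edit decision independently. Under uniform-noise initialization this produces high variance in the number and timing of edits, causing characteristic failure modes such as under-editing (residual noise) or over-editing (cascading unnecessary substitutions) and reducing reproducibility.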

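Result: SHS keeps the expected number of edits unchanged while achieving the minimum possible variance among unbiased integer estimators (bounded by 1/4). It does not alter per-jump destination sampling, so multimodality is retained.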

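Insight: The novelty lies in changing edit-event scheduling from independent per-step decisions to stratified sampling over cumulative hazard, a drop-in and hyperparameter-free approach that substantially reduces variance. A variant for blacklist-style lexical constraints prioritizes early edits at high-risk positions, mitigating late-masking artifacts.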

Abstract: CTMC/DTMC-based discrete generative models, including uniform-noise discrete diffusion (e.g., D3PM/CTDD) and discrete flow matching, enable non-autoregressive sequence generation by repeatedly replacing tokens through a time-inhomogeneous Markov process. Inference is typically implemented with step-based simulation: each token decides to jump via independent Bernoulli (or categorical) draws at every discretization step. Under uniform-noise initialization, where self-correction requires multiple edits per position, these independent decisions induce substantial variance in both the number and timing of edits, leading to characteristic failure modes such as under-editing (residual noise) or over-editing (cascading unnecessary substitutions), decreasing reproducibility. We propose Stratified Hazard Sampling (SHS), a drop-in and hyperparameter-free inference principle for any sampler that admits a stay-vs.-replace decomposition. SHS models per-token edits as events driven by cumulative hazard (CTMC) or cumulative jump mass (DTMC) and places events by stratifying this cumulative quantity: with a single random phase per position, a token jumps whenever its accumulated hazard crosses unit-spaced thresholds. This preserves the expected number of jumps while achieving the minimum possible variance among unbiased integer estimators (bounded by 1/4), without altering per-jump destination sampling and thus retaining multimodality. We also introduce a phase-allocation variant for blacklist-style lexical constraints that prioritizes early edits at high-risk positions to mitigate late-masking artifacts.
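The threshold rule in the abstract (one random phase per position, a jump whenever accumulated hazard crosses unit-spaced thresholds) can be sketched for a single position as follows. This is a reading of the abstract, not the authors' code. It assumes thresholds at `k - phase` for `k = 1, 2, ...`, so the total number of jumps equals `floor(cumulative_hazard + phase)`. The function name and the per-step hazard values are hypothetical.

```python
import math
from typing import List, Sequence


def shs_jump_counts(hazard_increments: Sequence[float], phase: float) -> List[int]:
    """Number of jumps fired at each discretization step for one position.

    hazard_increments: nonnegative hazard mass added at each step.
    phase: a single draw from Uniform[0, 1), fixed for the whole trajectory.
    """
    counts = []
    cumulative = 0.0
    fired = 0
    for increment in hazard_increments:
        cumulative += increment
        # Thresholds crossed so far; a large increment may cross more than one.
        total = math.floor(cumulative + phase)
        counts.append(total - fired)
        fired = total
    return counts
```

With total hazard 2.4, for instance, the jump count is 2 when the phase is below 0.6 and 3 otherwise. Its mean over a uniform phase is therefore 2.4 and its variance is 0.6 × 0.4 = 0.24, which is within the 1/4 bound quoted in the abstract.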


[90] ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs Reasoning cs.LG | cs.CL

Tuc Nguyen, Thai Le

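TL;DR: ATLAS is an adaptive test-time latent steering framework that uses an external, lightweight verifier to dynamically control adjustments to an LLM's internal representations during reasoning, improving reasoning accuracy and efficiency.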

Details

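Motivation: Existing latent steering methods use fixed steering policies and static intervention strengths, which limits robustness across problem instances and often leads to over- or under-steering.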

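Result: On multiple mathematical reasoning benchmarks, ATLAS achieves higher accuracy than vanilla decoding and fixed steering baselines while substantially reducing test-time token usage.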

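Insight: According to the authors, ATLAS is the first method to integrate learned latent verification into test-time steering, enabling per-example and per-step adaptive adjustment and providing a scalable mechanism for controlling reasoning efficiency.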

Abstract: Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without additional training. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose Adaptive Test-time Latent Steering (ATLAS), a task-specific framework that dynamically controls steering decisions at inference time using an external, lightweight latent verifier. Given intermediate hidden states, the verifier predicts the quality of ongoing reasoning and adaptively selects whether and how strongly to apply steering, enabling per-example and per-step adjustment with minimal overhead. To our knowledge, ATLAS is the first method to integrate learned latent verification into test-time steering for enhancing LLM reasoning. Experiments on multiple mathematical reasoning benchmarks show that ATLAS consistently outperforms both vanilla decoding and fixed steering baselines, achieving higher accuracy while substantially reducing test-time token usage. These results demonstrate that verifier-guided latent adaptation provides an effective and scalable mechanism for controlling reasoning efficiency without sacrificing solution quality. All source code will be publicly available.
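The abstract's description of a verifier that decides "whether and how strongly" to steer can be sketched as below. The specifics are assumptions made for illustration, not details from the paper: the verifier is a logistic probe on the hidden state, steering is additive along a fixed direction, and the strength ramps linearly as predicted quality falls below a threshold.

```python
import math
from typing import List, Sequence, Tuple


def verifier_quality(hidden: Sequence[float],
                     weights: Sequence[float],
                     bias: float) -> float:
    """Hypothetical lightweight verifier: a logistic probe on the hidden state."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def adaptive_steer(hidden: Sequence[float],
                   steer_dir: Sequence[float],
                   weights: Sequence[float],
                   bias: float,
                   threshold: float = 0.5,
                   max_strength: float = 4.0) -> Tuple[List[float], float]:
    """Steer only when predicted quality is low, scaling strength with the gap."""
    quality = verifier_quality(hidden, weights, bias)
    if quality >= threshold:
        return list(hidden), 0.0  # reasoning looks fine: leave the state alone
    strength = max_strength * (threshold - quality) / threshold
    steered = [h + strength * d for h, d in zip(hidden, steer_dir)]
    return steered, strength
```

A fixed-strength baseline would correspond to always returning `hidden + c * steer_dir` for a constant `c`; the sketch differs only in making both the decision and the magnitude depend on the verifier's output at each step.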


[91] One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling cs.LG | cs.CL

Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li

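TL;DR: This paper proposes a framework called "polymath learning" that challenges the conventional assumption that reinforcement learning requires many high-quality samples. It shows that a single, carefully designed and selected math reasoning sample can significantly improve LLM reasoning across multiple domains, including physics, chemistry, and biology, outperforming training on larger datasets.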

Details

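Motivation: Existing RL-based methods for improving LLM reasoning typically rely on thousands of high-quality samples or more. The paper challenges this basic assumption about data requirements and explores extreme sample efficiency: whether a single sample can produce broad and significant performance gains.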

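Result: The method outperforms training with larger datasets on multiple reasoning benchmarks, indicating that sample quality and design are key. Specifically, an engineered synthetic sample that integrates multidisciplinary elements trains better than individual naturally occurring samples.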

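Insight: The core contribution is the concept of "sample engineering", which emphasizes careful, precise engineering of training samples rather than simply adding more data. The study shows the importance of math skills for cross-domain reasoning and characterizes the optimal "polymath sample", offering a new paradigm for efficiently using RL to unlock LLM capabilities.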

Abstract: The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on high-quality samples of thousands or beyond. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology with RL; (2) The math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) An engineered synthetic sample that integrates multidiscipline elements outperforms training with individual samples that naturally occur. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlock enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed as sample engineering, toward precision engineering of training samples rather than simply increasing data volume.