cs.CL [Total: 42]
cs.CV [Total: 48]
cs.GR [Total: 1]
cs.AI [Total: 16]
cs.CR [Total: 1]
cs.LG [Total: 8]
cs.RO [Total: 2]
cs.CY [Total: 1]
cs.SE [Total: 2]
eess.IV [Total: 1]

cs.CL

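[1] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation cs.CL | cs.IR

Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald

TL;DR: This paper presents RAGVUE, a diagnostic, explainable framework for automated evaluation of retrieval-augmented generation (RAG) systems. It decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration, attaches a structured explanation to each metric, and supports both manual metric selection and fully automated agentic evaluation.

Details

Motivation: Existing RAG metrics usually collapse heterogeneous behaviors into a single score, which makes it hard to tell whether an error stems from retrieval, reasoning, or grounding. A finer-grained, explainable diagnostic evaluation is therefore needed.

Result: In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook, and the authors show how it can be integrated into research pipelines and practical RAG development.

Insight: The novelty lies in decomposing RAG evaluation into several interpretable diagnostic dimensions, with structured explanations and a choice between automated and manual evaluation, which makes the evaluation process more transparent and actionable.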

Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.

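[2] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems cs.CL | cs.AI | cs.CY | cs.HC

Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker

TL;DR: This paper proposes Collective Narrative Grounding, a participatory protocol that addresses the "knowledge blind spots" large language models show on community-specific queries. The protocol turns community stories into structured narrative units and integrates them into AI systems under community governance to improve local question answering.

Details

Motivation: LLM question-answering systems fail on community-specific queries. These failures create "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice.

Result: In a county-level benchmark of 14,782 local-information QA pairs, factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a QA set derived from the participatory workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context.

Insight: The contribution is a participatory protocol and structured schema that turn rich community narratives into data supporting entity, time, and place extraction, validation, and provenance control. The paper also highlights key design tensions in building community-grounded AI: representation and power, governance and control, and privacy and consent.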

Abstract: Large language model (LLM) question-answering systems often fail on community-specific queries, creating “knowledge blind spots” that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.

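[3] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation cs.CL | cs.AI | cs.LG

Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed

TL;DR: This paper introduces TeleTables, a benchmark for evaluating how well large language models understand tables in the telecom domain. A multi-stage data generation pipeline extracts tables from 3GPP standards and produces 500 human-verified question-answer pairs that test both a model's implicit knowledge of the tables and its explicit ability to interpret them. The evaluation finds that small models do poorly on both knowledge recall and table interpretation, while larger models reason more strongly, underscoring the need for domain-specialized fine-tuning.

Details

Motivation: LLMs perform poorly on telecom standards such as 3GPP specifications. The authors argue that a key reason is that these standards are dense with tables, and that LLMs' knowledge of such tables and ability to interpret them have not been adequately evaluated, so a dedicated benchmark is needed.

Result: On TeleTables, small models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables. Larger models show stronger table reasoning, but domain-specialized fine-tuning is still needed to handle telecom standards reliably.

Insight: The novelty is a benchmark, described as the first for table understanding in the telecom domain, together with a multi-stage pipeline that uses multimodal and reasoning-oriented LLMs to generate and validate questions automatically. This offers a new way to evaluate and improve how models handle technical documents in specialized domains.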

Abstract: Large Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang

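TL;DR: This paper presents FronTalk, a benchmark for front-end code generation that focuses on conversational code generation with multi-modal feedback (textual and visual instructions). It contains 100 multi-turn dialogues derived from real-world websites in domains such as news, finance, and art. The authors also propose an agent-based evaluation framework that simulates users browsing the site to assess functional correctness and user experience. Evaluating 20 models reveals two key challenges: models tend to forget previously implemented features, and open-source vision-language models struggle to interpret visual feedback. The authors propose AceCoder, a baseline in which an autonomous web agent critiques the implementation of past instructions, which sharply reduces forgetting and improves performance.

Details

Motivation: In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. The paper addresses this gap by studying the interaction dynamics of conversational code generation with multi-modal feedback.

Result: Evaluating 20 models on FronTalk reveals a significant forgetting issue (models overwrite previously implemented features, causing task failures) and a persistent difficulty in interpreting visual feedback. The proposed AceCoder baseline reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (from 56.0% to 65.3%).

Insight: The contributions are: a first systematic study of conversational code generation with multi-modal feedback in front-end development; a multi-turn dialogue benchmark with equivalent textual and visual instructions; an agent-based evaluation framework covering both functional correctness and user experience; and a baseline in which an autonomous web agent critiques past implementations, which mitigates the performance decline over multi-turn interaction. Viewed objectively, the work underlines the importance of multi-modal interaction in code generation and gives future research a practical evaluation tool and a starting point for solutions.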

Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk

Wei Xia, Haowen Tang, Luozheng Li

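TL;DR: This paper proposes a lightweight linear probe that quantifies the systematic misalignment between LLMs' internal representation of political ideology and human ideological space, and aligns the model with a specific user's views by directly adjusting output-layer probabilities. The method avoids retraining and achieves low-cost, efficient ideological calibration while preserving the model's original reasoning ability.

Details

Motivation: LLMs' internal representations of political ideology are systematically misaligned with human ideological space. The misalignment is model-specific and measurable, which calls for a lightweight way to align a model with the ideology of a specific annotator.

Result: A linear probe quantifies each model's degree of misalignment, and direct logit adjustment is applied in a social media analysis task. The method calibrates the model to a specific annotator's ideology while preserving its original performance.

Insight: The novelty lies in identifying the systematic pattern of misalignment in LLMs' ideological representations and in proposing a lightweight alignment method that adjusts output probabilities directly from a bias score computed on internal features. Because no parameters are updated, it offers an efficient route to value alignment.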

Abstract: LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully, aligned with human ideological space. This misalignment is systematic, model-specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculate a bias score from its internal features and directly adjust the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.
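The idea of correcting only the output layer can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the probe weights, the three-class label order, the `direction` vector, and the `strength` parameter are all hypothetical choices made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probe_bias_score(hidden, weights, intercept=0.0):
    """Linear probe: a scalar bias score read off the model's internal features."""
    return sum(h * w for h, w in zip(hidden, weights)) + intercept

def adjust_logits(logits, bias_score, direction, strength=1.0):
    """Shift the output logits against the measured bias; no model weights change."""
    return [z - strength * bias_score * d for z, d in zip(logits, direction)]

# Hypothetical values: a 3-dimensional feature vector and classes (left, center, right).
hidden = [0.5, -1.0, 2.0]
probe_weights = [0.2, 0.1, 0.4]
logits = [2.0, 1.0, 0.0]
direction = [1.0, 0.0, -1.0]  # assumed axis along which the bias acts

score = probe_bias_score(hidden, probe_weights)
adjusted = adjust_logits(logits, score, direction)
probs_before = softmax(logits)
probs_after = softmax(adjusted)
```

In this toy setting a positive score lowers the probability of the first class and raises that of the last, while the underlying model is left untouched.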

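[6] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach cs.CL | cs.AI

Xiang Cheng, Wen Wang, Anindya Ghose

TL;DR: This paper proposes LEXMA, a framework that fine-tunes large language models with reinforcement learning to generate narrative explanations of business decisions tailored to different audiences, improving the transparency and explainability of AI decisions.

Details

Motivation: Existing explainable-AI techniques rely on post hoc numerical feature attributions and cannot provide coherent narrative explanations. Generated explanations must also be decision-correct, faithful, adaptable to multiple audiences, and trainable in a label-efficient way.

Result: In a mortgage approval setting, LEXMA significantly improves predictive performance over other LLM baselines. Human evaluations show that its expert-facing explanations are more risk-focused, and its consumer-facing explanations are clearer, more actionable, and more polite.

Insight: The novelty is a reinforcement learning framework that combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). It fine-tunes separate parameter sets for decision correctness and for the stylistic requirements of different audiences, without relying on human-annotated explanations.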

Abstract: Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.

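[7] Complexity Agnostic Recursive Decomposition of Thoughts cs.CL | cs.AI | cs.IT

Kaleem Ullah Qasim, Jiashu Zhang, Hafiz Saif Ur Rehman

TL;DR: This paper proposes CARD, a framework that predicts problem complexity and adapts the decomposition of reasoning steps accordingly, addressing failures of large language models on complex multi-step reasoning caused by fixed reasoning strategies. CARD consists of a complexity estimator, MRCE, and a two-stage recursive solver, and achieves higher accuracy and substantial efficiency gains on GSM8K and MATH-500.

Details

Motivation: On multi-step reasoning problems, large language models often underperform because they apply a fixed reasoning strategy that ignores the difficulty of the problem at hand. The paper addresses this by predicting problem complexity and adjusting the decomposition strategy dynamically.

Result: On GSM8K, CARD achieves 81.4% to 89.2% accuracy across three reasoning models while reducing token cost by 1.88x to 2.40x. On MATH-500, it reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens.

Insight: The novelty is complexity-agnostic recursive decomposition: estimating problem complexity in advance in order to plan reasoning steps and allocate resources dynamically, which improves both accuracy and efficiency. This suggests a new direction for designing adaptive reasoning strategies.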

Abstract: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem-specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text, and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.

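[8] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation cs.CL

Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin

TL;DR: RIGOURATE is a two-stage multimodal framework for quantifying overstatement in scientific papers. It retrieves supporting evidence from the body of a paper and assigns each claim an overstatement score that reflects how well the claim matches the evidence. The framework builds on a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation.

Details

Motivation: Scientific rigour is often sidelined, and authors tend to overstate claims beyond what their results support. The paper aims to promote clearer, more transparent scientific communication.

Result: Compared with strong baselines, RIGOURATE improves both evidence retrieval and overstatement detection, using a fine-tuned reranker and a fine-tuned prediction model.

Insight: The contributions include operationalising evidential proportionality, a multimodal framework that combines LLM annotation with peer-review calibration, and a large-scale claim-evidence dataset, together providing a quantitative tool for assessing scientific integrity.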

Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper’s body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.

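[9] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking cs.CL | cs.AI

Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho

TL;DR: This paper presents MiJaBench, a bilingual adversarial benchmark that exposes selective safety vulnerabilities in large language models across different minority groups. Analyzing 528,000 prompt-response pairs generated by 12 state-of-the-art models, the study finds that safety alignment is not a generalized semantic capability but follows a demographic hierarchy, and that scaling up models widens these disparities.

Details

Motivation: Current LLM safety evaluations aggregate "Identity Hate" into scalar scores, which masks systemic vulnerabilities against specific groups. The paper aims to expose this selective safety.

Result: On MiJaBench, defense rates within the same model fluctuate by up to 33% depending on the target group, showing that safety alignment differs markedly across groups, and model scaling exacerbates these disparities.

Insight: The novelty is a fine-grained, multilingual adversarial benchmark that quantifies LLM bias against minority groups. The results challenge the assumed universality of current safety alignment and its scaling laws, and point to the need for alignment research at the level of specific demographics.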

Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating “Identity Hate” into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create a principle of non-discrimination but reinforce memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts on GitHub to encourage research into granular demographic alignment.

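[10] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs cs.CL | cs.AI | cs.CY

Myra Cheng, Robert D. Hawkins, Dan Jurafsky

TL;DR: This paper analyzes, from the perspective of pragmatics, why large language models (LLMs) struggle to challenge users' harmful beliefs, arguing that models default to accommodating users' assumptions and lack epistemic vigilance. It finds that social and linguistic factors that influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) affect LLMs in the same way, and shows that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on safety benchmarks.

Details

Motivation: LLMs frequently fail to challenge users' harmful beliefs in domains such as medical advice and social reasoning. The paper investigates the mechanism behind this safety problem from a pragmatic perspective.

Result: Experiments on three safety benchmarks (Cancer-Myth, SAGE-Eval, ELEPHANT) confirm that pragmatic factors affect model performance. Simple interventions significantly improve performance while keeping false-positive rates low.

Insight: The paper applies pragmatic theory (accommodation and epistemic vigilance) to the analysis of LLM safety, shows that model behavior is shaped by the mechanisms of human conversation, and proposes low-cost, effective pragmatic interventions to improve safety.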

Abstract: Large language models (LLMs) frequently fail to challenge users’ harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users’ assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models’ ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase “wait a minute”, significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.

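[11] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation cs.CL

Yuxiao Ye, Yiming Zhang, Yiran Ma, Huiyuan Xie, Huining Zhu

TL;DR: This paper proposes LinguaGame, a paradigm for multi-agent dialogue generation grounded in linguistics and game theory. It models dialogue as a signalling game over intents and strategies and uses a training-free equilibrium approximation algorithm to adjust decisions at inference time, with the goal of improving agents' communication efficiency in complex tasks such as simulated courtroom debates.

Details

Motivation: Existing LLM-based multi-agent systems focus mainly on architecture design, such as role assignment and workflow orchestration. This paper targets the interaction process itself, helping agents convey their intended meaning more effectively through language and thereby communicate more efficiently.

Result: In simulated courtroom proceedings and debates, human expert assessments show significant gains in communication efficiency.

Insight: The novelty lies in modelling dialogue as a signalling game over intents and strategies, in a linguistically informed reasoning framework with minimal task-specific coupling, and in a training-free equilibrium approximation algorithm for inference-time decision adjustment. Together these give multi-agent interaction a new theoretical basis and a practical method.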

Abstract: Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents’ communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.

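[12] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence cs.CL

Yibo Zhao, Jiapeng Zhu, Zichen Ding, Xiang Li

TL;DR: This paper proposes GRACE, a reinforcement learning framework that targets two critical flaws of retrieval-augmented generation (RAG) systems: giving correct answers without explicit supporting evidence, and producing fabricated responses when the retrieved context is insufficient. Using a data construction method and a multi-stage gated reward function, GRACE trains the model to assess evidence sufficiency, extract key evidence, and decide whether to answer or explicitly abstain.

Details

Motivation: Existing RAG systems lack a unified framework that addresses both evidence-based grounding and reliable abstention, so models may produce correct but ungrounded answers or fabricated information.

Result: Experiments on two benchmarks show that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation cost of prior methods.

Insight: The novelty is a unified reinforcement learning framework that uses heterogeneous retrievers to generate diverse training samples automatically, without manual annotation, and a multi-stage gated reward function that jointly optimizes evidence assessment, evidence extraction, and the answer-or-abstain decision, yielding reliable generation at low cost.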

Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at https://github.com/YiboZhao624/Grace.
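A multi-stage gated reward can be pictured as a chain in which each stage only contributes once the previous one has passed. The sketch below is a hypothetical illustration of that general pattern; the stage order follows the abstract, but the reward magnitudes, the `evidence_threshold`, and the use of an F1-style evidence score are assumptions, not details taken from GRACE.

```python
def gated_reward(sufficiency_correct: bool,
                 evidence_f1: float,
                 outcome_correct: bool,
                 evidence_threshold: float = 0.5) -> float:
    """Toy three-stage gated reward (all magnitudes are illustrative).

    Stage 1: did the model judge evidence sufficiency correctly?
    Stage 2: did it extract the key supporting evidence well enough?
    Stage 3: was the final outcome right (a correct answer, or an
             abstention when the context was insufficient)?
    """
    # Gate 1: a wrong sufficiency judgment earns nothing further.
    if not sufficiency_correct:
        return 0.0
    reward = 1.0
    # Gate 2: poor evidence extraction stops the reward here.
    if evidence_f1 < evidence_threshold:
        return reward
    reward += evidence_f1
    # Stage 3: only reachable when both earlier gates passed.
    if outcome_correct:
        reward += 1.0
    return reward

blocked_early = gated_reward(False, 1.0, True)
blocked_at_evidence = gated_reward(True, 0.2, True)
full_credit = gated_reward(True, 0.8, True)
```

The gating means a lucky correct answer without grounded evidence cannot collect the final-stage reward, which is the behavior the abstract describes wanting to discourage.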

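[13] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback cs.CL

Seongyeub Chu, Jongwoo Kim, Munyong Yi

TL;DR: This paper proposes FeedEval, a framework that evaluates the quality of LLM-generated essay feedback from a pedagogical perspective along three dimensions: specificity, helpfulness, and validity. Specialized LLM evaluators are trained to select high-quality feedback, which improves downstream automated essay scoring models.

Details

Motivation: Prior work often trains essay scoring models directly on unvalidated LLM-generated feedback, which propagates noise. A pedagogically aligned method for assessing feedback quality is needed.

Result: On the ASAP++ benchmark, FeedEval closely agrees with expert judgments. Scoring models trained on the high-quality feedback it selects perform better, and the same feedback helps small LLMs revise essays more effectively.

Insight: The novelty is an evaluation framework for LLM feedback driven by pedagogical dimensions, with specialized evaluators trained on curated data, providing a quality-assurance mechanism for using LLM-generated content.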

Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon acceptance.

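[14] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization cs.CL

Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

TL;DR: This paper proposes RL-Text2Vis, presented as the first reinforcement learning framework for the text-to-visualization (Text2Vis) task. Built on Group Relative Policy Optimization (GRPO), it uses a novel multi-objective reward to jointly optimize textual accuracy, code validity, and visualization quality, training on post-execution feedback. Experiments show that it clearly outperforms baselines such as GPT-4o on the Text2Vis benchmark.

Details

Motivation: Charts produced by existing Text2Vis systems, whether closed-source LLMs or open-source models, often suffer from poor semantic alignment, lack of clarity, or non-executable code, and conventional supervised fine-tuning cannot use post-execution feedback to improve overall visualization quality.

Result: On the Text2Vis benchmark, Qwen2.5 models (7B and 14B) trained with RL-Text2Vis achieve a 22% relative improvement in chart quality over GPT-4o and raise code execution success from 78% for the zero-shot baseline to 97%. The models also generalize well to out-of-domain datasets such as VIS-Eval and NVBench, significantly outperforming zero-shot and supervised baselines.

Insight: The main novelty is combining reinforcement learning (GRPO in particular) with a multi-objective reward that integrates textual, code, and visual feedback. It is the first systematic use of post-execution feedback to optimize visualization generation end to end, and offers an effective strategy for structured multimodal reasoning.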

Abstract: Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
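The two ingredients named in the abstract, a multi-objective reward and GRPO's group-relative scoring, can be sketched as follows. This is a simplified illustration under stated assumptions: the weights, the rule that a non-executing chart gets a visualization score of zero, and the use of the population standard deviation are choices made for the example, not values from RL-Text2Vis.

```python
import statistics

def combined_reward(text_score: float, code_ok: bool, vis_score: float,
                    weights=(0.4, 0.2, 0.4)) -> float:
    """Weighted sum of text, code, and visualization rewards (weights are hypothetical)."""
    w_text, w_code, w_vis = weights
    # Assumption: a chart that never rendered cannot earn visualization credit.
    if not code_ok:
        vis_score = 0.0
    return w_text * text_score + w_code * (1.0 if code_ok else 0.0) + w_vis * vis_score

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each sample's reward standardized within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled completions for the same query: (text accuracy, code runs?, chart quality).
samples = [(1.0, True, 0.9), (0.5, True, 0.4), (1.0, False, 0.9)]
rewards = [combined_reward(*s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other completions for the same query, no separate value model is needed; the completion whose code fails to run ends up with the lowest advantage even though its text answer was correct.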

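[15] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions cs.CL | cs.AI | cs.LG

Zhiyuan He, Binghan Chen, Tianxiang Xiong, Ziyang Sun, Mozhao Zhu

TL;DR: This paper studies the limitations of the knowledge editing method ROME on multi-hop reasoning tasks, identifies three key failure modes, and proposes Redundant Editing, a simple and effective strategy for improving multi-hop reasoning.

Details

Motivation: Existing knowledge editing methods such as ROME work well for updating single-hop facts but face significant challenges on multi-hop reasoning tasks that require chaining knowledge. The paper investigates these limitations and proposes an improvement.

Result: Experiments show that Redundant Editing improves accuracy on 2-hop questions by at least 15.5 percentage points, a 96% increase over the previous single-edit strategy, at the cost of some specificity and language naturalness.

Insight: The novelty lies in revealing layer-depth-dependent failure modes of ROME in multi-hop reasoning, such as "hopping-too-late" and deteriorating generalization, and in proposing Redundant Editing to mitigate them, which offers useful guidance for applying knowledge editing in practice.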

Abstract: Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the “hopping-too-late” problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of “hopping-too-late” and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.
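The two headline figures can be related with a quick arithmetic check. This assumes that the 15.5-point gain and the 96% relative increase describe the same comparison; the abstract says "at least", so the values below are rough implied figures, not numbers reported by the paper.

```python
# Reported in the abstract.
gain_pp = 15.5            # absolute gain, in percentage points
relative_increase = 0.96  # 96% increase over the single-edit strategy

# Implied (not reported): gain = relative_increase * baseline.
implied_baseline = gain_pp / relative_increase
implied_edited = implied_baseline + gain_pp

print(f"implied single-edit accuracy: {implied_baseline:.1f}%")
print(f"implied Redundant Editing accuracy: {implied_edited:.1f}%")
```

Under that assumption, the single-edit accuracy on 2-hop questions would be roughly 16% and the Redundant Editing accuracy roughly 32%, which shows why a modest absolute gain corresponds to a near doubling.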

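[16] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation cs.CL

Rhea Kapur, Robert Hawkins, Elisa Kreiss

TL;DR: This paper observes that in image descriptions generated by current vision-language models, description length is often conflated with informational specificity, and argues that the two should be decoupled. The authors build a dataset that holds length fixed while varying information content and confirm that people prefer more specific descriptions, supporting evaluation methods that prioritize specificity over verbosity.

Details

Motivation: Length and specificity are conflated in image descriptions generated by vision-language models. Descriptions should be dense with information instead of merely long, which is what raises description quality.

Result: Using a dataset that controls for length while varying information content, the authors confirm that human evaluators prefer more specific descriptions. Controlling for length alone cannot account for differences in specificity: how the length budget is allocated matters.

Insight: The novelty lies in decoupling description length from specificity and defining specificity relative to a contrast set. Evaluation should focus directly on information density instead of length, which gives image description evaluation a new perspective.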

Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with description length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
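One way to make "specificity relative to a contrast set" concrete is to ask how much probability a description assigns to the target image when it competes with distractor images. The sketch below is an illustrative formalization, not the paper's metric: it assumes some description-image match score is already available (for instance from an image-text similarity model) and turns the scores into a softmax probability for the target. The `temperature` parameter and the example scores are hypothetical.

```python
import math

def specificity(target_score: float, distractor_scores, temperature: float = 1.0) -> float:
    """Probability that the description picks out the target among the contrast set."""
    scores = [target_score] + list(distractor_scores)
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    return exps[0] / sum(exps)

# A short but specific description matches the target far better than the distractors.
concise_specific = specificity(5.0, [1.0, 1.0])
# A long but vacuous description fits every image in the contrast set equally well.
lengthy_vacuous = specificity(2.0, [2.0, 2.0])
```

Note that length never enters the computation: a description scores higher only by separating the target from the alternatives, which mirrors the abstract's point that a description can be concise yet dense, or lengthy yet vacuous.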

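[17] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR cs.CL

Yihong Tang, Kehai Chen, Xuefeng Bai, Benyou Wang, Zeming Liu

TL;DR: This paper proposes Character-R1, a framework that strengthens role-aware reasoning in role-playing agents (RPAs) by providing verifiable reward signals, addressing the lack of internal cognitive consistency in existing methods that imitate surface-level behavior. The framework has three core designs: Cognitive Focus Reward, Reference-Guided Reward, and Character-Conditioned Reward Normalization.

Details

Motivation: Current role-playing agents are typically built by imitating surface-level behaviors and lack internal cognitive consistency, which leads to out-of-character errors in complex situations. A framework that supplies comprehensive, verifiable reward signals is needed to improve role-aware reasoning.

Result: Extensive experiments show that Character-R1 significantly outperforms existing methods on knowledge, memory, and other dimensions. The abstract does not name specific benchmarks or say whether the results are state of the art.

Insight: The novelty lies in a reward design that structures internal cognition (label-based analysis of 10 character elements), overlap-based metrics that use reference responses as optimization anchors, and reward distributions adjusted by character category to ensure robust optimization across heterogeneous roles. Viewed objectively, these designs systematically improve the reliability and interpretability of role reasoning.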

Abstract: Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods on knowledge, memory, and other dimensions.
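The third design, adjusting reward distributions by character category, can be illustrated with per-category standardization. This is a minimal sketch of the general idea and not the paper's formula: the category names, the reward values, and the choice of z-score normalization are assumptions made for the example.

```python
from collections import defaultdict
import statistics

def normalize_by_category(rewards, categories, eps=1e-8):
    """Standardize each reward against the mean and spread of its own category."""
    grouped = defaultdict(list)
    for reward, category in zip(rewards, categories):
        grouped[category].append(reward)
    # Per-category (mean, population std); eps guards against a zero spread.
    stats = {c: (statistics.fmean(v), statistics.pstdev(v)) for c, v in grouped.items()}
    return [(r - stats[c][0]) / (stats[c][1] + eps) for r, c in zip(rewards, categories)]

# Hypothetical raw rewards: one category systematically scores higher than the other.
rewards = [0.9, 0.7, 0.3, 0.1]
categories = ["historical", "historical", "fictional", "fictional"]
normalized = normalize_by_category(rewards, categories)
```

After normalization the better response in each category receives the same positive signal, so roles whose raw rewards run systematically low are not drowned out during optimization.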

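[18] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset cs.CL | cs.AI

Haneul Yoo, Won Ik Cho, Geunhye Kim, Jiyoon Han

TL;DR: This paper proposes a scalable approach that uses national social studies curricula as the basis for culture-aware supervision in order to build open-ended, culture-specific question answering datasets. The authors develop CuCu, an automated multi-agent LLM framework that converts national textbook curricula into open-ended, culture-specific question-answer pairs. Applied to the Korean national social studies curriculum, it yields KCaQA, a dataset of 34.1k open-ended QA pairs.

Details

Motivation: Progress in large language models is uneven across languages and cultures, and models often reflect values latent in English-centric training data. The paper aims to enable practical cultural alignment.

Result: Applying CuCu to the Korean national social studies curriculum produces KCaQA, comprising 34.1k open-ended QA pairs. Quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.

Insight: The novelty is using national curricula as a source of cultural supervision and automating the generation of culture-specific open-ended QA data with a multi-agent LLM framework, which offers a scalable way to build data for cultural alignment.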

Abstract: Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.

[19] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark cs.CLPDF

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

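TL;DR: This paper proposes MAGA-Bench, a benchmark of machine-generated text (MGT) augmented by strengthening LLM alignment. Its core is the MAGA pipeline, which applies alignment comprehensively from prompt construction through the reasoning process and introduces a key component, RLDF (Reinforced Learning from Detectors Feedback). By generating better-aligned text, the work aims both to attack existing detectors to test their robustness and to improve the generalization of detectors fine-tuned on the data.

Details

Motivation: As LLM alignment evolves, MGT is increasingly hard to distinguish from human-written text (HWT), which worsens abuse such as fake news and online fraud. The generalization of fine-tuned detectors depends heavily on dataset quality. Merely expanding MGT sources is insufficient, so the generation process itself must be augmented.

Result: A RoBERTa detector fine-tuned on the MAGA training set improves generalization detection AUC by 4.60% on average. The MAGA dataset also lowers the AUC of the selected detectors by 8.13% on average, which the authors offer as a reference point for future research on detector generalization.

Insight: The core novelty is building a high-quality benchmark dataset by systematically strengthening the alignment of generated text (the MAGA pipeline), in particular through RLDF. Its distinctive angle is that the better-aligned text serves both as adversarial samples for attacking existing detectors (to evaluate robustness) and as training data for improving the generalization of new detectors, a combined attack-and-defense approach to dataset construction.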

Abstract: The alignment of Large Language Models (LLMs) is constantly evolving, and Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. The generalization ability of fine-tuned detectors is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient; further augmentation of the generation process is required. According to HC-Var’s theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA’s pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which Reinforced Learning from Detectors Feedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. The MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, which we expect to serve as a reference point for future research on the generalization detection ability of detectors.

[20] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs cs.CL | cs.AI | cs.FLPDF

Yanming Liu, Xinyue Peng, Jiannan Cao, Xinyi Wang, Songhang Deng

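TL;DR: This paper proposes ToolGate, a framework that formalizes tool calls as Hoare-style contracts (a precondition and a postcondition) to give LLM tool execution logical safety guarantees and verifiable state evolution. It maintains an explicit symbolic state space and ensures that only verified tool results can update the state, which prevents invalid or hallucinated results from corrupting the world representation.

Details

Motivation: Existing tool-augmented LLM frameworks rely heavily on natural language reasoning to decide when to invoke tools and whether to commit their results, with no formal guarantees of logical safety or verifiability.

Result: Experiments show that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks.

Insight: The core novelty is formalizing tool calls as Hoare contracts and gating both tool execution and state updates through runtime verification of pre- and postconditions, which lays a foundation for more trustworthy and debuggable AI systems. It is a systematic way of bringing program-verification ideas into LLM tool use.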

Abstract: Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool’s result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
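The contract mechanism described in this abstract can be sketched in a few lines. The following is an illustrative toy under stated assumptions, not ToolGate's actual API: the `ToolContract` class, the `execute` function, and the temperature tools are all hypothetical names. It shows a precondition gating invocation and a postcondition gating the commit of a result into a key-value state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # key-value mapping holding trusted information

@dataclass
class ToolContract:
    """A tool wrapped in a Hoare-style contract (hypothetical structure)."""
    name: str
    precondition: Callable[[State], bool]
    run: Callable[[State], Any]
    postcondition: Callable[[State, Any], bool]
    output_key: str

def execute(state: State, tool: ToolContract) -> State:
    """Return an updated state only if both contract checks pass."""
    if not tool.precondition(state):
        return state  # invocation blocked: required facts are missing
    result = tool.run(state)
    if not tool.postcondition(state, result):
        return state  # result rejected: it is never committed
    return {**state, tool.output_key: result}

def has_city(state: State) -> bool:
    return isinstance(state.get("city"), str)

def plausible_temperature(state: State, result: Any) -> bool:
    return isinstance(result, float) and -90.0 <= result <= 60.0

# One well-behaved tool and one that returns an unverifiable answer.
good_tool = ToolContract("get_temperature", has_city,
                         lambda s: 21.5, plausible_temperature, "temperature_c")
bad_tool = ToolContract("get_temperature", has_city,
                        lambda s: "very hot", plausible_temperature, "temperature_c")

state = {"city": "Oslo"}
after_good = execute(state, good_tool)   # result committed
after_bad = execute(state, bad_tool)     # result rejected by the postcondition
blocked = execute({}, good_tool)         # blocked by the precondition
```

Returning a new dictionary instead of mutating the old one keeps every committed state traceable to a verified execution, which mirrors the "verifiable state evolution" goal stated in the abstract.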

[21] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation cs.CL | cs.CVPDF

Naquee Rizwan, Subhankar Swain, Paramananda Bhaskar, Gagan Aryan, Shehryaar Shah Khan

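TL;DR: This paper proposes "See, Explain, and Intervene", a few-shot multimodal agent framework for moderating hateful memes. Built on generative AI models, it addresses hateful memes from three complementary angles (detection, content explanation, and pre-posting intervention) and targets generalizable moderation under data scarcity.

Details

Motivation: Prior work usually studies detection, explanation, and intervention separately, which does not match real-world use. Annotating large datasets for meme moderation is also prohibitively expensive, so a generalizable moderation method that works with limited data is needed.

Result: The paper claims to be the first work focused on generalizable hateful meme moderation under limited data. The abstract reports no specific benchmarks or quantitative results, but stresses strong potential for deployment in real-world production scenarios.

Insight: The novelty is unifying detection, explanation, and intervention in a single few-shot multimodal agent framework, and using task-specific generative multimodal agents together with the few-shot adaptability of large multimodal models to handle different types of memes under data scarcity.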

Abstract: In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene on them prior to being posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and that it has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic content.

[22] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards cs.CLPDF

Mukesh Ghimire, Aosong Feng, Liwen You, Youzhi Luo, Fang Liu

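TL;DR: The paper proposes PRISM, a framework for post-training large language models when verifiable rewards are unavailable. It combines a Process Reward Model with the model's internal confidence to guide learning, addressing existing methods' reliance on unreliable internal-consistency signals.

Details

Motivation: Current LLM post-training relies on costly human supervision or external verifiers, and as models improve, high-quality human solutions become harder to obtain, so learning from unlabeled data is needed. Existing methods based on internal consistency (such as entropy or self-certainty) give unreliable signals that cannot support large-scale, long-term training.

Result: The paper shows that effectively combining a Process Reward Model with self-certainty yields stable training and better test-time performance while keeping the model's internal confidence in check.

Insight: The novelty is a unified training framework, PRISM, that pairs a Process Reward Model with the model's internal confidence to provide a more reliable learning signal without ground-truth labels, offering a new direction for unsupervised or weakly supervised LLM post-training.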

Abstract: Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract learning signal from a model’s consistency, either by majority voting or by converting the model’s internal confidence into reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address the unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model’s internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model’s internal confidence in check.

[23] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking cs.CLPDF

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song

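TL;DR: This report introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions built on the Qwen3-VL foundation model, which together form an end-to-end framework for high-precision multimodal retrieval and reranking. The framework maps text, images, document images, and video into a unified representation space, supports more than 30 languages, and comes in 2B and 8B parameter sizes to suit different deployment needs.

Details

Motivation: To address unified representation of different modalities and fine-grained relevance estimation in multimodal retrieval by building an end-to-end, high-precision multimodal search pipeline.

Result: The series achieves state-of-the-art results on multiple multimodal embedding benchmarks. Specifically, Qwen3-VL-Embedding-8B scores 77.8 overall on MMEB-V2, ranking first among all models (as of January 8, 2025).

Insight: Contributions include: 1) a multi-stage training paradigm (from large-scale contrastive pre-training to reranking model distillation) that produces semantically rich high-dimensional vectors; 2) support for Matryoshka Representation Learning, which allows flexible embedding dimensions; 3) an encoder model for unified embeddings paired with a cross-encoder reranker for fine-grained relevance estimation, forming a complementary end-to-end pipeline; 4) inherited multilingual capability and long-context support (up to 32K tokens).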

Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
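Matryoshka Representation Learning is what makes the "flexible embedding dimensions" mentioned above possible: a model trained this way is meant to stay useful when only a prefix of each vector is kept. The sketch below shows the usual consumer-side step, truncating to the first `dim` components and re-normalizing before computing cosine similarity. It is a generic illustration with made-up vectors, not the Qwen3-VL-Embedding API; `truncate_embedding` and `cosine` are hypothetical helper names.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-renormalize them."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "full" embeddings (made-up numbers, not model output).
query = [3.0, 4.0, 1.0, 0.5]
doc = [3.0, 4.0, -1.0, 0.2]

short_query = truncate_embedding(query, 2)
short_doc = truncate_embedding(doc, 2)
full_score = cosine(query, doc)
short_score = cosine(short_query, short_doc)
```

Shorter vectors reduce index size and search cost at some loss of ranking quality. How much is lost depends on the model and has to be measured per task.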

Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li

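TL;DR: This paper proposes AM$^3$Safety, a framework for safety alignment of multimodal large language models in multi-turn dialogue. The work first builds InterSafe-V, an open-source multimodal dialogue dataset, then designs a fine-tuning framework that combines a cold-start refusal phase with Group Relative Policy Optimization to improve both safety and helpfulness in multi-turn interaction.

Details

Motivation: Existing reinforcement learning from human feedback methods mainly target single-turn visual question answering and depend on costly manual annotation, so they struggle with multi-turn multimodal settings, where safety vulnerabilities emerge gradually and security protocols are forgotten as the conversation progresses.

Result: On Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B, the method lowers Attack Success Rate by more than 10% on multimodal multi-turn safety benchmarks, raises the harmless dimension by at least 8% and the helpful dimension by more than 13%, and preserves the models' general abilities.

Insight: Contributions include: 1) InterSafe-V, a multi-turn multimodal safety dialogue dataset built through interaction between models to better reflect real-world scenarios; 2) a fine-tuning framework that combines cold-start refusal with Group Relative Policy Optimization and optimizes turn-aware dual-objective rewards across the entire dialogue, which gives data-efficient safety alignment.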

Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than a 10% decrease in Attack Success Rate (ASR) together with an increase of at least 8% in the harmless dimension and over 13% in the helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.
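GRPO scores each sampled output relative to the other outputs drawn for the same prompt, which removes the need for a separate value model. The sketch below illustrates that group-relative step on dialogue-level rewards. The reward aggregation in `dialogue_reward` (an equal-weight average of per-turn harmlessness and helpfulness scores) is a hypothetical stand-in for the paper's turn-aware dual-objective reward, whose exact form the abstract does not give, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev

def dialogue_reward(turns, w_safe=0.5):
    """Average a weighted mix of per-turn scores (hypothetical aggregation)."""
    per_turn = [w_safe * t["harmless"] + (1.0 - w_safe) * t["helpful"] for t in turns]
    return mean(per_turn)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three sampled two-turn dialogues for the same prompt (made-up scores in [0, 1]).
group = [
    [{"harmless": 1.0, "helpful": 1.0}, {"harmless": 1.0, "helpful": 1.0}],
    [{"harmless": 1.0, "helpful": 0.0}, {"harmless": 1.0, "helpful": 0.0}],
    [{"harmless": 0.0, "helpful": 0.0}, {"harmless": 0.0, "helpful": 0.0}],
]
rewards = [dialogue_reward(d) for d in group]
advantages = group_relative_advantages(rewards)
```

Dialogues that score above the group mean receive positive advantages and are reinforced, and those below the mean are discouraged.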

[25] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval cs.CLPDF

Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang

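TL;DR: This paper proposes Tool-MAD, a multi-agent debate framework for fact verification. It strengthens verification by giving each agent a different external tool (such as a search API or RAG module), and adds an adaptive query mechanism plus quantitative assessment based on faithfulness and answer relevance to mitigate LLM hallucination.

Details

Motivation: Existing multi-agent debate frameworks for fact verification rely mainly on internal model knowledge or static documents and are prone to hallucination. Methods such as MADKE bring in external evidence, but their one-time retrieval cannot adapt to new arguments or information that emerge during the debate.

Result: On four fact verification benchmarks, Tool-MAD consistently outperforms state-of-the-art multi-agent debate frameworks, with up to 5.5% higher accuracy, and shows strong robustness and adaptability in specialized domains such as medicine.

Insight: Main contributions: 1) equipping debate agents with heterogeneous external tools to encourage diverse perspectives; 2) an adaptive query mechanism that iteratively refines evidence retrieval as the debate unfolds; 3) integrating Faithfulness and Answer Relevance scores into the final decision, so the Judge agent can quantitatively assess each response's coherence and question alignment and thereby detect hallucinations.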

Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.

[26] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks cs.CL | cs.AIPDF

Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi

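TL;DR: This paper introduces PILOT-Bench, the first benchmark for legal reasoning in the patent domain. It aligns decisions of the USPTO's Patent Trial and Appeal Board (PTAB) with patent data and defines three classification tasks based on the IRAC (Issue, Rule, Application, Conclusion) framework: Issue Type, Board Authorities, and Subdecision.

Details

Motivation: Large language models (LLMs) are currently used in patent and legal practice only for lightweight tasks, and there is no systematic way to evaluate their capacity for structured legal reasoning in the patent domain. A dedicated benchmark is needed to fill this gap.

Result: On the Issue Type task, closed-source models consistently exceed 0.75 Micro-F1, while the strongest open-source model (Qwen-8B) reaches about 0.56, a substantial gap in reasoning capability.

Insight: The novelty is the first PTAB-centric benchmark, which systematically evaluates patent legal reasoning through IRAC-aligned classification tasks and points toward improving LLMs through dataset design and model alignment.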

Abstract: The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case-level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
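The headline numbers for this benchmark are Micro-F1 scores. For readers unfamiliar with the metric, the sketch below pools true positives, false positives, and false negatives over all examples and labels before computing F1, i.e. 2TP / (2TP + FP + FN). Representing each example as a set of labels is an assumption that covers both single-label and multi-label setups, and the labels "A", "B", "C" are placeholders, not PILOT-Bench classes.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over examples given as sets of labels."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, pred):
        gold_set, pred_set = set(gold_labels), set(pred_labels)
        tp += len(gold_set & pred_set)   # predicted and correct
        fp += len(pred_set - gold_set)   # predicted but wrong
        fn += len(gold_set - pred_set)   # missed gold labels
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Placeholder labels for three examples.
gold = [{"A"}, {"A", "B"}, {"C"}]
pred = [{"A"}, {"B"}, {"B"}]
score = micro_f1(gold, pred)        # tp=2, fp=1, fn=2
perfect = micro_f1(gold, gold)
```

Because counts are pooled before averaging, frequent classes dominate Micro-F1, so it can hide weak performance on rare classes.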

[27] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence cs.CLPDF

Shengyin Sun, Yiming Li, Renxi Liu, Weizhe Lin, Hui-Ling Zhen

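TL;DR: This paper revisits Judge Decoding from first principles and shows that the "criticality" scores that conventional methods learn through costly supervision are intrinsically encoded in the distributional divergence between the draft and target models. It proves a structural correspondence between learned linear judges and Kullback-Leibler divergence, and on that basis proposes a simple, training-free verification mechanism based on KL divergence.

Details

Motivation: Judge Decoding, which relaxes the strict verification of Speculative Decoding, depends on expensive and noisy supervision. The aim is a training-free, more robust verification mechanism built directly on distributional divergence.

Result: Extensive experiments on reasoning and coding benchmarks show that the method matches or outperforms complex trained judges (e.g., AutoJudge), is more robust to domain shift, and removes the supervision bottleneck entirely.

Insight: The core novelty is the theoretical correspondence between learned judges and KL divergence, from which a fully training-free verification method follows. The underlying insight is that the distributional gap between draft and target models already carries enough information for efficient verification, with no extra supervised learning.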

Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the “criticality” scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
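To make the idea of a KL-based, training-free check concrete, here is a minimal sketch. It accepts a drafted token position when the KL divergence between the target and draft next-token distributions falls below a threshold. The direction of the divergence (target relative to draft), the threshold value, and the function names are assumptions for illustration; the abstract does not specify the paper's exact acceptance rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def accept_draft_token(target_logits, draft_logits, threshold=0.5):
    """Accept when the two models' distributions are close (assumed rule)."""
    p = softmax(target_logits)   # target model distribution
    q = softmax(draft_logits)    # draft model distribution
    return kl_divergence(p, q) <= threshold

# Made-up logits over a three-token vocabulary.
agree = accept_draft_token([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
disagree = accept_draft_token([10.0, 0.0, 0.0], [0.0, 10.0, 0.0])
```

A low divergence means the draft model's prediction at that position barely differs from the target's, so relaxing exact-match verification there is unlikely to change the output much.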

[28] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems cs.CL | cs.AIPDF

Xinyue Peng, Yanming Liu, Yihan Cang, Yuwei Zhang, Xinyi Wang

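TL;DR: This paper proposes NC2C, an end-to-end automated framework based on large language models (LLMs) that transforms generic non-convex optimization problems into solvable convex forms. It uses the mathematical reasoning ability of LLMs to autonomously detect non-convex components, select convexification strategies, and generate rigorous convex equivalents, and it integrates symbolic reasoning, adaptive transformation techniques, and iterative validation to keep the transformation robust and valid.

Details

Motivation: Traditional solvers are inefficient on non-convex problems with complex objectives and constraints, and convexification depends heavily on expert knowledge and manual work.

Result: On a diverse dataset of 100 generic non-convex problems, NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations, significantly outperforming baseline methods.

Insight: The novelty is using LLM mathematical reasoning to fully automate non-convex-to-convex transformation and reduce dependence on expert knowledge. The framework integrates symbolic reasoning, adaptive transformation, and iterative validation to improve robustness, so that convex solvers can efficiently handle optimization tasks that were previously intractable.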

Abstract: Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs’ mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C’s ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.

[29] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection cs.CLPDF

Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng

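TL;DR: This paper proposes RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. It retrieves multi-perspective source-domain evidence aligned with the target sample's semantics, sentiment, and writing style, and uses multi-agent collaboration to build verifiable multi-step reasoning paths, overcoming the weaknesses of existing methods in cross-domain generalization and systematic reasoning.

Details

Motivation: In cross-domain misinformation detection, existing methods rely on single-perspective cues and generalize poorly to challenging or underrepresented domains, while reasoning LLMs are limited to same-distribution data.

Result: Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches.

Insight: Contributions include: 1) cross-domain transfer beyond same-distribution assumptions, by retrieving multi-perspective source-domain evidence; 2) verifiable multi-step reasoning paths built through specialized multi-agent collaboration, which addresses single-perspective modeling and missing systematic reasoning; 3) a single multi-task verifier trained with supervised fine-tuning and reinforcement learning to strengthen verification and reasoning.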

Abstract: Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample’s semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at https://github.com/lzw108/RAAR.

[30] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News cs.CLPDF

Zhiwei Liu, Paul Thompson, Jiaqi Rong, Baojie Qu, Runteng Guo

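TL;DR: This paper introduces MisSpans, the first multi-domain, human-annotated benchmark for fine-grained, span-level misinformation detection and analysis, consisting of paired real and fake news stories. It defines three complementary tasks: identifying false spans within sentences, classifying spans by misinformation type, and providing explanations grounded in the identified spans. The study evaluates 15 representative LLMs under zero-shot and one-shot settings and shows how challenging fine-grained misinformation identification is.

Details

Motivation: Existing misinformation detection methods are usually evaluated at the level of whole claims or paragraphs with coarse binary labels, which hides the fact that true and false details often coexist within a single sentence. This also limits interpretability, since it cannot locate the specific misleading spans or distinguish types of false detail.

Result: The MisSpans benchmark is used to evaluate 15 representative LLMs (including reasoning-enhanced and non-reasoning variants) under zero-shot and one-shot settings. Results show that fine-grained misinformation identification and analysis is challenging, and that performance is shaped by several interacting factors, including model size, reasoning capability, and domain-specific textual features.

Insight: The novelty is the first fine-grained, cross-domain benchmark with span-level misinformation annotations, together with a full task chain from identification through classification to explanation, which pushes misinformation detection toward finer granularity and better interpretability. Viewed objectively, the work underlines the value of going beyond binary classification and attending to local details and types of falsehood when understanding and countering misinformation.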

Abstract: Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at https://github.com/lzw108/MisSpans.

[31] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs cs.CLPDF

Maxime Delmas, Lei Xu, André Freitas

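TL;DR: This paper proposes ToPG (Traversal over Proposition Graphs), a new RAG framework that addresses the limitations of existing RAG methods on both complex multi-hop queries and simple factual queries. It models the knowledge base as a heterogeneous graph of propositions, entities, and passages, and traverses the graph through iterative Suggestion-Selection cycles, achieving strong performance across multiple QA tasks.

Details

Motivation: Standard RAG pipelines retrieve chunks and do well on simple factual retrieval, but they lack structural connectivity and struggle with complex multi-hop queries. Knowledge-graph-based RAG performs well on complex tasks but poorly on fact-oriented single-hop queries. The paper aims to bridge this gap with a unified framework that handles both simple and complex queries.

Result: Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG shows strong performance on both accuracy-based and quality-based metrics.

Insight: The paper's claimed novelty is modeling the knowledge base as a heterogeneous graph that combines the factual granularity of propositions with graph connectivity, plus iterative Suggestion-Selection cycles for query-aware graph traversal. Viewed objectively, the core contribution is combining fine-grained facts (propositions) with a structured graph-traversal mechanism, a key component for efficient, general-purpose structured RAG systems.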

Abstract: Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at https://github.com/idiap/ToPG.
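The Suggestion-Selection cycle can be pictured as a frontier expansion with pruning. The toy sketch below is an assumption-laden illustration, not the ToPG implementation: the graph is a plain adjacency dictionary over proposition IDs (the real graph also contains entities and passages), `suggest` proposes unvisited neighbors without any query-aware scoring, and `select` is a fixed relevance filter standing in for LLM feedback.

```python
def suggest(graph, frontier, visited):
    """Suggestion phase: propose unvisited neighbors of the current frontier."""
    candidates = set()
    for node in frontier:
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                candidates.add(neighbor)
    return candidates

def traverse(graph, seeds, select, max_iters=3):
    """Alternate Suggestion and Selection until nothing new is kept."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(max_iters):
        candidates = suggest(graph, frontier, visited)
        if not candidates:
            break
        kept = select(candidates)  # Selection phase: prune irrelevant nodes
        if not kept:
            break
        visited |= kept
        frontier = kept            # kept propositions seed the next iteration
    return visited

# Toy proposition graph; node IDs are placeholders.
graph = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": ["p5"], "p4": [], "p5": []}
relevant = {"p2", "p4"}            # stand-in for an LLM relevance judgment

def select(candidates):
    return {c for c in candidates if c in relevant}

collected = traverse(graph, ["p1"], select)
```

Pruning at every step keeps the frontier small, so the traversal follows a query-relevant path (here p1, p2, p4) instead of expanding the whole neighborhood.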

[32] V-FAT: Benchmarking Visual Fidelity Against Text-bias cs.CL | cs.CV | cs.LG | cs.MMPDF

Ziteng Wang, Yujie He, Guanliang Li, Siqi Yang, Jiaqi Xiong

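TL;DR: This paper proposes V-FAT, a benchmark for diagnosing "text bias" in multimodal large language models (MLLMs), that is, over-reliance on textual cues instead of actual visual information. The benchmark contains 4,026 VQA instances and uses a three-level evaluation framework (L1 internal corpus bias, L2 external instruction bias, L3 synergistic bias) that systematically increases the conflict between visual evidence and textual information. It also introduces the Visual Robustness Score (VRS) to quantify a model's true visual fidelity.

Details

Motivation: Current MLLMs do well on standard visual reasoning benchmarks but may rely on linguistic shortcuts instead of genuine visual grounding (text bias). The paper investigates the fundamental tension between visual perception and linguistic priors.

Result: An evaluation of 12 frontier MLLMs shows that although the models excel on existing benchmarks, they suffer significant visual collapse under high linguistic dominance (the L2 and L3 levels of V-FAT), which exposes a serious deficit in visual fidelity.

Insight: The novelty is decoupling text bias into two dimensions, internal corpus bias and external instruction bias, and building a systematic diagnostic benchmark (V-FAT) and evaluation framework to quantify it. Viewed objectively, the VRS metric and the three-level conflict test offer a new, stricter lens and toolset for assessing visual grounding in MLLMs.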

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize “lucky” linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.

[33] GenProve: Learning to Generate Text with Fine-Grained Provenance cs.CLPDF

Jingxuan Wei, Xingyue Wang, Yanghaoyu Liao, Jie Dong, Yuchen Liu

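TL;DR: This paper proposes GenProve, a framework aimed at hallucination in LLM-generated text. It introduces a fine-grained provenance task that requires the model to produce structured, sentence-level provenance triples while generating a fluent answer.

Details

Motivation: Existing citation methods are usually coarse-grained and cannot distinguish direct quotation from complex reasoning, so users struggle to verify how a cited source supports a generated claim. A fine-grained provenance mechanism is needed to improve accountability.

Result: On the ReFInE dataset, GenProve combines supervised fine-tuning with Group Relative Policy Optimization and significantly outperforms 14 strong LLMs in joint evaluation of answer fidelity and provenance correctness.

Insight: The novelty is defining the generation-time fine-grained provenance task and building an expert-verified dataset that distinguishes Quotation, Compression, and Inference. The analysis shows that models do well on surface-level quotation but fall well short on inference-based provenance, which indicates that verifiable reasoning remains a frontier challenge distinct from surface-level citation.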

Abstract: Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.

[34] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction cs.CL | cs.SDPDF

Qing Wang, Zehan Li, Yaodong Song, Hongjie Chen, Jian Kang

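TL;DR: This paper proposes a unified spoken language model whose emotional intelligence is enhanced by a data construction strategy called Injected Emotional-Attribution Thinking (IEAT), which lets the model internalize emotion-aware reasoning. The model is trained with a two-stage progressive strategy: the first stage performs speech-text alignment and emotional attribute modeling via self-distillation, and the second stage performs end-to-end cross-modal joint optimization to keep textual and spoken emotional expression consistent. On the HumDial Emotional Intelligence benchmark, the approach achieves top-ranked performance in emotional trajectory modeling, emotional reasoning, and empathetic response generation.

Details

Motivation: Existing spoken language models lack internalized reasoning for emotional intelligence. Injecting the user's emotional state and its causes aims to make human-computer interaction more natural and emotionally aware.

Result: On the HumDial Emotional Intelligence benchmark, the approach achieves top-ranked performance on emotional trajectory modeling, emotional reasoning, and empathetic response generation, under both LLM-based and human evaluation.

Insight: The novelty is the IEAT strategy, which builds emotional-attribution thinking into the model's reasoning process instead of treating it as explicit supervision, together with two-stage cross-modal training for consistent emotional expression. This gives a reusable data construction and training recipe for emotionally intelligent models.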

Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model’s internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.

[35] Text as a Universal Interface for Transferable Personalization cs.CL | cs.AIPDF

Yuting Liu, Jian Guan, Jia-Nan Li, Wei Wu, Jiang-Ming Yang

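TL;DR: This paper proposes representing user preferences in natural language as a universal interface, addressing interpretability and transferability in LLM personalization. The authors introduce a two-stage training framework and, based on it, develop AlignXplore+, a model that generates textual preference summaries, and validate it on nine benchmarks.

Details

Motivation: Existing methods usually represent user preferences as implicit, model-specific vectors or parameters, which yields "black-box" profiles that are hard to interpret and hard to transfer across models and tasks. The paper aims to use natural language as a universal interface for preference descriptions that are interpretable, reusable, and able to evolve continually.

Result: Experiments on nine benchmarks show that the 8B-parameter model (AlignXplore+) achieves state-of-the-art performance, outperforming substantially larger open-source models, and transfers well across tasks, model families, and interaction formats.

Insight: The core idea is to establish natural language as a model- and task-agnostic interface for preference representation, which improves interpretability and transferability. The two-stage training framework, combining supervised fine-tuning with reinforcement learning, is designed to optimize long-term utility and cross-task transfer.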

Abstract: We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque “black-box” profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance, outperforming substantially larger open-source models, while exhibiting strong transferability across tasks, model families, and interaction formats.

[36] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization cs.CLPDF

Xueyun Tian, Minghua Ma, Bingbing Xu, Nuoyan Lyu, Wei Li

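TL;DR: This paper proposes using negative trajectories (those with incorrect final answers) in chain-of-thought supervised fine-tuning to improve the out-of-domain generalization of large language models. It finds that valid intermediate reasoning steps in negative samples provide additional supervision and mitigate overfitting. It further proposes a gain-based loss weighting method that adaptively rescales per-sample loss, improving out-of-domain generalization on models such as Qwen2.5-7B.

Details

Motivation: Standard chain-of-thought SFT keeps only positive trajectories with correct final answers and ignores negative ones. This discards supervision and aggravates overfitting, limiting out-of-domain generalization.

Result: On Qwen2.5-7B, GLOW yields a 5.51% out-of-domain gain over positive-only SFT; as an RL initialization, it raises MMLU from 72.82% to 76.47%.

Insight: The key finding is that negative trajectories often contain valid intermediate reasoning steps and act as a regularizer. The proposed GLOW method rescales per-sample loss adaptively based on inter-epoch training progress, making efficient use of unfiltered trajectories and improving generalization.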

Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
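
As a rough illustration of gain-based loss weighting, the sketch below rescales each sample's loss using how much that loss dropped since the previous epoch. The abstract does not give GLOW's actual weighting rule, so the softmax-style mapping, its direction (a larger gain gets a larger weight), the `temperature` parameter, and both function names are assumptions made purely for illustration.

```python
import math


def gain_based_weights(prev_losses, curr_losses, temperature=1.0):
    """Hypothetical weighting rule (not taken from the paper).

    gain_i = previous-epoch loss minus current-epoch loss for sample i.
    Gains pass through a softmax and are rescaled so the mean weight is 1.
    """
    gains = [p - c for p, c in zip(prev_losses, curr_losses)]
    scaled = [g / temperature for g in gains]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [len(gains) * e / total for e in exps]


def weighted_loss(curr_losses, weights):
    """Mean of per-sample losses after rescaling by the weights."""
    return sum(w * l for w, l in zip(weights, curr_losses)) / len(curr_losses)


# Three samples: the first improved most since the last epoch, the third not at all.
prev = [2.0, 2.0, 2.0]
curr = [1.0, 1.5, 2.0]
weights = gain_based_weights(prev, curr)
loss = weighted_loss(curr, weights)
```

Because the weights average to 1, the rescaled loss stays on the same scale as the unweighted mean, which keeps the learning rate comparable between the two settings.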

[37] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models cs.CL | cs.AI | q-bio.NCPDF

Yueqing Hu, Xinyang Peng, Shuting Peng, Hanqi Wang, Tianhong Wang

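TL;DR: This paper studies reasoning distillation in large language models. It finds that training student models to imitate a teacher's reasoning traces via supervised fine-tuning causes "Functional Alignment Collapse": the student copies only the linguistic form of reasoning without internalizing the teacher's dynamic resource allocation policy, and so loses the natural alignment with human cognitive costs.

Details

Motivation: The paper asks whether the prevailing reasoning distillation approach (SFT on teacher reasoning traces) transmits the teacher's cognitive structure, which is aligned with human cognitive costs, or whether it yields only "Hán Dān Xué Bù"-style superficial mimicry.

Result: Across 14 models, teacher models mirror human difficulty scaling (mean correlation r̄=0.64), whereas distilled students show significantly degraded alignment (r̄=0.34), often falling below their own pre-distillation baselines ("Negative Transfer").

Insight: The paper identifies "Functional Alignment Collapse" in reasoning distillation and argues that human-like cognition is an emergent property of active reinforcement rather than passive imitation. From this cognitive perspective, better distillation needs to go beyond imitating linguistic form and internalize the dynamic resource allocation policy.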

Abstract: Recent Large Reasoning Models trained via reinforcement learning exhibit a “natural” alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation – training student models to mimic these traces via Supervised Fine-Tuning (SFT) – fails to transmit this cognitive structure. Testing the “Hán Dān Xué Bù” (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a “Functional Alignment Collapse”: while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines (“Negative Transfer”). Our analysis suggests that SFT induces a “Cargo Cult” effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher’s dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.

[38] Agent-as-a-Judge cs.CL | cs.AIPDF

Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu

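TL;DR: This paper presents the first comprehensive survey of the evolution from "LLM-as-a-Judge" to "Agent-as-a-Judge". It characterizes the paradigm shift through key dimensions and a developmental taxonomy, organizes core methodologies and applications, and discusses frontier challenges and future research directions, providing a roadmap for the next generation of agentic evaluation.

Details

Motivation: As evaluands become more complex, specialized, and multi-step, LLM-based evaluation is limited by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has driven a shift toward agentic evaluation, but the field lacks a unified framework for navigating the fast-moving landscape.

Result: As a survey, the paper reports no quantitative experimental results. It establishes a developmental taxonomy and systematically organizes core methodologies and applications across general and professional domains.

Insight: The contribution is a unified framework and developmental taxonomy for the "Agent-as-a-Judge" paradigm shift. It highlights how agentic judges use planning, tool-augmented verification, multi-agent collaboration, and persistent memory to produce more robust, verifiable, and nuanced evaluations.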

Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

[39] RelayLLM: Efficient Reasoning via Collaborative Decoding cs.CL | cs.AI | cs.LGPDF

Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu

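TL;DR: RelayLLM is a framework for efficient reasoning via token-level collaborative decoding. It addresses the high cost and latency of large language models (LLMs) and the limited reasoning capacity of small language models (SLMs) by letting the SLM act as an active controller that invokes the LLM only for critical tokens, greatly reducing computation.

Details

Motivation: Existing collaborative approaches such as cascading or routing operate at coarse granularity and offload entire queries to the LLM, which wastes computation when the SLM can handle most reasoning steps. A finer-grained collaboration mechanism is needed to balance performance and cost.

Result: Across six benchmarks, RelayLLM achieves an average accuracy of 49.52%, narrowing the gap between the two models, while invoking the LLM for only 1.07% of generated tokens. This is a 98.2% cost reduction compared with performance-matched random routers.

Insight: The novelty is a token-level dynamic collaborative decoding framework in which the SLM decides when to call the LLM. A two-stage training procedure (warm-up followed by Group Relative Policy Optimization, GRPO) teaches the SLM to balance independent generation against help-seeking.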

Abstract: The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively “relaying” the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
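
The control flow of token-level relaying can be sketched with two stub models. This is a toy under stated assumptions, not the authors' implementation: the command token `<call_llm>`, the fixed hand-over length `relay_len`, and the scripted `toy_slm` and `toy_llm` functions are all hypothetical stand-ins.

```python
CALL = "<call_llm>"  # hypothetical command token; the paper's actual token is not given
EOS = "<eos>"


def relay_decode(slm_step, llm_step, prompt, relay_len=1, max_tokens=32):
    """Generate with the small model, handing over to the large model on CALL.

    slm_step and llm_step each map the token list so far to the next token.
    Returns (generated_tokens, number_of_tokens_written_by_the_llm).
    """
    if relay_len < 1:
        raise ValueError("relay_len must be at least 1")
    tokens = list(prompt)
    start = len(tokens)
    llm_tokens = 0
    while len(tokens) - start < max_tokens:
        tok = slm_step(tokens)
        if tok == CALL:
            # The small model asked for help: the large model writes a short span.
            for _ in range(relay_len):
                if len(tokens) - start >= max_tokens:
                    break
                tok = llm_step(tokens)
                tokens.append(tok)
                llm_tokens += 1
                if tok == EOS:
                    break
        else:
            tokens.append(tok)
        if tok == EOS:
            break
    return tokens[start:], llm_tokens


# Scripted stand-ins: the small model "knows" every token except the answer digit.
TARGET = ["2", "+", "2", "=", "4", EOS]


def toy_slm(tokens):
    pos = len(tokens)
    return CALL if pos == 4 else TARGET[pos]


def toy_llm(tokens):
    return TARGET[len(tokens)]


output, llm_count = relay_decode(toy_slm, toy_llm, prompt=[])
llm_share = llm_count / len(output)
```

In this toy run the large model writes one of six tokens. The 1.07% figure in the abstract is the analogous share measured over the paper's benchmarks.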

[40] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems cs.CLPDF

Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu

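TL;DR: This paper proposes Inside Out, a framework that maintains a global PersonaTree as the carrier of long-term user profiling and uses a lightweight MemListener model to perform structured memory operations. It targets memory noise accumulation, reasoning degradation, and persona inconsistency in long-term personalized dialogue systems, and offers two response modes: low-latency and on-demand detail.

Details

Motivation: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context, which leads to memory noise accumulation, reasoning degradation, and persona inconsistency. The paper addresses these problems through controllable memory compression and structured operations.

Result: Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems at suppressing contextual noise and maintaining persona consistency. The lightweight MemListener matches or surpasses strong reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro in memory-operation decisions.

Insight: The contributions are controllable memory growth and compression via PersonaTree, a MemListener trained with reinforcement learning to emit structured, executable operations, and support for both low-latency and on-demand-detail response modes.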

Abstract: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
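
To make the {ADD, UPDATE, DELETE, NO_OP} operation set concrete, here is a minimal sketch that applies such operations to a nested dictionary standing in for a PersonaTree. The dictionary layout, the path convention, the example trunk keys, and the rule that an unknown trunk key is rejected are illustrative assumptions; the abstract does not describe the actual data structure or the operation format emitted by MemListener.

```python
def apply_op(tree, op, path=(), value=None):
    """Apply one memory operation to a nested-dict tree (hypothetical layout).

    path is a sequence of keys; every key except the last must already exist,
    which mimics a trunk fixed by an initial schema.
    """
    if op == "NO_OP":
        return tree
    if op not in ("ADD", "UPDATE", "DELETE"):
        raise ValueError(f"unknown operation: {op}")
    *parents, leaf = path
    node = tree
    for key in parents:
        if key not in node:
            raise KeyError(f"{key!r} is not part of the schema")
        node = node[key]
    if op == "ADD":
        node.setdefault(leaf, []).append(value)  # grow a leaf
    elif op == "UPDATE":
        node[leaf] = [value]                     # overwrite stale facts
    else:
        node.pop(leaf, None)                     # DELETE: prune the leaf
    return tree


# The trunk (top-level keys) is fixed up front; only branches and leaves change.
persona = {"preferences": {}, "background": {}}
apply_op(persona, "ADD", ("preferences", "food"), "likes ramen")
apply_op(persona, "ADD", ("preferences", "food"), "avoids peanuts")
apply_op(persona, "UPDATE", ("background", "city"), "Osaka")
apply_op(persona, "NO_OP")
apply_op(persona, "DELETE", ("preferences", "food"))
```

Keeping operations this small and explicit is what makes each update executable and easy to audit, which is the property the abstract emphasizes.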

[41] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence cs.CL | cs.CY | cs.LGPDF

P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary

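TL;DR: This study uses machine learning and AI to measure national levels of peace from news and social media. It also develops MirrorMirror, a Chrome extension that gives users real-time feedback on the peacefulness of the media they consume, with the aim of encouraging more respectful, nuanced, and informative communication.

Details

Motivation: News and social media content is biased toward emotional activation (such as anger) to increase clicks. The study aims to measure and improve the peacefulness of media content and to reduce the spread of bias and negative emotion.

Result: The news media model, trained on one dataset, remained highly accurate on a different news dataset. The social media analysis combines word-level (GoEmotions) and context-level (large language model) methods. MirrorMirror has been developed and tested, and the long-term goal is for it to become an open-source tool.

Insight: The contributions are measuring peace levels across datasets with text embeddings and neural networks, and a real-time feedback tool that supports media literacy. The work applies AI to measuring and fostering peace and has cross-disciplinary potential, but its practical effect still needs validation.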

Abstract: We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

[42] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization cs.CL | cs.AI | cs.LGPDF

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak

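TL;DR: This paper proposes GDPO (Group reward-Decoupled Normalization Policy Optimization) for multi-reward reinforcement learning, where directly applying GRPO (Group Relative Policy Optimization) reduces the resolution of the reward signal and leads to unstable training and suboptimal convergence. GDPO normalizes each reward separately, which better preserves the relative differences between rewards and enables more accurate multi-reward optimization.

Details

Motivation: As language models become more capable, users expect not only accurate answers but also behavior aligned with diverse human preferences across scenarios. RL pipelines have therefore begun to incorporate multiple rewards, each capturing one preference. Recent work applies GRPO by default in the multi-reward setting without examining its suitability. The paper shows that doing so collapses distinct reward combinations into identical advantage values, which reduces the resolution of the training signal and causes suboptimal convergence or even early training failure.

Result: On three tasks (tool calling, math reasoning, and coding reasoning), GDPO consistently outperforms GRPO on correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length).

Insight: The paper exposes a normalization flaw of GRPO in multi-reward settings and proposes GDPO, which normalizes each reward independently to preserve relative differences between rewards, improving optimization accuracy and training stability. The decoupled normalization strategy offers a way to balance different preferences more precisely in multi-objective RL.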

Abstract: As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) under multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
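
The collapse described above can be reproduced on a tiny group of rollouts. The sketch below compares group-normalizing the summed reward (the GRPO-style baseline the abstract criticizes) with normalizing each reward separately and then summing. The second function only illustrates the decoupling idea: the abstract does not give GDPO's exact formula, so any further steps the method applies are not modeled, and the reward values and function names are made up.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Standardize values within one group of rollouts."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(reward_columns):
    """GRPO-style baseline: add the rewards per rollout, then normalize the totals."""
    totals = [sum(rs) for rs in zip(*reward_columns)]
    return normalize(totals)


def normalized_then_summed(reward_columns):
    """Decoupled sketch: normalize each reward across the group, then add."""
    columns = [normalize(col) for col in reward_columns]
    return [sum(rs) for rs in zip(*columns)]


# Four rollouts scored by two binary rewards (made-up values).
correctness = [1, 0, 1, 1]
format_ok = [0, 1, 0, 1]

coupled = summed_then_normalized([correctness, format_ok])
decoupled = normalized_then_summed([correctness, format_ok])
```

Rollouts 0 and 1 earn different reward combinations, (1, 0) and (0, 1), yet their totals are both 1, so the summed version assigns them the same advantage. The decoupled version separates them because the two rewards have different group statistics.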

cs.CV [Back]

[43] Unified Text-Image Generation with Weakness-Targeted Post-Training cs.CV | cs.AIPDF

Jiahui Chen, Philippe Hansen-Estruch, Xiaochuang Han, Yushi Hu, Emily Dinan

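TL;DR: This paper proposes weakness-targeted post-training for unified text-image generation, so that a model transitions autonomously from textual reasoning to visual synthesis within a single inference process, improving text-to-image performance.

Details

Motivation: Many existing unified multimodal generation architectures rely on explicit modality switching (generating reasoning text, then manually switching to image generation). This limits cross-modal coupling and prevents automatic multimodal generation, motivating fully unified text-image generation.

Result: Using offline, reward-weighted post-training with fully self-generated synthetic data, the method improves multimodal image generation on four diverse text-to-image benchmarks.

Insight: The key choices are a targeted post-training dataset that addresses specific limitations, rather than broad image-caption corpora or benchmark-aligned data, and reward-weighting both the text and image modalities.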

Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.

[44] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers cs.CVPDF

Mohsen Ghafoorian, Amirhossein Habibian

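TL;DR: This paper proposes ReHyAt (Recurrent Hybrid Attention) for video diffusion transformers, addressing the limited sequence-length scalability caused by quadratic attention complexity. It combines the fidelity of softmax attention with the efficiency of linear attention, enables a chunk-wise recurrent reformulation with constant memory usage, and uses a lightweight distillation and fine-tuning pipeline that greatly reduces training cost.

Details

Motivation: Transformer-based video diffusion models achieve state-of-the-art video generation, but quadratic attention complexity severely limits scalability to long sequences. An attention mechanism is needed that preserves generation quality while scaling efficiently.

Result: On VBench and VBench-2.0 and in a human preference study, ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, making long-duration and on-device video generation practical.

Insight: The hybrid design (softmax plus linear attention) allows efficient distillation from existing softmax-based models, cutting training cost by two orders of magnitude to about 160 GPU hours. The lightweight distillation and fine-tuning recipe can be applied to future bidirectional softmax-based models.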

Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt’s hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while remaining competitive in quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.

[45] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets cs.CV | cs.LGPDF

Ibrahim Tanvir, Alif Ruslan, Sartaj Solaiman

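TL;DR: This study compares custom convolutional neural networks (CNNs) with pre-trained models (ResNet-18 and VGG-16) on five image datasets from Bangladesh, evaluating both feature extraction and transfer learning. Transfer learning with fine-tuning consistently outperforms custom CNNs trained from scratch and feature extraction, especially when data is limited or the task is complex.

Details

Motivation: The study compares deep learning strategies (custom CNN, feature extraction, transfer learning) on several real-world datasets to give practitioners practical guidance for choosing an approach based on dataset characteristics, computational resources, and performance requirements.

Result: On five Bangladesh datasets (Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, Road Damage Detection), transfer learning with fine-tuning improves accuracy by 3% to 76% over custom CNNs and feature extraction, and ResNet-18 with fine-tuning reaches 100% accuracy on the Road Damage BD dataset. Custom CNNs have advantages in model size (3.4M parameters) and training efficiency on simpler tasks, but pre-trained models perform better on complex classification tasks.

Insight: The paper's claimed contribution is a systematic comparison of several deep learning approaches on diverse real-world datasets, highlighting the effectiveness of fine-tuned transfer learning when data is limited. The practical takeaway is that transfer learning is the more reliable choice for data-scarce applications, while custom CNNs remain valuable where computational efficiency matters.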

Abstract: This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.

[46] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache cs.CVPDF

Kunyang Li, Mubarak Shah, Yuzhang Shang

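TL;DR: This paper proposes PackCache, a training-free KV-cache acceleration method for unified autoregressive video generation models. It dynamically compacts the KV cache through three mechanisms (condition anchoring, cross-frame decay modeling, and spatially preserving position embedding), which substantially improves inference efficiency for long-sequence video generation.

Details

Motivation: In unified autoregressive models for multimodal tasks, the KV-cache grows linearly with the number of generated tokens and becomes the main bottleneck for inference efficiency and generation length. The paper addresses this problem for video generation in particular.

Result: For 48-frame video generation, PackCache speeds up end-to-end generation by 1.7-2.2x. On the final four frames, the portion most affected by KV-cache growth, it achieves 2.6x and 3.7x speedups on A40 and H200, respectively.

Insight: The paper observes that KV-cache tokens have distinct spatiotemporal properties (text and conditioning-image tokens act as persistent semantic anchors, and attention decays with temporal distance) and designs a training-free dynamic cache compaction mechanism around them, balancing efficiency and generation quality.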

Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, on the final four frames (the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip), PackCache delivers 2.6x and 3.7x accelerations on A40 and H200, respectively, for 48-frame videos.
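
Two of the three mechanisms, condition anchoring and distance-based budget allocation, can be sketched on plain token lists. This is a toy under stated assumptions: the geometric decay schedule, the `decay` value, keeping the highest-scoring tokens per frame, and both function names are illustrative choices that the abstract does not specify, and the position-embedding mechanism is not modeled at all.

```python
def frame_budgets(num_frames, tokens_per_frame, total_budget, decay=0.5):
    """Split a token budget over past frames, most recent frame first.

    The frame at distance d gets a share proportional to decay**d
    (hypothetical schedule), capped at the frame's token count.
    """
    weights = [decay ** d for d in range(num_frames)]
    scale = total_budget / sum(weights)
    return [min(tokens_per_frame, int(w * scale)) for w in weights]


def compact_cache(anchors, frames, scores, total_budget, decay=0.5):
    """Keep every anchor token plus the top-scoring tokens of each past frame.

    frames and scores are parallel lists ordered most recent frame first.
    """
    if not frames:
        return list(anchors)
    budgets = frame_budgets(len(frames), max(len(f) for f in frames),
                            total_budget, decay)
    kept = list(anchors)  # text and conditioning-image tokens are never evicted
    for frame, frame_scores, k in zip(frames, scores, budgets):
        top = sorted(range(len(frame)), key=lambda i: frame_scores[i],
                     reverse=True)[:k]
        kept.extend(frame[i] for i in sorted(top))  # preserve original order
    return kept


anchors = ["txt0", "img0"]
frames = [["a0", "a1", "a2", "a3"],   # most recent frame
          ["b0", "b1", "b2", "b3"],
          ["c0", "c1", "c2", "c3"]]   # oldest frame
scores = [[0.1, 0.2, 0.3, 0.4],
          [0.9, 0.1, 0.8, 0.2],
          [0.1, 0.7, 0.2, 0.3]]
kept = compact_cache(anchors, frames, scores, total_budget=7)
```

With a budget of 7 frame tokens, the most recent frame is kept whole, the middle frame keeps two tokens, and the oldest keeps one, while both anchor tokens always survive.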

[47] Combining facial videos and biosignals for stress estimation during driving cs.CVPDF

Paraskevi Valergaki, Vassilis C. Nicodemou, Iason Oikonomidis, Antonis Argyros, Anastasios Roussos

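TL;DR: This paper proposes combining facial videos and physiological signals to estimate stress during driving. It analyzes EMOCA-derived 3D facial geometry features (expression and pose coefficients) together with physiological signals, using a Transformer-based temporal modeling framework and a cross-modal attention fusion strategy, and substantially improves stress recognition performance.

Details

Motivation: Stress recognition from facial video alone is unreliable because stress is subjective and facial expressions can be voluntarily controlled. The paper explores the role of disentangled 3D facial geometry in stress estimation.

Result: On stress recognition during distracted driving, cross-modal attention fusion of EMOCA and physiological signals achieves the best performance (AUROC 92%, accuracy 86.7%), outperforming unimodal and early-fusion strategies. EMOCA-gaze fusion is also competitive (AUROC 91.8%).

Insight: The paper systematically analyzes how 3D facial geometry coefficients respond to stress and designs a Transformer-based cross-modal attention fusion framework that integrates temporal facial features with physiological signals, offering a new direction for multimodal stress recognition.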

Abstract: Reliable stress recognition from facial videos is challenging due to stress’s subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves the best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.

[48] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection cs.CV | cs.AIPDF

Maxim Clouser, Kia Khezeli, John Kalantari

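TL;DR: This paper studies how to use LoRA to fine-tune FLUX.1 Kontext, a flow-matching vision foundation model, on a small number of paired images for cross-spectral translation from RGB to infrared (IR) and synthetic aperture radar (SAR), and how the resulting synthetic data can improve object detection.

Details

Motivation: Vision foundation models are trained mainly on RGB data, while many safety-critical applications rely on non-visible modalities such as IR and SAR. The study explores adapting a foundation model with very few paired samples (only 100 per domain) to support cross-modal translation and strengthen downstream detection.

Result: On KAIST (RGB to IR) and M4-SAR (RGB to SAR), LPIPS computed on only 50 held-out pairs is a strong proxy for downstream detection performance (mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR). With the best LoRA adapter, synthetic IR data improves KAIST pedestrian detection, and synthetic SAR combined with limited real SAR significantly improves infrastructure detection on M4-SAR.

Insight: The contributions are few-shot LoRA adaptation of a flow-matching foundation model for cross-spectral translation, LPIPS as a proxy metric for downstream detection performance, and evidence that synthetic data improves object detection in non-visible modalities, giving a lightweight path to extend foundation models to such modalities.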

Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.

Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang

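TL;DR: The paper proposes Tri-MARF, a framework that integrates three input modalities (2D multi-view images, textual descriptions, and 3D point clouds) within a multi-agent collaborative architecture to improve the efficiency and accuracy of large-scale 3D object annotation. It comprises three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects the optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry to refine the captions.

Details

Motivation: Existing single-model approaches to 3D object annotation are limited by spatial complexity, occlusion, and viewpoint inconsistency, and struggle to meet the needs of applications such as autonomous driving, robotics, and augmented reality.

Result: Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO show that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

Insight: The novelty lies in combining multimodal inputs (2D images, text, and 3D point clouds) with a multi-agent collaborative architecture. Dividing the work among specialized agents (description generation, information aggregation, and semantic-geometric alignment) improves annotation accuracy and scalability.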

Abstract: Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

[50] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization cs.CV | cs.CLPDF

Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong

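TL;DR: The paper proposes Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that addresses overthinking caused by chain-of-thought mechanisms in large vision-language models (LVLMs). GPRO dynamically routes computation among a lightweight fast path, a slow perception path, and a slow reasoning path to balance task accuracy and computational cost.

Details

Motivation: When LVLMs reason step by step via chain-of-thought, they often overthink and produce verbose responses, which lowers inference efficiency and can even reduce accuracy. Existing adaptive reasoning strategies largely overlook a fundamental bottleneck, visual perception failures: reasoning errors often stem from imperfect perception rather than insufficient deliberation.

Result: Experiments on five benchmarks show that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating shorter responses.

Insight: The paper explicitly separates perception errors from reasoning errors, trains the controller with large-scale failure attribution supervision (about 790k samples), and uses multi-objective reinforcement learning to optimize the accuracy-cost trade-off under uncertainty. Treating visual perception failure as a core bottleneck and routing dynamically makes inference more efficient.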

Abstract: Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
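
The routing decision and the accuracy-cost trade-off can be illustrated with a few lines of code. Everything numeric here is hypothetical: the per-path costs, the `cost_weight`, the greedy choice of path, and the scalar reward are stand-ins, since the abstract describes a learned controller trained with multi-objective reinforcement learning and does not give these details.

```python
PATHS = ("fast", "slow_perception", "slow_reasoning")

# Hypothetical relative compute cost of taking each path for one step.
COST = {"fast": 1.0, "slow_perception": 4.0, "slow_reasoning": 6.0}


def route(gate_scores):
    """Pick the path with the highest gate score for the current step."""
    return max(PATHS, key=lambda p: gate_scores[p])


def episode_reward(correct, chosen_paths, cost_weight=0.01):
    """Scalarized objective: task accuracy minus a penalty on compute spent."""
    compute = sum(COST[p] for p in chosen_paths)
    return float(correct) - cost_weight * compute


# A step where the controller suspects a perception problem.
step_choice = route({"fast": 0.2, "slow_perception": 0.7, "slow_reasoning": 0.1})

# Two correct answers: one mostly on the fast path, one that deliberates throughout.
cheap = episode_reward(True, ["fast", "fast", "slow_perception"])
expensive = episode_reward(True, ["slow_reasoning"] * 3)
```

Under this toy objective a correct answer reached mostly on the fast path scores higher than an equally correct answer that deliberates at every step, which is the pressure against overthinking that the abstract describes.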

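[51] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving cs.CV

Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu

TL;DR: This paper proposes UniDrive-WM, a unified world model for autonomous driving built on a vision-language model (VLM). Within a single architecture, it jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation. Experiments on the challenging Bench2Drive benchmark show that it generates high-fidelity future images and significantly outperforms the previous best method in trajectory error and collision rate.

Details

Motivation: Current autonomous driving systems typically treat perception, prediction, and planning as separate modules, which can lose information and yield suboptimal performance. This paper aims to integrate these tasks tightly in a unified VLM-based world model to improve scene understanding and planning safety.

Result: On Bench2Drive, UniDrive-WM improves L2 trajectory error by 5.9% and collision rate by 9.2% over the previous best method, and generates high-fidelity future images.

Insight: The core innovation is an end-to-end unified architecture that jointly optimizes scene understanding, trajectory planning, and generative future prediction, using future image generation as an additional supervisory signal to iteratively refine trajectory planning. The paper also compares discrete and continuous output representations for future image prediction and analyzes their effect on downstream driving performance, which informs model design.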

Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM’s trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .
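
Sketch (not from the paper): the planning result is reported as an L2 trajectory error. A minimal way to compute such an error, assuming predicted and ground-truth trajectories are equal-length lists of `(x, y)` waypoints, is shown below. Bench2Drive's exact protocol (time horizons, averaging) may differ, and the abstract does not say whether the 5.9% figure is a relative or an absolute change; `relative_improvement` shows the relative reading only.

```python
import math


def l2_error(pred, gt):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline, ours):
    """Relative reduction of a lower-is-better metric, as a fraction."""
    return (baseline - ours) / baseline
```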

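[52] Vision-Language Agents for Interactive Forest Change Analysis cs.CV | cs.AI | cs.CL

James Brock, Ce Zhang, Nantheera Anantrasirichai

TL;DR: This paper proposes an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple remote sensing image change interpretation tasks. The system builds on a multi-level change interpretation vision-language backbone with LLM-based orchestration. The authors also introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions.

Details

Motivation: The work addresses two persistent challenges in forest monitoring, pixel-level change detection and semantic change captioning for complex forest dynamics, and explores the underexplored integration of LLMs with vision-language models for remote sensing image change interpretation.

Result: The system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of the LEVIR-MCI benchmark.

Insight: The novelty is an LLM-driven integrated analysis framework with natural language interaction that unifies change detection and semantic captioning. Viewed objectively, the combination of a multi-level change interpretation backbone with an LLM orchestrator, together with the new forest change dataset with multi-granularity annotations, is a useful reference for interpretable, interactive analysis of remote sensing imagery.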

Abstract: Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
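
Sketch (not from the paper): the change-detection result is reported as mIoU. A minimal reference computation, assuming predictions and targets are flat lists of integer class labels per pixel (for example 0 = no change, 1 = change), could look like this. The authors' exact averaging convention is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```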

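[53] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction cs.CV

Ziyou Jiang, Mingyang Li, Junjie Wang, Yuekai Huang, Jie Huang

TL;DR: This paper proposes RepMD, which detects ever-shifting harmful memes online by reproducing their design concepts. The method first defines a Design Concept Graph (DCG) that describes the steps involved in designing a harmful meme, then derives and prunes the DCG from historical memes, and finally uses the DCG to guide a Multimodal Large Language Model (MLLM) in detection.

Details

Motivation: Harmful memes keep evolving in Internet communities, and their type-shifting and temporal-evolving nature makes them hard to analyze. The authors find that different memes may share invariant design principles, namely the design concepts of malicious users, which help explain why the memes are harmful.

Result: RepMD achieves the highest accuracy, 81.1%, in the detection task, with slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that the method improves the efficiency of human discovery of harmful memes, at 15 to 30 seconds per meme.

Insight: The novelty is the Design Concept Graph (DCG), which captures the invariant design principles of harmful memes and guides the MLLM in multimodal detection, offering an interpretable, structured way to handle dynamically evolving content.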

Abstract: Harmful memes in Internet communities are ever-shifting and difficult to analyze due to their type-shifting and temporal-evolving nature. Although these memes keep shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes the steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use the DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy, 81.1%, and shows slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery of harmful memes, at 15 to 30 seconds per meme.

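[54] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing cs.CV

Zihao Lin, Wanrong Zhu, Jiuxiang Gu, Jihyung Kil, Christopher Tensmeyer

TL;DR: This paper proposes MiLDEAgent, a reasoning-based framework for multi-layer design document editing that combines an RL-trained multimodal reasoner with an image editor to tackle editing multi-layer design documents (such as posters) from natural-language instructions. The authors also build MiLDEBench, a benchmark of more than 20K design documents, and the MiLDEEval evaluation protocol. Experiments show that MiLDEAgent significantly outperforms open-source baselines in layer-aware reasoning and precise editing, with performance comparable to closed-source models.

Details

Motivation: Real-world design documents (such as posters) are inherently multi-layered, combining decoration, text, and images. Prior work largely overlooks multi-layer design document editing and focuses on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what to modify and where.

Result: Extensive experiments on MiLDEBench show that existing methods (14 open-source and 2 closed-source models) fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and matching closed-source models, which establishes the first strong baseline for multi-layer design document editing.

Insight: The paper is the first to systematically address multi-layer design document editing, proposing a framework that combines an RL-trained multimodal reasoner with an image editor and building a large-scale benchmark and evaluation protocol. Viewed objectively, its layer-aware reasoning mechanism and task-specific evaluation dimensions (instruction following, layout consistency, aesthetics, and text rendering) offer a structured approach that other complex document editing tasks can draw on.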

Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.

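[55] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation cs.CV | cs.AI

Xiaoyu Liu, Siwen Wei, Linhao Qu, Mingyuan Pan, Chengsheng Zhang

TL;DR: This paper proposes HUR-MACL, a model for multi-organ segmentation in the head and neck. It adaptively identifies high-uncertainty regions with a convolutional neural network and, for those regions, uses Vision Mamba and a deformable CNN together to improve segmentation accuracy. A heterogeneous feature distillation loss promotes collaborative learning between the two architectures in high-uncertainty regions.

Details

Motivation: The work targets segmentation of organs at risk in the head and neck, especially small, complexly shaped organs, where existing hybrid architectures merely concatenate features, which causes functional overlap and limits segmentation accuracy.

Result: The method achieves state-of-the-art (SOTA) results on two public datasets and one private dataset.

Insight: The novelty is a collaborative learning framework guided by high-uncertainty regions: it adaptively identifies difficult regions, combines the strengths of Vision Mamba and deformable CNN specifically there, and uses a heterogeneous feature distillation loss to promote collaboration between the architectures, improving segmentation of small, complex organs.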

Abstract: Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba as well as Deformable CNN are utilized to jointly improve their segmentation accuracy. Additionally, a heterogeneous feature distillation loss was proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.
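
Sketch (not from the paper): the abstract does not say how uncertainty is measured. One common choice, used here purely as an assumption, is the entropy of the per-pixel softmax output, thresholded to produce a mask of high-uncertainty pixels. The `threshold` value is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_uncertainty_mask(prob_map, threshold):
    """prob_map: list of per-pixel class-probability lists. Returns a list of bools."""
    return [entropy(p) > threshold for p in prob_map]
```

In a pipeline like the one described, such a mask would select the pixels handed to the second-stage architectures; how HUR-MACL actually does this is not specified in the abstract.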

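[56] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment cs.CV

Wenzhi Chen, Bo Hu, Leida Li, Lihuo He, Wen Lu

TL;DR: This paper proposes HyperAlign, a framework that uses hyperbolic entailment geometry to adaptively assess image-text alignment in text-to-image generation. It maps Euclidean features extracted by CLIP into hyperbolic space, uses a dynamic-supervision entailment modeling mechanism to turn discrete entailment logic into continuous geometric structure supervision, and designs an adaptive modulation regressor that uses hyperbolic geometric features to generate sample-level modulation parameters, which calibrate Euclidean cosine similarity to predict the final alignment score.

Details

Motivation: Existing image-text alignment assessment methods rely on Euclidean space metrics, neglect the structured nature of semantic alignment, and lack adaptivity across samples.

Result: HyperAlign achieves highly competitive performance in both single-database evaluation and cross-database generalization, validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.

Insight: The novelty is bringing hyperbolic geometry into image-text alignment assessment: dynamic-supervision entailment modeling turns logical relations into geometric constraints, and hyperbolic features drive sample-adaptive score calibration, which improves structured modeling and generalization.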

Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.

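[57] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning cs.CV | cs.CL

Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li

TL;DR: This paper proposes Agri-R1, a vision-language model whose agricultural reasoning is strengthened through reinforcement learning. The method automatically generates high-quality reasoning data via vision-language synthesis and LLM-based filtering, using only 19% of the available samples, and trains with GRPO using a reward function that combines domain-specific lexicons and fuzzy matching, to improve accuracy and linguistic flexibility in open-ended agricultural question answering.

Details

Motivation: In agricultural disease diagnosis, conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly, while existing reasoning methods rely on costly expert annotations and struggle with open-ended, diverse agricultural queries.

Result: On CDDMBench, the 3B-parameter Agri-R1 model is competitive with 7B- to 13B-parameter baselines, with a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning.

Insight: The novelties are an automated pipeline for generating high-quality reasoning data and a reward function that combines domain-specific lexicons with fuzzy matching; the synergy between structured reasoning data and GRPO-driven exploration improves performance on complex questions.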

Abstract: Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
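
Sketch (not from the paper): GRPO scores each sampled response relative to the other responses in its group, typically by standardizing rewards within the group. The snippet pairs that step with a toy reward built from fuzzy string similarity plus a lexicon bonus. The lexicon entries, the 0.2 bonus, and the use of `difflib` are illustrative assumptions, not the authors' reward function.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev

# Hypothetical domain lexicon; the paper's lexicons are not listed in the abstract.
LEXICON = {"late blight", "powdery mildew"}


def fuzzy_reward(answer, reference, lexicon=LEXICON):
    """Fuzzy similarity in [0, 1], with a bonus when both texts share a lexicon term."""
    a, r = answer.lower(), reference.lower()
    sim = SequenceMatcher(None, a, r).ratio()
    bonus = 0.2 if any(term in a and term in r for term in lexicon) else 0.0
    return min(1.0, sim + bonus)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above their group's mean get a positive advantage and are reinforced; the fuzzy term lets a correct but differently worded answer still earn reward.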

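[58] DB-MSMUNet: Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation cs.CV

Qiu Guan, Zhiqiang Yang, Dezhang Ye, Yang Chen, Xinli Xu

TL;DR: This paper proposes DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a new encoder-decoder architecture designed for accurate segmentation of pancreatic CT scans. It uses a multi-scale Mamba module to strengthen global context modeling and local deformation adaptation, a dual-decoder design that handles boundary and region information separately, and auxiliary deep supervision to make multi-scale features more discriminative. Experiments on several datasets show that it outperforms existing advanced methods in segmentation accuracy, edge preservation, and robustness.

Details

Motivation: Accurate segmentation of the pancreas and its lesions in CT scans is crucial for precise diagnosis and treatment of pancreatic cancer, but the task is highly challenging because of low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and small lesions.

Result: On the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset, DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in segmentation accuracy, edge preservation, and cross-dataset robustness.

Insight: The novelties are: 1) a multi-scale Mamba module that combines deformable convolutions with multi-scale state space modeling to strengthen both global context and local deformation adaptation; 2) a dual-decoder design in which the edge decoder explicitly captures boundary cues through an edge enhancement path, while the area decoder uses a multi-layer decoder and multi-scale deep semantic features to preserve fine-grained details and reconstruct small lesions; 3) auxiliary deep supervision heads added to both decoders at multiple scales, which give more accurate gradient feedback and make multi-scale features more discriminative.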

Abstract: Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals. DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.
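
Sketch (not from the paper): the reported metric is the Dice Similarity Coefficient, 2|A ∩ B| / (|A| + |B|). A minimal computation over flat binary masks follows; treating two empty masks as a perfect match is a convention chosen here, not something the abstract specifies.

```python
def dice(pred, target):
    """Dice coefficient for two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * inter / total
```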

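[59] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution cs.CV

Yang Zou, Xingyue Zhu, Kaiqi Han, Jun Ma, Xingyuan Li

TL;DR: This paper proposes HATIR, a heat-aware diffusion model for turbulent infrared video super-resolution that injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. It uses a Phasor-Guided Flow Estimator to produce reliable turbulence-aware flow that guides reverse diffusion, and a Turbulence-Aware Decoder that selectively suppresses unstable temporal cues and enhances edge-aware feature aggregation. The authors also build FLIR-IVSR, the first dataset for turbulent infrared video super-resolution.

Details

Motivation: Existing video super-resolution methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions, and directly cascading turbulence mitigation algorithms with super-resolution methods causes error propagation and accumulation because the degradations are modeled separately.

Result: Experiments are conducted on the authors' FLIR-IVSR dataset, but the abstract does not state specific quantitative results or comparisons with SOTA methods.

Insight: The main novelties are: 1) the first infrared video super-resolution diffusion framework that jointly models turbulent degradation and resolution degradation; 2) a Phasor-Guided Flow Estimator grounded in the physical principle of phasor responses in thermally active regions; 3) a Turbulence-Aware Decoder for selective feature aggregation; 4) the first dataset for turbulent infrared video super-resolution, which supports further research in the area.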

Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 × 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: https://github.com/JZ0606/HATIR

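[60] WebCryptoAgent: Agentic Crypto Trading with Web Informatics cs.CV

Ali Kurban, Wei Luo, Liangyu Zuo, Zeyu Zhang, Renda Han

TL;DR: This paper proposes WebCryptoAgent, an agentic framework for cryptocurrency trading that tackles the challenge of integrating heterogeneous web information with market microstructure signals for short-horizon decisions. Modality-specific agents process unstructured web content, social sentiment, and structured OHLCV signals and produce a unified evidence document for confidence-calibrated reasoning. A decoupled control architecture separates hourly strategic reasoning from a second-level real-time risk model, enabling fast shock detection and protective intervention independent of the trading loop.

Details

Motivation: Existing trading systems struggle to reason jointly over noisy multi-source web evidence while staying robust to rapid price shocks at sub-second timescales. The first challenge is synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent, interpretable trading decisions without amplifying spurious correlations; the second is risk control, because slow deliberative reasoning pipelines are ill-suited to abrupt market shocks that require immediate defensive responses.

Result: Extensive experiments on real-world cryptocurrency markets show that WebCryptoAgent improves trading stability, reduces spurious activity, and handles tail risk better than existing baselines.

Insight: The novelties are decomposing web-informed decision making into modality-specific agents whose outputs are consolidated for confidence-calibrated reasoning, and a decoupled control architecture that separates strategic reasoning from the real-time risk model for fast shock response. Viewed objectively, combining multi-agent collaboration with decoupled risk control offers a new direction for designing intelligent trading systems in high-frequency, high-volatility settings.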

Abstract: Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at https://github.com/AIGeeksGroup/WebCryptoAgent.
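
Sketch (not from the paper): the decoupled design keeps a fast risk check separate from slow strategic reasoning. The toy guard below flags a shock when price falls by more than `max_drop` from the peak of a short rolling window, and overrides whatever the strategic layer last decided. The class name, window size, threshold, and action labels are all hypothetical; the paper's risk model is not described at this level in the abstract.

```python
from collections import deque


class ShockGuard:
    """Flags a shock when price drops more than max_drop from the rolling-window peak."""

    def __init__(self, window=5, max_drop=0.03):
        self.prices = deque(maxlen=window)
        self.max_drop = max_drop

    def update(self, price):
        """Feed one tick; returns True if a shock is detected on this tick."""
        self.prices.append(price)
        peak = max(self.prices)
        return (peak - price) / peak > self.max_drop


def act(strategic_signal, shock):
    """The fast guard overrides the slow strategic decision when a shock fires."""
    return "flatten" if shock else strategic_signal
```

The point of the separation is latency: `update` is cheap enough to run on every tick, while the strategic signal can be refreshed on a much slower schedule.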

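[61] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models cs.CV

Yanbing Zeng, Jia Wang, Hanghang Ma, Junqiang Wu, Jie Zhu

TL;DR: This paper proposes Forge-and-Quench, a unified multimodal framework that uses an understanding model to improve the fidelity and detail richness of generated images. An MLLM reasons over the conversational context to produce an enhanced text instruction, which a novel Bridge Adapter maps to a virtual visual representation (the Bridge Feature); this feature is injected into the T2I backbone as a visual guidance signal to refine generation.

Details

Motivation: Integrating image generation and understanding into a single framework is a key goal in the multimodal field, but how understanding can effectively assist generation has not been fully explored. This paper explores using understanding models to improve the fidelity and detail of generated images, rather than relying only on their reasoning abilities and world knowledge.

Result: Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models while maintaining instruction-following accuracy and enhancing world knowledge application. The framework is highly extensible and flexible, migrating efficiently across different MLLM and T2I models with significantly lower training overhead.

Insight: The novelty is a new perspective, using understanding models to directly improve generated image fidelity, together with the Bridge Feature and Bridge Adapter that connect understanding to generation. The framework achieves effective synergy between the two without compromising the MLLM's original multimodal understanding, and is practical and efficient to transfer.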

Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM’s inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at https://github.com/YanbingZeng/Forge-and-Quench.

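[62] On the Holistic Approach for Detecting Human Image Forgery cs.CV

Xiao Guo, Jie Zhu, Anil Jain, Xiaoming Liu

TL;DR: This paper proposes HuForDet, an end-to-end framework for detecting human image forgery. A dual-branch architecture handles face forgery and full-body semantic consistency separately, and a new dataset, HuFor, combines face forgery data with full-body synthetic human images. The framework achieves state-of-the-art performance across diverse human image forgery detection tasks.

Details

Motivation: Existing detection methods specialize in either facial-region forgeries or full-body synthetic images and fail to generalize across the full spectrum of human image forgeries, so a unified detection framework is needed.

Result: In extensive experiments on the unified HuFor dataset, HuForDet achieves state-of-the-art performance across diverse forgery detection tasks and shows superior robustness.

Insight: The novelties are: 1) a face forgery detection branch that combines heterogeneous experts in the RGB and frequency domains, with an adaptive LoG module that captures artifacts from fine-grained boundaries to coarse-scale textures; 2) a contextualized forgery detection branch that uses a multimodal large language model to analyze full-body semantic consistency, with a confidence estimation mechanism that dynamically weights feature fusion; 3) HuFor, a unified face and full-body forgery dataset that supports holistic detection research.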

Abstract: The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.

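[63] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection cs.CV

Yunqing Hu, Zheming Yang, Chang Zhao, Qi Guo, Meng Gao

TL;DR: This paper proposes the AIVD framework, which achieves unified precise object localization and high-quality semantic generation through collaboration between lightweight edge detectors and cloud-based multimodal large language models (MLLMs). It designs an efficient fine-tuning strategy with visual-semantic collaborative augmentation to make the MLLM more robust, and a heterogeneous resource-aware dynamic scheduling algorithm to optimize system throughput and latency.

Details

Motivation: The work addresses the challenges MLLMs face in precise object localization and in resource-constrained edge-cloud collaborative deployment.

Result: Experimental results show that AIVD substantially reduces resource consumption while improving the MLLM's classification performance and semantic generation quality; the proposed scheduling strategy achieves higher throughput and lower latency across diverse scenarios.

Insight: The novelty is the edge-cloud collaborative design: collaborative-augmentation fine-tuning makes the MLLM more robust to edge noise and scenario variations, and a heterogeneous resource-aware dynamic scheduling algorithm handles the complexity of real deployments.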

Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM’s robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.

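[64] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model’s Mathematical Text Recognition cs.CV

Masatomo Yoshida, Haruto Namura, Nicola Adami, Masahiro Okuda

TL;DR: This paper proposes a new skeletonization-based adversarial attack that effectively reduces the search space, in order to explore the visual limitations of large vision-language models such as ChatGPT. The method targets images containing text, particularly mathematical formula images, and analyzes the models' visual interpretation and reasoning abilities by evaluating character and semantic changes between original and adversarially perturbed outputs.

Details

Motivation: The aim is to explore the visual capabilities and limitations of foundation models with a novel adversarial attack that uses skeletonization to reduce the search space, targeting challenging text images such as mathematical formula images to expose weaknesses in the models' visual interpretation and reasoning.

Result: The method is applied to ChatGPT, showing its practical implications in real-world scenarios; evaluating character and semantic changes demonstrates the attack's effectiveness, but no specific benchmarks or comparisons with SOTA are reported.

Insight: The novelty is applying skeletonization to adversarial attacks to reduce the search space and target mathematical formula images, which gives a new view of the visual interpretation abilities of large vision-language models and may inspire more robust model designs.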

Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models’ visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.
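
Sketch (not from the paper): the evaluation compares character-level changes between the model's output on the original image and on the perturbed one. The abstract does not name the metric; Levenshtein edit distance, normalized by the length of the original output, is a common choice and is used here as an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_change_rate(original, perturbed):
    """Edit distance normalized by the length of the original output."""
    return edit_distance(original, perturbed) / max(1, len(original))
```

For LaTeX output, a single-character change such as `b` to `6` in a denominator yields a small rate but a different formula, which is why the paper also looks at semantic changes.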

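[65] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models cs.CV | cs.AI

Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan

TL;DR: This paper proposes GeM-VG, a generalized multi-image visual grounding method based on multimodal large language models (MLLMs). To overcome existing methods' limits on single-target localization and task types, the authors systematically categorize multi-image grounding tasks and build the MG-Data-240K dataset. A hybrid reinforcement fine-tuning strategy that combines chain-of-thought reasoning with direct answering gives the model strong generalization across grounding tasks. Experiments show significant gains on both multi-image and single-image grounding benchmarks while retaining good multi-image understanding.

Details

Motivation: Existing MLLMs are limited in multi-image visual grounding, mainly by single-target localization and a narrow range of task types, and lack unified modeling of generalized grounding tasks. This paper addresses that gap to achieve generalized multi-image visual grounding that handles diverse tasks relying on cross-image cues and reasoning.

Result: On the multi-image grounding benchmarks MIG-Bench and MC-Bench, the model outperforms the previous leading MLLMs by 2.0% and 9.7%, respectively; on the single-image grounding benchmark ODINW, it improves on the base model by 9.1%. It reaches SOTA-level performance while retaining strong general multi-image understanding.

Insight: The novelties are: 1) a systematic categorization of multi-image grounding tasks for unified modeling of generalized tasks; 2) the large-scale MG-Data-240K dataset, which addresses the limits of existing data in target quantity and image relations; 3) a hybrid reinforcement fine-tuning strategy that combines chain-of-thought reasoning with direct answering and uses their complementary strengths to make perception and reasoning more robust. Viewed objectively, the work advances the generalization of MLLMs on complex multimodal grounding tasks through its task categorization, data construction, and training strategy.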

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model’s overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
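
Sketch (not from the paper): the abstract mentions a rule-based reward for grounding but does not define it. A typical rule for box predictions is an IoU threshold; the snippet shows that rule as an assumption, with boxes given as `(x1, y1, x2, y2)` and a hypothetical threshold of 0.5.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_box, gt_box, thresh=0.5):
    """Binary rule-based reward: 1 if the predicted box overlaps enough, else 0."""
    return 1.0 if box_iou(pred_box, gt_box) >= thresh else 0.0
```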

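[66] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models cs.CV | cs.AI | cs.CL | cs.MM

Tobia Poppi, Burak Uzkent, Amanmeet Garg, Lucas Porto, Garin Kessler

TL;DR: This paper proposes the CounterVid framework, which mitigates action and temporal hallucinations in video-language models by synthesizing counterfactual videos. The framework uses multimodal LLMs for action proposals and editing guidance and diffusion models to generate semantic hard negatives at scale, yielding a synthetic dataset of about 26k preference pairs. It further proposes MixDPO, which jointly uses textual and visual preferences for direct preference optimization. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements on tasks such as temporal ordering and transfers effectively to standard video hallucination benchmarks.

Details

Motivation: Video-language models are prone to hallucination when understanding and reasoning about actions and temporal order, and existing mitigation strategies (such as textual filtering or random video perturbations) do not address the root cause: over-reliance on language priors rather than fine-grained visual dynamics.

Result: Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements on tasks such as temporal ordering and transfers effectively to standard video hallucination benchmarks.

Insight: The novelty is a scalable counterfactual video generation framework that synthesizes videos differing only in action or temporal structure while keeping scene context consistent, builds a targeted hard-negative dataset, and designs a unified direct preference optimization method over textual and visual preferences, so that hallucination is mitigated through both data augmentation and the training objective.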

Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
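
Sketch (not from the paper): MixDPO builds on Direct Preference Optimization. The first function below is the standard per-pair DPO loss, computed from sequence log-probabilities under the policy and a frozen reference model. How MixDPO combines textual and visual preferences is not given in the abstract, so `mixed_loss` is only a hypothetical weighted sum, and `beta` and `alpha` are illustrative values.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin < -30:  # avoid overflow in exp(); softplus(x) is ~x for large x
        return -margin
    return math.log1p(math.exp(-margin))


def mixed_loss(text_pair, visual_pair, alpha=0.5, beta=0.1):
    """Hypothetical mix: weighted sum of a textual and a visual DPO term."""
    return (alpha * dpo_loss(*text_pair, beta=beta)
            + (1 - alpha) * dpo_loss(*visual_pair, beta=beta))
```

In the textual case the preferred and rejected items are two answers to one video; in the visual case they would be one answer paired with the original and the counterfactual video. That reading follows the abstract's description but is not confirmed by it.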

[67] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference cs.CV

Denis Korzhenkov, Adil Karjauv, Animesh Karnewar, Mohsen Ghafoorian, Amirhossein Habibian

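TL;DR: This paper proposes PyramidalWan, a pipeline that efficiently converts a pretrained video diffusion model into a pyramidal one through low-cost fine-tuning, substantially improving the inference efficiency of multi-step denoising models without degrading output video quality.

Details

Motivation: Existing open-source pyramidal video models must be trained from scratch and tend to lag behind state-of-the-art systems in visual plausibility, so a method is needed that converts high-quality pretrained models into a pyramidal architecture efficiently and reduces computational cost.

Result: Low-cost fine-tuning converts the pretrained model into a pyramidal one while preserving output video quality. The paper also explores and compares several step-distillation strategies to further improve inference efficiency.

Insight: The novelty is a pipeline for converting a pretrained diffusion model into a pyramidal model, which avoids training from scratch, plus a study of step-distillation strategies within pyramidal models, offering a new direction for efficient video generation.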

Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.
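The core idea of a pyramidal schedule, noisier inputs handled at lower resolution, can be illustrated with a toy mapping from noise level to processing resolution. The equal split of the noise range, the halving of resolution per stage, and the function names are all assumptions made for illustration; the abstract does not specify the schedule PyramidalWan actually uses.

```python
def stage_for_noise(t, num_stages=3):
    """Map a noise level t in [0, 1] (1 = pure noise) to a pyramid stage.

    Stage 0 is full resolution; higher stages are coarser.
    Assumes the noise range is split evenly across stages.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return min(int(t * num_stages), num_stages - 1)


def resolution_at(t, height, width, num_stages=3):
    """Resolution used at noise level t, assuming each stage halves both sides."""
    scale = 2 ** stage_for_noise(t, num_stages)
    return height // scale, width // scale
```

Under these assumptions, early (high-noise) denoising steps run on a frame with 16x fewer pixels than the final steps when three stages are used, which is where the inference savings come from.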

[68] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models cs.CV

Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera

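TL;DR: This paper presents SOVABench, a vehicle action retrieval benchmark built from real surveillance footage, designed to evaluate how well models discriminate vehicle-related actions in surveillance scenarios. It also proposes a training-free MLLM framework that produces interpretable embeddings from generated descriptions and performs well on SOVABench and on other spatial and counting benchmarks.

Details

Motivation: Most existing video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination that surveillance requires, so a retrieval benchmark dedicated to vehicle surveillance actions is needed to fill this gap.

Result: Experiments show that the action distinctions defined in SOVABench remain challenging even for state-of-the-art vision and multimodal models. The proposed training-free framework performs strongly on SOVABench and on several spatial and counting benchmarks where contrastive vision-language models often fail.

Insight: Contributions include: 1) SOVABench, a real-world retrieval benchmark focused on vehicle surveillance actions, with two evaluation protocols covering cross-action discrimination and temporal direction understanding; 2) a training-free framework that uses the visual reasoning and instruction-following abilities of MLLMs to generate interpretable embeddings for both images and videos.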

Abstract: Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.

[69] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform cs.CV | cs.LG

Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra

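TL;DR: This paper presents an industrial-scale generative AI framework for the pharmaceutical industry that processes large-scale multimodal data (PDFs, video, and audio), and systematically evaluates more than 40 vision-language models on long-form video reasoning in terms of performance, efficiency bottlenecks, and practical deployment limits.

Details

Motivation: To address the scalability problems that existing VLMs face when processing long videos in industrial settings (such as pharmaceutical content understanding) under GPU, latency, and cost constraints.

Result: Empirical analysis on the Video-MME and MMBench benchmarks and a proprietary dataset (25,326 videos) shows 3-8x efficiency gains with SDPA attention on commodity GPUs, improvements from multimodality in up to 8 of 12 task domains (especially length-dependent tasks), and bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs.

Insight: Rather than proposing a new model, the paper systematically analyzes, from an industrial-practice perspective, the practical limits, trade-offs, and failure modes of existing VLMs in long-video reasoning. It reports four key findings on the role of multimodality, attention-mechanism trade-offs, temporal reasoning limits, and video-splitting challenges, giving practical guidance for designing scalable industrial multimodal systems.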

Abstract: Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4 and M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.

[70] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics cs.CV | cs.AI

Subhadeep Roy, Gagan Bhatia, Steffen Eger

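TL;DR: This paper studies prototypicality bias in multimodal evaluation metrics: automatic metrics tend to prefer visually and socially prototypical images over semantically correct ones. The authors build the ProtoBias benchmark and find that common metrics such as CLIPScore and PickScore misrank pairs, whereas human evaluation favors semantic correctness. They propose the ProtoScore metric, which substantially reduces the misranking rate and improves robustness.

Details

Motivation: To address the systematic problem that automatic metrics used to evaluate text-to-image models may prioritize visual and social prototypes over semantic correctness, and to expose the blind spots of existing metrics.

Result: On ProtoBias (covering animal, object, and demographic images), widely used metrics including CLIPScore, PickScore, and VQA-based scores frequently misrank pairs consisting of a semantically correct but non-prototypical image and a subtly incorrect but prototypical image, while LLM-as-Judge systems show uneven robustness in socially grounded cases. Human evaluation consistently favors semantic correctness with larger decision margins. ProtoScore substantially reduces failure rates, suppresses misranking, runs orders of magnitude faster than GPT-5 inference, and approaches the robustness of much larger closed-source judges.

Insight: The novelty lies in identifying and quantifying prototypicality bias in multimodal evaluation, building the controlled contrastive benchmark ProtoBias for directional evaluation, and proposing the lightweight, efficient ProtoScore metric to improve semantic alignment. More broadly, the study stresses that metric design should avoid data-distribution bias and provides an extensible test framework for checking whether a metric truly understands semantics.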

Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark, ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running orders of magnitude faster than GPT-5 inference and approaching the robustness of much larger closed-source judges.
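The paired setup lends itself to a simple failure-rate computation: a metric fails on a pair when it scores the prototypical-but-wrong image at least as high as the semantically correct one. The sketch below is an illustrative reading of "misranking", not the benchmark's official scoring code; the function name, the input layout, and the choice to count ties as failures are assumptions.

```python
def misranking_rate(pairs):
    """Fraction of pairs where a metric prefers the adversarial image.

    pairs: iterable of (score_correct, score_adversarial) tuples, i.e. the
    metric's score for the semantically correct image and for its
    prototypical but subtly incorrect counterpart, given the same prompt.
    Ties are counted as failures (an assumption).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    failures = sum(1 for correct, adversarial in pairs if adversarial >= correct)
    return failures / len(pairs)
```

A metric that follows the text should drive this rate toward zero, while a metric that defaults to prototypes will sit well above it.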

[71] Patch-based Representation and Learning for Efficient Deformation Modeling cs.CV

Ruochen Chen, Thuy Tran, Shaifali Parashar

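TL;DR: This paper proposes PolyFit, a patch-based surface representation obtained by fitting jet functions locally. It can be learned efficiently from analytic functions and real data and generalizes to many surface types. PolyFit deforms surfaces efficiently by updating a compact set of jet coefficients instead of optimizing per vertex. Applied to shape-from-template reconstruction and garment draping, it outperforms existing methods in speed and accuracy.

Details

Motivation: Surface deformation tasks in computer vision and graphics are computationally expensive because traditional methods optimize a large number of per-vertex degrees of freedom. PolyFit aims to simplify deformation through a compact representation.

Result: In shape-from-template reconstruction, PolyFit with test-time optimization reaches accuracy comparable to offline physics-based solvers while being markedly faster. In garment draping, the trained self-supervised model outperforms baselines in generalization and inference speed, with up to an order-of-magnitude faster inference.

Insight: Contributions include a patch-based local jet-function representation that supports efficient learning and generalization; more efficient deformation by updating compact coefficients rather than optimizing vertices; and applications that combine test-time optimization and self-supervised learning to balance accuracy and speed.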

Abstract: In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.

[72] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs) cs.CV | cs.LG

Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar

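TL;DR: This paper proposes a personalized video clip generation framework for the pharmaceutical industry that integrates audio language models and vision language models to automate multimodal content processing, addressing the inefficiency and quality problems of traditional manual annotation.

Details

Motivation: Manual annotation of multimodal data in the pharmaceutical industry (text, images, video, audio, and web links) suffers from inconsistency, quality degradation, and inefficiency. The challenge is greater for long video and audio data such as clinical trial interviews and educational seminars, so an intelligent, scalable, automated solution is needed.

Result: Evaluated on the Video MME benchmark (900 videos) and a proprietary dataset (16,159 pharmaceutical videos across 14 disease areas), the method achieves a 3-4x speedup and a 4x cost reduction, and improves clip coherence (0.348) and informativeness (0.721) scores over state-of-the-art VLM baselines such as Gemini 2.5 Pro, showing competitive clip quality.

Insight: Contributions include a reproducible Cut & Merge algorithm (with fade in/out and timestamp normalization for smooth transitions and audio/visual alignment); a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); and a cost-efficient end-to-end pipeline strategy that balances ALM- and VLM-enhanced processing, providing transparent, customizable, and compliance-supporting video summarization for the life sciences.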

Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on the Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.
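The abstract names timestamp normalization and a Cut & Merge step but gives no algorithmic detail. One plausible reading, clamping highlight segments to the video duration and merging segments that overlap or nearly touch, is sketched below. The function name, the `min_gap` parameter, and the merging rule are assumptions, and the fade in/out step is omitted entirely.

```python
def normalize_and_merge(segments, duration, min_gap=0.5):
    """Clamp (start, end) segments in seconds to [0, duration], then merge
    segments that overlap or are separated by at most min_gap seconds.

    Hypothetical helper: an illustration, not the paper's Cut & Merge algorithm.
    """
    cleaned = []
    for start, end in segments:
        start, end = max(0.0, start), min(duration, end)
        if end > start:  # drop segments that are empty after clamping
            cleaned.append((start, end))
    cleaned.sort()

    merged = []
    for start, end in cleaned:
        if merged and start - merged[-1][1] <= min_gap:
            # Extend the previous clip instead of producing a jump cut.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging near-adjacent segments before cutting keeps the number of transitions low, which is one way to serve the smooth-transition goal the abstract mentions.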

[73] Driving on Registers cs.CV | cs.AI | cs.RO

Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy

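TL;DR: This paper proposes DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. It builds on a pretrained Vision Transformer and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores (such as safety, comfort, and efficiency), enabling behavior-conditioned driving at inference.

Details

Motivation: To design a simple, efficient, pure-transformer architecture for end-to-end autonomous driving that reduces computation by compressing multi-camera features while maintaining high accuracy and enabling interpretable, behavior-conditioned driving.

Result: On NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark, DrivoR outperforms or matches strong contemporary baselines, demonstrating accurate, efficient, and adaptive driving.

Insight: The novelty lies in camera-aware register tokens for feature compression and lightweight decoders for trajectory generation and interpretable scoring, which give pure-transformer architectures an efficient and interpretable route into autonomous driving.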

Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
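Behavior-conditioned driving from interpretable sub-scores can be pictured as re-weighting those sub-scores at inference and picking the best candidate. The abstract does not say how DrivoR combines its sub-scores, so the weighted sum below, along with the function name and data layout, is an assumption made purely for illustration.

```python
def select_trajectory(candidates, weights):
    """Return the id of the candidate with the highest weighted sub-score sum.

    candidates: list of {"id": str, "scores": {sub_score_name: float}}
    weights:    {sub_score_name: float}, the behavior preference set at inference.
    A weighted sum is an assumed combination rule, not DrivoR's documented one.
    """
    def total(candidate):
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate["scores"].items())
    return max(candidates, key=total)["id"]


candidates = [
    {"id": "cautious", "scores": {"safety": 0.9, "comfort": 0.8, "efficiency": 0.3}},
    {"id": "fast", "scores": {"safety": 0.7, "comfort": 0.5, "efficiency": 0.9}},
]
```

With these toy values, a safety-heavy weighting selects the cautious trajectory and an efficiency-heavy weighting selects the fast one, without retraining anything.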

[74] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition cs.CV

Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic

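TL;DR: This paper proposes UniLiPs, an unsupervised multimodal pseudo-labeling method for LiDAR data in autonomous driving. It exploits temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without manual annotation, and simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans.

Details

Motivation: To make use of unlabeled LiDAR logs in autonomous driving. These logs are rich in dense 3D geometry, but without human labels they are of little use, and the cost of labeling limits perception research.

Result: The method generalizes robustly across three datasets and compares favorably with existing pseudo-labeling methods for semantic segmentation and object detection that require additional supervision. Using a small fraction of its geometrically consistent, densified LiDAR improves depth-prediction MAE by 51.5% and 22.0% in the 80-150 m and 150-250 m ranges, respectively.

Insight: Contributions include strong geometric priors learned from temporally accumulated LiDAR maps and an iterative update rule that enforces geometric-semantic consistency while detecting moving objects from inconsistencies, enabling unsupervised multimodal 3D pseudo-label generation.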

Abstract: Unlabeled LiDAR logs in autonomous driving applications are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, alongside a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 m and 150-250 m ranges, respectively.

[75] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing cs.CV

Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu

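TL;DR: This paper proposes Re-Align, a framework that uses structured reasoning-guided alignment to bridge the gap between understanding and image generation in unified multimodal models. Its core is the In-Context Chain-of-Thought (IC-CoT) paradigm, which decouples semantic guidance from reference association, combined with a reinforcement learning scheme based on a surrogate reward to improve performance on in-context image generation and editing.

Details

Motivation: In existing unified multimodal models, strong understanding capabilities do not transfer effectively to image generation in in-context generation and editing tasks, which leads to imprecise execution of user intent.

Result: On in-context image generation and editing tasks, Re-Align outperforms competitive methods of comparable model scale and resources.

Insight: The novelty lies in the structured reasoning paradigm IC-CoT, which cleanly separates the generation target from the reference images, and in an RL training scheme that uses a surrogate reward to measure alignment between the reasoning text and the generated image, systematically improving intent alignment and generation quality.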

Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

[76] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding cs.CV | cs.AI

Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal

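TL;DR: This paper proposes VERSE, a method that explores the visual embedding space of vision-language models on Visually-rich Document Understanding tasks. It visualizes latent representations to assess model feasibility, identifies problematic regions, and guides synthetic data generation to improve performance. The method is trained on the synthetic MERIT Dataset and validated on the real-world MERIT Secret dataset. Results show that VERSE reveals the visual features of error-prone clusters, that retraining with samples containing these features substantially improves F1 without hurting generalization, and that on-premise models such as Donut and Idefics2, once optimized, match or surpass SaaS solutions such as GPT-4 and Pixtral.

Details

Motivation: The visual embedding space of vision-language models is under-analyzed in Visually-rich Document Understanding. The goal is to identify model weaknesses through visualized representations and to guide data augmentation that improves performance.

Result: On MERIT Secret, retraining with synthetic data generated under VERSE guidance substantially improves F1, and the optimized on-premise models Donut and Idefics2 match or surpass SaaS models such as GPT-4 and Pixtral.

Insight: The novelty lies in combining clustering analysis with exploration of the visual embedding space to guide targeted data augmentation. The approach of identifying problematic clusters through visualization and generating synthetic data for them is reusable for data augmentation and model debugging in visual document understanding.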

Abstract: This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.

[77] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control cs.CV

Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan

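TL;DR: This paper proposes VerseCrafter, a 4D-aware video world model that provides unified and precise control over camera and object dynamics. Its core is a novel 4D Geometric Control representation that encodes the world state as a static background point cloud plus per-object 3D Gaussian trajectories, which are rendered into conditioning signals for a pretrained video diffusion model to generate high-fidelity, view-consistent videos. To address the scarcity of 4D-annotated data, the authors also build an automatic data engine that extracts the required control signals from in-the-wild videos.

Details

Motivation: Existing video world models struggle to provide unified and precise control over camera and multi-object motion in the projected 2D image plane. This paper aims to close that gap.

Result: The paper shows that the method generates high-fidelity, view-consistent videos that precisely follow the specified dynamics.

Insight: The novelty lies in a unified 4D Geometric Control representation (static background point cloud plus per-object 3D Gaussian trajectories), a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. The automatic data engine addresses the scarcity of 4D annotations and supports training on a large, diverse dataset.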

Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, because videos inherently express dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

[78] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering cs.CV | cs.CL

Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary

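TL;DR: This paper proposes a lightweight and explainable vision-language framework for crop disease visual question answering. It combines a Swin Transformer vision encoder with sequence-to-sequence language decoders and uses a two-stage training strategy to improve visual representation learning and cross-modal alignment. Evaluation on a large-scale crop disease dataset shows high accuracy for both crop and disease identification and strong results on natural language generation metrics.

Details

Motivation: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. The goal is a framework that is both lightweight and explainable.

Result: On a large-scale crop disease dataset, the models perform well on classification and on natural language generation metrics (BLEU, ROUGE, and BERTScore), outperforming large-scale vision-language baselines while using significantly fewer parameters.

Insight: Contributions include a lightweight design that combines a Swin Transformer with sequence-to-sequence decoders, a two-stage training strategy for visual representation and cross-modal alignment, and explainability through Grad-CAM and token-level attribution. The results indicate that task-specific visual pretraining is key to effective crop disease visual question answering.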

Abstract: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.

[79] Atlas 2 – Foundation models for clinical deployment cs.CV | cs.AI | cs.LG

Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel

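TL;DR: This paper introduces the Atlas 2 family of pathology vision foundation models (Atlas 2, Atlas 2-B, and Atlas 2-S), which target the trade-offs among performance, robustness, and computational requirements in existing models in order to ease clinical deployment. In a comprehensive evaluation across 80 public benchmarks, the models show state-of-the-art prediction performance, robustness, and resource efficiency.

Details

Motivation: Existing pathology foundation models involve trade-offs among performance, robustness, and computational requirements, which limits their practical deployment in clinical settings.

Result: In a comprehensive evaluation across 80 public benchmarks, the Atlas 2 models reach state-of-the-art prediction performance, robustness, and resource efficiency.

Insight: The novelty lies in training on a large-scale dataset (5.5 million histopathology whole slide images from three institutions) to address three key challenges of clinical deployment (performance, robustness, efficiency), and in providing models of different sizes (such as Atlas 2-B and Atlas 2-S) for different resource constraints.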

Abstract: Pathology foundation models have substantially advanced the possibilities in computational pathology, yet tradeoffs in terms of performance, robustness, and computational requirements have remained, which has limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models that address these shortcomings by achieving state-of-the-art prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date, comprising 5.5 million histopathology whole slide images collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.

[80] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering cs.CV | cs.AI

Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang

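TL;DR: This paper proposes Vision-Language Introspection (VLI), a training-free inference framework that mitigates object hallucination in multimodal large language models. VLI simulates a metacognitive self-correction process: it first performs Attributive Introspection to diagnose hallucination risk and localize causal visual anchors, then applies Interpretable Bi-Causal Steering to adjust inference dynamically, isolating visual evidence and calibrating model confidence.

Details

Motivation: Object hallucination in MLLMs stems from a fundamental failure of cognitive introspection: models blindly trust linguistic priors over specific visual evidence. Existing methods, such as contrastive decoding and static latent-vector steering, are limited because they cannot rectify internal semantic misalignments or lack instance-specific precision.

Result: VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

Insight: The novelty lies in a training-free, interpretable inference framework that diagnoses hallucination through probabilistic conflict detection and causal visual anchor localization, and actively modulates inference through dynamic bi-causal steering (isolating visual evidence and adaptive calibration), which is more precise and instance-specific than existing methods.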

Abstract: Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

[81] CoV: Chain-of-View Prompting for Spatial Reasoning cs.CV | cs.AI

Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen

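TL;DR: This paper proposes Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that improves the spatial reasoning of vision-language models on 3D embodied question answering. Through a coarse-to-fine exploration process, it turns a static VLM into an active viewpoint reasoner: it first filters redundant frames and identifies question-aligned anchor views, then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, gathering sufficient context from the underlying 3D scene representation.

Details

Motivation: In 3D embodied question answering, existing VLMs are limited to a fixed and finite set of input views, so they cannot acquire enough question-relevant context at inference time, which hinders complex spatial reasoning.

Result: On OpenEQA, CoV improves LLM-Match by an average of 11.56% across four mainstream VLMs, with a maximum gain of 13.62% on Qwen3-VL-Flash. Increasing the minimum action budget yields a further average improvement of 2.51% (peaking at 3.73% on Gemini-2.5-Flash). CoV also performs strongly on ScanQA and SQA3D (for example, 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D).

Insight: The main novelty is a model-agnostic, test-time reasoning framework that needs no additional training and combines question-aligned view selection with open-view search to improve 3D spatial reasoning. Its core idea is to turn a static model into an active explorer that collects scattered and possibly occluded context efficiently through iterative coarse selection and fine adjustment.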

Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision–language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
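The fine-grained stage described above, alternating reasoning with discrete camera actions until enough context is gathered or a step budget runs out, has the shape of a simple control loop. The sketch below shows only that loop. The three callbacks stand in for the VLM and the 3D scene renderer; all names and defaults are hypothetical rather than taken from the paper.

```python
def chain_of_view(anchor_views, propose_action, render_view, is_sufficient, step_budget=5):
    """Collect views until the context is judged sufficient or the budget is spent.

    anchor_views:   non-empty list of views from the coarse selection stage
    propose_action: views -> discrete camera action (stands in for the VLM)
    render_view:    (last_view, action) -> new view (stands in for the 3D scene)
    is_sufficient:  views -> bool (stands in for the VLM's stopping decision)
    """
    views = list(anchor_views)
    if not views:
        raise ValueError("need at least one anchor view")
    for _ in range(step_budget):
        if is_sufficient(views):
            break
        action = propose_action(views)
        views.append(render_view(views[-1], action))
    return views
```

In a toy setting where a view is just a camera yaw in degrees and every action turns 30 degrees right, starting from yaw 0 with a stopping rule of "some view reaches 90 degrees" collects the views 0, 30, 60, and 90 before stopping.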

[82] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice cs.CV

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu

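TL;DR: This paper proposes VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. Training follows a "Thinking Once, Answering Twice" paradigm: the model first generates an initial answer, then reasons, and finally outputs a reviewed answer, with both answers supervised via verifiable rewards. At inference, the model decides whether to reason based on the confidence of the initial answer. The method achieves state-of-the-art accuracy on video QA and grounding benchmarks with significantly better efficiency, reducing average response length by about 3.3x.

Details

Motivation: Chain-of-thought reasoning is powerful for multimodal LLMs on video understanding, but its necessity and advantages over direct answering remain underexplored. The paper first shows that for RL-trained video models, direct answering often matches or even surpasses CoT performance, even though CoT produces step-by-step analyses at a higher computational cost.

Result: On video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly better efficiency, reducing average response length by about 3.3x (for example, from 149 tokens to 44). Thinking mode is activated at a low rate on perception-oriented tasks and at a higher rate on reasoning-intensive tasks.

Insight: The novelty is an adaptive reason-when-necessary strategy: the "Thinking Once, Answering Twice" training paradigm combined with confidence-driven gating of reasoning maintains or improves performance while greatly reducing computation. The findings suggest that explicit language-based reasoning is generally beneficial but not always necessary, pointing to a new direction for efficient video understanding.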

Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
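The inference-time rule, reason only when the initial answer is not confident enough, amounts to a threshold gate. The sketch below assumes a scalar confidence in [0, 1] and a fixed threshold. How VideoAuto-R1 computes the confidence score and chooses the cutoff is not stated in the abstract, so the function name, the `threshold` default, and the `reason_fn` callback are hypothetical.

```python
def answer_with_optional_reasoning(initial_answer, confidence, reason_fn, threshold=0.9):
    """Return (final_answer, used_reasoning).

    If the initial answer is confident enough it is returned directly and no
    reasoning tokens are generated. Otherwise reason_fn (a stand-in for the
    model's think-then-review pass) produces the reviewed answer.
    """
    if confidence >= threshold:
        return initial_answer, False
    return reason_fn(initial_answer), True
```

Because the reasoning pass is skipped for confident answers, the average response length drops on perception-style questions while harder questions still receive the full reasoning trace.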

[83] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models cs.CV | cs.AI | cs.CL

William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh

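TL;DR: This paper studies prompt-induced hallucination in vision-language models (VLMs), where a model follows the text prompt and ignores visual evidence. In a controlled object-counting task where the prompt overstates the number of objects (for example, asking for a description of four waterlilies when the image contains only three), the authors find that as the object count grows, models increasingly conform to the prompt despite the discrepancy. Mechanistic analysis of three VLMs identifies a small set of attention heads whose ablation reduces prompt-induced hallucination (PIH) by at least 40% without additional training. These PIH-heads mediate prompt copying in model-specific ways, and ablating them increases correction toward visual evidence.

Details

Motivation: To investigate a common hallucination failure mode in large VLMs, over-reliance on the text prompt at the expense of visual evidence, with the aim of understanding its internal mechanisms and reducing it.

Result: On a controlled object-counting benchmark, ablating the identified attention heads (PIH-heads) reduces PIH by at least 40% without additional training and increases the model's correction toward visual evidence.

Insight: The novelty lies in using mechanistic analysis to identify the attention heads that drive prompt-induced hallucination and showing that ablating them mitigates it, while revealing model-specific differences in how VLMs implement these behaviors.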

Abstract: Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.

[84] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos cs.CVPDF

Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi

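TL;DR: ObjectForesight is a 3D object-centric dynamics model that predicts the future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. It represents the world explicitly in 3D at the object level, which yields geometrically grounded and temporally coherent predictions that capture object affordances and trajectories.

Details

Motivation: Humans easily anticipate how objects move or change through interaction. This paper aims to give computational systems a similar ability to predict plausible future object motion directly from passive visual observation.

Result: Extensive experiments show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation.

Insight: The novelty is an explicit, object-centric 3D dynamics model, in contrast to conventional world models that operate in pixel or latent space. The authors also use recent advances in segmentation, mesh reconstruction, and 3D pose estimation to build a large-scale dataset with pseudo-ground-truth 3D object trajectories for training.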

Abstract: Humans can effortlessly anticipate how objects might move or change through interaction: imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project page: objectforesight.github.io

[85] Plenoptic Video Generation cs.CVPDF

Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu

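TL;DR: This paper proposes PlenopticDreamer, a framework that addresses spatio-temporal consistency in multi-view video generation. It trains a multi-in-single-out video-conditioned model autoregressively and uses a camera-guided video retrieval strategy to adaptively select salient clips from previously generated videos as conditioning inputs. This synchronizes hallucinated content across generations and keeps multiple views spatio-temporally coherent.

Details

Motivation: Existing camera-controlled generative video re-rendering methods such as ReCamMaster perform well in single-view settings but struggle to stay consistent across multiple views. The inherent stochasticity of generative models makes spatio-temporal coherence in generated regions hard to maintain.

Result: Extensive experiments on the Basic and Agibot benchmarks show that PlenopticDreamer achieves state-of-the-art video re-rendering, with strong view synchronization, visual fidelity, camera-control accuracy, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation).

Insight: The core novelty is synchronizing generative hallucinations through camera-guided video retrieval and autoregressive training, which maintains spatio-temporal memory. In addition, progressive context scaling, self-conditioning, and long-video conditioning respectively improve convergence, robustness to long-range visual degradation caused by error accumulation, and support for extended video generation.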

Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/

[86] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation cs.CV | cs.AI | cs.ROPDF

Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia

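TL;DR: This paper proposes RoboVIP, which augments multi-view video generation with visual identity prompting to address the limited diversity, quantity, and quality of robot manipulation data. It uses a diffusion model with exemplar images as visual guidance to generate multi-view, temporally coherent manipulation data, which improves downstream policy models.

Details

Motivation: Hardware and physical setup constraints make large-scale real-world manipulation data hard to collect. Existing text-prompted image diffusion methods cannot reliably specify the scene setup and lack multi-view and temporal coherence, so they fall short of what state-of-the-art policy models need.

Result: Training vision-language-action and visuomotor policy models on the augmented data yields consistent gains in both simulation and real-robot settings. The abstract does not name specific benchmarks or SOTA comparisons.

Insight: The novelty is visual identity prompting, which supplies exemplar images as explicit visual guidance to generate more controllable and consistent multi-view video data. Visual conditioning compensates for the limits of text prompts and makes the data generation more practical and scalable.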

Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

[87] Pixel-Perfect Visual Geometry Estimation cs.CVPDF

Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang

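TL;DR: This paper presents pixel-perfect visual geometry models: PPD for monocular depth estimation and PPVD for video depth estimation. Both use pixel-space diffusion transformers. Semantics-prompted diffusion and a cascade architecture improve efficiency and accuracy, while semantics-consistent features and reference-guided token propagation keep video predictions temporally consistent, producing high-quality point clouds free of flying pixels.

Details

Motivation: Existing geometry foundation models suffer from severe flying pixels and loss of fine detail when recovering geometry from images, which hurts applications such as robotics and augmented reality. The paper addresses these problems with generative modeling in pixel space.

Result: The models achieve the best performance among generative monocular and video depth estimation models and produce significantly cleaner 3D point clouds than all other models.

Insight: The novelty lies in using semantic representations from vision foundation models as prompts for the pixel-space diffusion process to enhance detail, and in combining a cascade DiT architecture with a video-specific Semantics-Consistent DiT and reference-guided token propagation. Together these improve computational efficiency and temporal consistency while preserving quality.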

Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.

[88] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes cs.CVPDF

Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu

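TL;DR: This paper proposes RL-AWB, a framework that combines statistical methods with deep reinforcement learning for auto white balance in low-light nighttime scenes. It starts from a statistical algorithm designed for nighttime scenes, which performs salient gray pixel detection and illumination estimation. On top of it, the authors build the first deep reinforcement learning model for color constancy, which mimics professional tuning experts by dynamically optimizing parameters for each image. They also introduce the first multi-sensor nighttime dataset to support cross-sensor evaluation. Experiments show strong generalization across low-light and well-illuminated images.

Details

Motivation: Nighttime color constancy is hard in computational photography because of low-light noise and complex illumination. Existing methods perform poorly at night, so an auto white balance method that adapts to complex lighting and generalizes well is needed.

Result: Experiments, including evaluation on the newly introduced multi-sensor nighttime dataset, show superior generalization across low-light and well-illuminated images.

Insight: The novelty is the first combination of statistical methods with deep reinforcement learning for color constancy, together with the first multi-sensor nighttime dataset. The core idea is to let the statistical method supply a reliable prior and let reinforcement learning optimize dynamically on top of it, imitating an expert's tuning process. This offers a new approach to image enhancement under complex illumination.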

Abstract: Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/

[89] QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer cs.CVPDF

Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra, Vladislav Golyanik

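TL;DR: This paper proposes QNeRF, the first hybrid quantum-classical model for novel-view synthesis from 2D images. It uses parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, giving more compact models than classical NeRF. Two architectural variants are introduced, Full QNeRF and Dual-Branch QNeRF, and both are validated experimentally on images of moderate resolution.

Details

Motivation: The work combines the potential of Quantum Visual Fields (QVFs) for model compactness and convergence speed with the novel-view synthesis capability of Neural Radiance Fields (NeRFs). It targets the large model size and intensive training of classical NeRF and explores whether quantum machine learning is competitive on mid-level computer vision tasks.

Result: When trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters.

Insight: The novelty is the first application of a hybrid quantum-classical model to novel-view synthesis, with quantum-circuit encoding making the model more compact. Dual-Branch QNeRF adds a task-informed inductive bias by branching the spatial and view-dependent quantum state preparations, which reduces the complexity of that operation and supports scalability and potential hardware compatibility. The results position quantum machine learning as a promising alternative for continuous signal representation, such as 3D representation learning.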

Abstract: Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that – when trained on images of moderate resolution – QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.

[90] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video cs.CVPDF

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

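TL;DR: Mesh4D is a feed-forward model that reconstructs a 4D mesh (3D shape plus motion over time) of a dynamic object from monocular video. An encoder-decoder architecture learns a compact latent space that encodes the whole animation sequence in a single pass, and a latent diffusion model predicts the full animation. No skeletal information is needed at inference time, and the method performs strongly on reconstruction and novel view synthesis.

Details

Motivation: Reconstructing the complete 3D shape and motion of a dynamic object from monocular video (4D reconstruction) is challenging. Existing methods typically need multi-view input or struggle to represent complex deformations stably.

Result: Mesh4D outperforms prior methods on reconstruction and novel view synthesis benchmarks and reaches state-of-the-art accuracy in recovering 3D shape and deformation.

Insight: Contributions include: 1) a compact latent space learned with guidance from skeletal structure during training (not needed at inference), which provides priors on plausible deformations; 2) a spatio-temporal attention encoder that stably represents the overall deformation; 3) a latent diffusion model that predicts the full animation in one shot, enabling efficient 4D reconstruction.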

Abstract: We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.

cs.GR [Back]

[91] GenAI-DrawIO-Creator: A Framework for Automated Diagram Generation cs.GR | cs.CVPDF

Jinze Yu, Dayuan Jiang

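TL;DR: This paper proposes GenAI-DrawIO-Creator, a system that uses a large language model (Claude 3.7) to automatically generate and manipulate diagrams in the structured XML format used by draw.io. From natural language, code, or even images, it generates and updates accurate diagrams such as network architectures and flowcharts in real time.

Details

Motivation: Diagrams are crucial for communicating complex information, but creating and modifying them is still labor-intensive. The paper uses AI to automate this process and make diagramming more efficient.

Result: Simulated evaluations show that the approach significantly reduces diagram creation time and produces outputs with high structural fidelity. No comparison against a specific benchmark or SOTA is reported, but the results highlight its effectiveness on structured visual reasoning tasks.

Insight: The main novelty is the tight integration of an LLM (specifically Claude 3.7) with draw.io's XML format, using specialized prompt engineering and error checking to ensure well-formed XML. The system design supports real-time updates and lays groundwork for research on AI-assisted diagramming.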

Abstract: Diagrams are crucial for communicating complex information, yet creating and modifying them remains a labor-intensive task. We present GenAI-DrawIO-Creator, a novel framework that leverages Large Language Models (LLMs) to automate diagram generation and manipulation in the structured XML format used by draw.io. Our system integrates Claude 3.7 to reason about structured visual data and produce valid diagram representations. Key contributions include a high-level system design enabling real-time diagram updates, specialized prompt engineering and error-checking to ensure well-formed XML outputs. We demonstrate a working prototype capable of generating accurate diagrams (such as network architectures and flowcharts) from natural language or code, and even replicating diagrams from images. Simulated evaluations show that our approach significantly reduces diagram creation time and produces outputs with high structural fidelity. Our results highlight the promise of Claude 3.7 in handling structured visual reasoning tasks and lay the groundwork for future research in AI-assisted diagramming applications.

cs.AI [Back]

[92] SAGE-32B: Agentic Reasoning via Iterative Distillation cs.AI | cs.CL | cs.LGPDF

Basab Jha, Firoj Paudel, Ujjwal Puri, Ethan Henkel, Zhang Yuting

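TL;DR: SAGE-32B is a 32-billion-parameter language model focused on agentic reasoning and long-range planning. Initialized from the Qwen2.5-32B pretrained model and fine-tuned with a two-stage iterative distillation procedure, it targets task decomposition, tool use, and error recovery. The model performs well on several agentic reasoning benchmarks, and its weights are publicly released.

Details

Motivation: General-purpose chat models underperform in agentic loops (task decomposition, tool use, and error recovery). The goal is a model optimized specifically for agentic reasoning and long-range planning.

Result: On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi-tool usage scenarios than similarly sized baselines while remaining competitive on standard reasoning evaluations.

Insight: Main contributions: 1) Iterative Distillation, a two-stage training process that improves reasoning through rigorously tested feedback loops; 2) an inverse reasoning approach that uses a meta-cognition head to forecast potential failures in the planning process before execution, improving robustness and foresight. Both offer ideas for building more reliable agent systems.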

Abstract: We present SAGE-32B, a 32-billion-parameter language model that focuses on agentic reasoning and long-range planning tasks. Unlike chat models that aim for general conversational fluency, SAGE-32B is designed to operate in an agentic loop, emphasizing task decomposition, tool usage, and error recovery. The model is initialized from the Qwen2.5-32B pretrained model and fine-tuned using Iterative Distillation, a two-stage training process that improves reasoning performance through rigorously tested feedback loops. SAGE-32B also introduces an inverse reasoning approach, which uses a meta-cognition head to forecast potential failures in the planning process before execution. On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi-tool usage scenarios compared to similarly sized baseline models, while remaining competitive on standard reasoning evaluations. Model weights are publicly released at https://huggingface.co/sagea-ai/sage-reasoning-32b

[93] Learning Latent Action World Models In The Wild cs.AI | cs.CVPDF

Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun

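TL;DR: This paper proposes a method for learning latent action world models from in-the-wild videos. It removes the dependence of conventional world models on action labels and extends prior work that was limited to simple simulated environments.

Details

Motivation: Learning action-prediction models from real-world videos is hard because large-scale action labels are lacking and the videos are diverse (environmental noise, no common embodiment).

Result: Experiments show that continuous but constrained latent actions capture the complex actions in in-the-wild videos better than the common vector quantization. By training a controller that maps known actions to latent ones, the approach reaches planning performance comparable to action-conditioned baselines.

Insight: The novelty is a latent action learning framework suited to in-the-wild videos. By discussing the properties actions should follow, the relevant architectural choices, and evaluation methods, it learns an action space without action labels and provides a basis for scaling latent action models to the real world.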

Abstract: Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance to action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

[94] The Language of Bargaining: Linguistic Effects in LLM Negotiations cs.AI | cs.CL | cs.GTPDF

Stuti Sinha, Himanshu Kumar, Aryan Raju Mandapati, Rakshit Sakhuja, Dhruv Kumar

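TL;DR: Using multi-agent simulations, this paper studies how language choice affects large language models (LLMs) on negotiation tasks. Across Ultimatum, Buy-Sell, and Resource Exchange games, it compares outcomes in English and four Indic languages (Hindi, Punjabi, Gujarati, Marwadi). Language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus, and the effect is task-dependent.

Details

Motivation: Existing work evaluates LLM negotiation ability mainly in English, which may yield incomplete or even misleading conclusions. This paper systematically examines how language itself, especially non-English languages, affects LLM negotiation behavior and outcomes.

Result: With game rules, model parameters, and incentives held constant, the experiments find that: 1) language choice significantly changes negotiation outcomes, and the effect can be stronger than changing models; 2) in distributive games, Indic languages reduce stability; 3) in integrative games, Indic languages induce richer exploration.

Insight: The novelty is the first systematic isolation and quantification of the effect of language (rather than cultural content) on multi-agent LLM negotiation, which exposes the limits of English-only evaluation. The central insight is that fair deployment of LLMs requires culturally aware evaluation that includes language, not a single English-centric evaluation.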

Abstract: Negotiation is a core component of social intelligence, requiring agents to balance strategic reasoning, cooperation, and social norms. Recent work shows that LLMs can engage in multi-turn negotiation, yet nearly all evaluations occur exclusively in English. Using controlled multi-agent simulations across Ultimatum, Buy-Sell, and Resource Exchange games, we systematically isolate language effects across English and four Indic framings (Hindi, Punjabi, Gujarati, Marwadi) by holding game rules, model parameters, and incentives constant across all conditions. We find that language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus. Crucially, effects are task-contingent: Indic languages reduce stability in distributive games yet induce richer exploration in integrative settings. Our results demonstrate that evaluating LLM negotiation solely in English yields incomplete and potentially misleading conclusions. These findings caution against English-only evaluation of LLMs and suggest that culturally-aware evaluation is essential for fair deployment.

[95] CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts cs.AI | cs.CL | eess.SYPDF

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

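TL;DR: CircuitLM is a multi-agent, LLM-aided circuit design framework that turns a user's natural-language description of a circuit into a structured, machine-readable CircuitJSON schematic through five sequential stages: component identification, pinout retrieval, expert reasoning, JSON synthesis, and SVG visualization.

Details

Motivation: When generating schematics from natural-language descriptions, existing LLMs often hallucinate details, violate electrical constraints, and emit output that is not machine-readable. CircuitLM aims to give non-experts a reliable circuit prototyping tool.

Result: The system is evaluated on 100 diverse embedded-systems prompts across six LLMs, using the hybrid DMCV framework to assess structural and electrical validity. It achieves high fidelity on microcontroller-centric designs.

Insight: The novelty is a five-stage multi-agent pipeline anchored in a verified, dynamically extensible component knowledge base, with the DMCV evaluation framework checking the safety and accuracy of the generated results. This bridges natural-language input and deployable hardware designs.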

Abstract: Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronics design, as large language models (LLMs) frequently hallucinate in granular details, violate electrical constraints, and produce non-machine-readable outputs. We present CircuitLM, a novel multi-agent LLM-aided circuit design pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics through five sequential stages: (i) LLM-based component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning by an electronics expert agent, (iv) JSON schematic synthesis, and (v) force-directed SVG visualization. While LLMs often violate electrical constraints, CircuitLM bridges this gap by grounding generation in a curated, embedding-powered component knowledge base that is verified and dynamically extensible, initially comprising 50 components. To ensure safety, we incorporate a hybrid evaluation framework, namely Dual-Metric Circuit Validation (DMCV), validated against human-expert assessments, which achieves high fidelity in microcontroller-centric designs. We evaluate the system on 100 diverse embedded-systems prompts across six LLMs and introduce DMCV to assess both structural and electrical validity. This work bridges natural language input to deployable hardware designs, enabling reliable circuit prototyping by non-experts. Our code and data will be made public upon acceptance.

[96] BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents cs.AI | cs.CLPDF

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo

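TL;DR: This paper proposes BackdoorAgent, a framework for unified analysis and evaluation of backdoor attacks on LLM-based agent workflows. It divides the attack surface into three stages (planning, memory, and tool use) and builds a standardized benchmark covering four representative agent applications (Agent QA, Agent Code, Agent Web, Agent Drive). Empirical analysis shows that a trigger implanted at a single stage can persist across multiple steps and propagate through intermediate states, exposing the vulnerability of the agent workflow itself to backdoor threats.

Details

Motivation: Existing analyses of backdoor attacks on LLM agents are fragmented and typically examine individual attack vectors in isolation, so the cross-stage interaction and propagation of triggers is poorly understood from an agent-centric perspective. The paper fills this gap with a unified, modular, stage-aware framework for systematically analyzing backdoor threats in agent workflows.

Result: On the standardized benchmark with a GPT-based backbone, triggers persist in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-use attacks, which quantifies the persistence and propagation of backdoor triggers in agent workflows.

Insight: The novelty is an agent-centric, unified, modular framework (BackdoorAgent) that structures the attack surface into three functional stages of the agent workflow, plus a standardized benchmark covering language-only and multimodal settings. The study moves backdoor analysis from the isolated-model level to the multi-step, cross-stage agent workflow level, giving a systematic perspective and tooling for evaluating agent security.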

Abstract: Large language model (LLM) agents execute tasks through multi-step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross-stage interaction and propagation of backdoor triggers poorly understood from an agent-centric perspective. To fill this gap, we propose BackdoorAgent, a modular and stage-aware framework that provides a unified, agent-centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface into three functional stages of agentic workflows, including planning attacks, memory attacks, and tool-use attacks, and instruments agent execution to enable systematic analysis of trigger activation and propagation across different stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: Agent QA, Agent Code, Agent Web, and Agent Drive, covering both language-only and multimodal settings. Our empirical analysis shows that triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. For instance, when using a GPT-based backbone, we observe trigger persistence in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-stage attacks, highlighting the vulnerabilities of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available on GitHub.

[97] Neurosymbolic Retrievers for Retrieval-augmented Generation cs.AI | cs.CL | cs.IR | cs.LGPDF

Yash Saxena, Manas Gaur

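TL;DR: This paper proposes Neurosymbolic Retrieval-Augmented Generation (Neurosymbolic RAG), a framework that combines symbolic reasoning over a knowledge graph with neural retrieval. It targets the poor interpretability, difficult debugging, and lack of trust that arise because the retrieval, re-ranking, and generation components of conventional RAG systems reason opaquely.

Details

Motivation: Conventional RAG systems consist of three neural components (retriever, re-ranker, and generator) whose internal reasoning is opaque. In high-stakes domains such as mental health assessment, this seriously undermines interpretability, debuggability, and trust.

Result: Preliminary results on mental health risk assessment tasks indicate that the neurosymbolic approach improves overall performance as well as transparency.

Insight: The novelty is the explicit integration of symbolic knowledge (such as knowledge-graph paths and domain workflows) into neural retrieval. Three methods are proposed: MAR (modulating query embeddings with interpretable symbolic features), KG-Path RAG (enhancing queries by traversing a knowledge graph), and Process Knowledge-infused RAG (reordering retrieved content based on validated workflows). Together they give document selection a clear, interpretable basis and make the retrieval process easier to follow.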

Abstract: Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency. However, traditional RAG systems consist of three interconnected neural components - the retriever, re-ranker, and generator - whose internal reasoning processes remain opaque. This lack of transparency complicates interpretability, hinders debugging efforts, and erodes trust, especially in high-stakes domains where clear decision-making is essential. To address these challenges, we introduce the concept of Neurosymbolic RAG, which integrates symbolic reasoning using a knowledge graph with neural retrieval techniques. This new framework aims to answer two primary questions: (a) Can retrievers provide a clear and interpretable basis for document selection? (b) Can symbolic knowledge enhance the clarity of the retrieval process? We propose three methods to improve this integration. First, MAR (Knowledge Modulation Aligned Retrieval) employs modulation networks to refine query embeddings using interpretable symbolic features, thereby making document matching more explicit. Second, KG-Path RAG enhances queries by traversing knowledge graphs to improve overall retrieval quality and interpretability. Lastly, Process Knowledge-infused RAG utilizes domain-specific tools to reorder retrieved content based on validated workflows. Preliminary results from mental health risk assessment tasks indicate that this neurosymbolic approach enhances both transparency and overall performance.

[98] A Method for Constructing a Digital Transformation Driving Mechanism Based on Semantic Understanding of Large Models cs.AI | cs.CLPDF

Huayi Liu

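TL;DR: This study proposes a method that combines large language models (LLMs) with a knowledge graph to build a driving mechanism for digital transformation. First, a fine-tuned BERT model performs entity recognition and relation extraction, and GPT-4 generates semantically enhanced vector representations. Second, a two-layer graph neural network (GNN) architecture fuses the LLM's semantic vectors with business metadata to construct a dynamic, scalable enterprise knowledge graph. Third, reinforcement learning optimizes decision path generation, with a reward function driving iteration of the mechanism. In a manufacturing case study, the method markedly improves the intelligence and execution efficiency of the digital transformation mechanism.

Details

Motivation: During digital transformation, enterprises face insufficient semantic understanding of unstructured data and driving mechanisms that lack an intelligent basis for decisions.

Result: In the manufacturing case study, the method reduces response time for equipment failure scenarios from 7.8 hours to 3.7 hours, reaches an F1 value of 94.3%, and lowers the compensation for decision errors in the annual digital transformation cost by 45.3%.

Insight: The novelty is combining the semantic understanding of large models (BERT, GPT-4) with a structured knowledge graph built through a GNN, and adding reinforcement learning for decision optimization, which yields a dynamic, iterable framework for an intelligent driving mechanism.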

Abstract: In the process of digital transformation, enterprises are faced with problems such as insufficient semantic understanding of unstructured data and lack of intelligent decision-making basis in driving mechanisms. This study proposes a method that combines a large language model (LLM) and a knowledge graph. First, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model is used to perform entity recognition and relationship extraction on multi-source heterogeneous texts, and GPT-4 is used to generate semantically enhanced vector representations; secondly, a two-layer graph neural network (GNN) architecture is designed to fuse the semantic vectors output by LLM with business metadata to construct a dynamic and scalable enterprise knowledge graph; then reinforcement learning is introduced to optimize decision path generation, and the reward function is used to drive the mechanism iteration. In the case of the manufacturing industry, this mechanism reduced the response time for equipment failure scenarios from 7.8 hours to 3.7 hours, the F1 value reached 94.3%, and the compensation for decision errors in the annual digital transformation cost decreased by 45.3%. This method significantly enhances the intelligence level and execution efficiency of the digital transformation driving mechanism by integrating large model semantic understanding with structured knowledge.

[99] TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning cs.AI | cs.CL | cs.LGPDF

Yinuo Wang, Mining Tan, Wenxiang Jiao, Xiaoxi Li, Hao Wang

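TL;DR: This paper proposes TourPlanner, a framework for the complex decision-making involved in travel planning. It builds a candidate set of points of interest with Personalized Recall and Spatial Optimization (PReSO), explores the feasible solution space through multi-path reasoning with Competitive consensus Chain-of-Thought (CCoT), and adds a sigmoid-based gating mechanism to the reinforcement learning stage so that hard constraints are satisfied first and soft constraints are optimized afterward. Experiments show state-of-the-art performance on travel planning benchmarks.

Details

Motivation: Existing travel planning methods face three challenges: pruning candidate points of interest (POIs) effectively while keeping recall high; limited exploration of the feasible solution space under single-path reasoning; and the difficulty of optimizing hard and soft constraints at the same time. TourPlanner addresses these with multi-path reasoning and constraint-gated reinforcement learning.

Result: Experiments on travel planning benchmarks show that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in itinerary feasibility and user-preference alignment.

Insight: Contributions include: 1) the PReSO workflow for building a spatially aware candidate POI set; 2) CCoT, a multi-path reasoning paradigm that strengthens exploration of the feasible solution space; 3) a sigmoid-based gating mechanism in reinforcement learning that optimizes soft constraints dynamically once hard constraints are prioritized and met. These ideas offer a reusable framework design for sequential decision problems under complex constraints.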

Abstract: Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) pruning candidate points of interest (POIs) while maintaining a high recall rate is difficult; (2) a single reasoning path restricts the exploration capability within the feasible solution space for travel planning; (3) simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct a spatially aware candidate POI set. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability to explore the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.
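
One way to picture a sigmoid-based gate that admits the soft-constraint reward only once hard constraints are met is sketched below. The abstract does not give the reward formula, so this is an assumed form: the function name `gated_reward`, the scores in [0, 1], and the `threshold` and `sharpness` parameters are hypothetical.

```python
import math


def gated_reward(hard_score, soft_score, threshold=0.9, sharpness=50.0):
    """Combine hard- and soft-constraint scores (each assumed to lie in [0, 1]).

    The sigmoid gate stays near 0 while hard_score is below `threshold`,
    so the soft-constraint term is effectively ignored; it approaches 1
    once the hard constraints are (almost) fully satisfied.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (hard_score - threshold)))
    return hard_score + gate * soft_score
```

With these assumed defaults, a plan that satisfies only half of the hard constraints earns almost nothing from its soft-constraint score, while a fully feasible plan receives nearly the full soft-constraint bonus. The gate is smooth, so the reward stays differentiable in the scores, unlike a hard if/else switch.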

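[100] Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning cs.AI | cs.CL

Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, Zhicheng Dou

TL;DR: This paper proposes CompassMem, an event-centric memory framework that addresses the limits of memory organization and retrieval for intelligent agents in long-horizon scenarios. Inspired by Event Segmentation Theory, it incrementally segments experiences into events and links them through explicit logical relations into an event graph, which serves as a logic map for structured, goal-directed memory navigation and reasoning.

Details

Motivation: Memory in existing LLM-based agents is usually organized flatly and accessed through simple similarity-based retrieval. This makes it hard to capture logical relations between experiences explicitly, so memory access is disconnected from reasoning in long-horizon tasks.

Result: Experiments on the LoCoMo and NarrativeQA benchmarks show that CompassMem consistently improves retrieval and reasoning performance across multiple backbone models.

Insight: The novelty lies in combining Event Segmentation Theory with a graph structure: an event graph with explicit logical relations acts as a logic map for memory, enabling structured, goal-directed navigation beyond shallow semantic retrieval and supporting logical reasoning over long-horizon dependencies.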

Abstract: Large language models (LLMs) are increasingly deployed as intelligent agents that reason, plan, and interact with their environments. To effectively scale to long-horizon scenarios, a key capability for such agents is a memory mechanism that can retain, organize, and retrieve past experiences to support downstream decision-making. However, most existing approaches organize and store memories in a flat manner and rely on simple similarity-based retrieval techniques. Even when structured memory is introduced, existing methods often struggle to explicitly capture the logical relationships among experiences or memory units. Moreover, memory access is largely detached from the constructed structure and still depends on shallow semantic retrieval, preventing agents from reasoning logically over long-horizon dependencies. In this work, we propose CompassMem, an event-centric memory framework inspired by Event Segmentation Theory. CompassMem organizes memory as an Event Graph by incrementally segmenting experiences into events and linking them through explicit logical relations. This graph serves as a logic map, enabling agents to perform structured and goal-directed navigation over memory beyond superficial retrieval, progressively gathering valuable memories to support long-horizon reasoning. Experiments on LoCoMo and NarrativeQA demonstrate that CompassMem consistently improves both retrieval and reasoning performance across multiple backbone models.

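[101] Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models cs.AI | cs.CL

Shuyang Jiang, Yuhao Wang, Ya Zhang, Yanfeng Wang, Yu Wang

TL;DR: This paper proposes Miner, a method that tackles the poor data efficiency of critic-free reinforcement learning for large reasoning models. It mines the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or extra inference cost. Miner introduces two key innovations: token-level focal credit assignment and adaptive advantage calibration. It achieves state-of-the-art performance on six reasoning benchmarks with Qwen3-4B and Qwen3-8B base models.

Details

Motivation: Current critic-free RL methods for large reasoning models are highly inefficient on positive homogeneous prompts (where all rollouts are correct): the advantage estimates are zero and the rollouts are wasted. The paper targets this data-efficiency problem.

Result: On six reasoning benchmarks with Qwen3-4B and Qwen3-8B base models, Miner improves over GRPO by up to 4.58 absolute points in Pass@1 and up to 6.66 in Pass@K, reaching state-of-the-art performance.

Insight: The contributions are: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens and suppresses overconfident ones; (2) adaptive advantage calibration that seamlessly integrates intrinsic and verifiable rewards. The core claim is that exploiting latent uncertainty is both necessary and sufficient for efficient, scalable RL training of reasoning models.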

Abstract: Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to Mine intrinsic mastery (Miner), that repurposes the policy’s intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner outperforms the other four algorithms, yielding up to 4.58 absolute gains in Pass@1 and 6.66 gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models.
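
The zero-advantage problem on positive homogeneous prompts can be seen in a small sketch of a GRPO-style group-normalized advantage. The exact normalization used in the paper is not stated in the abstract; the mean/standard-deviation form and the `eps` constant below are assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps).

    This is a common GRPO-style formulation, used here only to illustrate
    why a group of identical rewards carries no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All rollouts correct: every advantage is zero, so the group is wasted.
homogeneous = group_advantages([1, 1, 1, 1])
# Mixed outcomes: correct rollouts get positive advantage, incorrect negative.
mixed = group_advantages([1, 0, 1, 0])
```

Miner's intrinsic-uncertainty reward is meant to supply a signal in exactly the homogeneous case, where the verifiable reward alone yields none.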

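[102] AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search cs.AI | cs.CL

Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou

TL;DR: This paper proposes AT$^2$PO, a tree-search-based, turn-level policy optimization framework for agents. It targets three core challenges of agentic reinforcement learning on multi-turn tasks: limited exploration diversity, sparse credit assignment, and misaligned policy optimization.

Details

Motivation: Existing agentic RL methods for multi-turn tasks explore too little, struggle to assign sparse rewards, and update the policy at a granularity that does not match the decision process. A unified framework is needed to address all three.

Result: Across seven benchmarks, AT$^2$PO improves on the state-of-the-art baseline by up to 1.84 percentage points on average, and ablation studies confirm the contribution of each component.

Insight: The novelty includes a turn-level tree structure that enables entropy-guided tree expansion and turn-wise credit assignment, plus a turn-based policy optimization objective aligned with the decision granularity of agentic interaction. This objective is independent of tree search and can be integrated into other multi-turn RL pipelines.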

Abstract: LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization (ATPO), a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.

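[103] DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation cs.AI | cs.CL

Guanzhi Deng, Bo Li, Ronghao Chen, Huacan Wang, Linqi Song

TL;DR: This paper proposes DR-LoRA, a dynamic-rank LoRA framework for efficient fine-tuning of Mixture-of-Experts (MoE) large language models. An expert saliency scoring mechanism adjusts each expert's LoRA rank according to task demand, producing a heterogeneous rank distribution that delivers better performance and parameter utilization under the same parameter budget.

Details

Motivation: Existing methods assign the same LoRA rank to every expert in an MoE model, ignoring functional specialization among experts. The result is a resource mismatch: task-relevant experts are under-provisioned while less relevant experts receive redundant parameters.

Result: Experiments on multiple benchmarks show that, under the same parameter budget, DR-LoRA consistently outperforms standard LoRA and static allocation strategies and achieves better task performance.

Insight: The novelty is a dynamic, heterogeneous rank allocation mechanism that quantifies each expert's demand by combining expert routing frequency with LoRA rank importance, so parameters are allocated adaptively to the task and fine-tuning becomes more efficient.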

Abstract: Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning (PEFT), such as LoRA, is widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches assign identical LoRA ranks to all experts, overlooking the intrinsic functional specialization within MoE LLMs. This uniform allocation leads to resource mismatch: task-relevant experts are under-provisioned while less relevant ones receive redundant parameters. We propose a Dynamic Rank LoRA framework named DR-LoRA, which dynamically grows expert LoRA ranks during fine-tuning based on task-specific demands. DR-LoRA employs an Expert Saliency Scoring mechanism that integrates expert routing frequency and LoRA rank importance to quantify each expert’s demand for additional capacity. Experts with higher saliency scores are prioritized for rank expansion, enabling the automatic formation of a heterogeneous rank distribution tailored to the target task. Experiments on multiple benchmarks demonstrate that DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving superior task performance with more efficient parameter utilization.
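
A minimal sketch of saliency-driven rank growth follows. The abstract says only that the score "integrates" routing frequency and rank importance, so the product used here, the fixed growth `step`, and the budget expressed as a sum of ranks are illustrative assumptions rather than the paper's procedure.

```python
def grow_ranks(ranks, routing_freq, rank_importance, budget, step=2):
    """Grow the LoRA rank of the most salient experts within a total-rank budget.

    Hypothetical scoring: saliency = routing frequency * rank importance.
    Experts are visited in descending saliency and each receives `step`
    extra rank until the budget would be exceeded.
    """
    ranks = list(ranks)
    saliency = [f * s for f, s in zip(routing_freq, rank_importance)]
    order = sorted(range(len(ranks)), key=lambda i: saliency[i], reverse=True)
    for i in order:
        if sum(ranks) + step > budget:
            break
        ranks[i] += step
    return ranks

# Four experts starting at rank 4, with a total budget of 20.
new_ranks = grow_ranks([4, 4, 4, 4],
                       routing_freq=[0.5, 0.1, 0.3, 0.1],
                       rank_importance=[1.0, 1.0, 2.0, 0.5],
                       budget=20)
```

In this toy run the two most salient experts grow to rank 6 while the others stay at 4, which is the kind of heterogeneous distribution the abstract describes.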

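[104] Higher-Order Knowledge Representations for Agentic Scientific Reasoning cs.AI | cond-mat.mtrl-sci | cs.CL | cs.LG

Isabella A. Stewart, Markus J. Buehler

TL;DR: This paper proposes a hypergraph-based, higher-order knowledge representation to support agentic scientific reasoning systems. Hypergraphs encode multi-entity relationships and avoid the pairwise constraint of traditional knowledge graphs. Applied to about 1,100 papers on biocomposite scaffolds, the method builds a global hypergraph of 161,172 nodes and 320,201 hyperedges with a scale-free topology. Equipped with hypergraph traversal tools, an agentic system can bridge semantically distant concepts and generate mechanistic hypotheses for novel composite materials.

Details

Motivation: Scientific inquiry requires integrating heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence. Large language models rely on retrieval-augmented contexts that lack structural depth, and the pairwise constraint of traditional knowledge graphs cannot capture the irreducible higher-order interactions that govern emergent physical behavior.

Result: On a corpus of about 1,100 biocomposite scaffold papers, the framework builds a global hypergraph of 161,172 nodes and 320,201 hyperedges with a scale-free topology (power-law exponent about 1.23). With hypergraph traversal tools, the system generates mechanistic hypotheses for novel composites, such as linking cerium oxide to PCL scaffolds via chitosan intermediates.

Insight: The novelty is the use of hypergraphs as a higher-order knowledge representation that encodes multi-entity relationships, avoids the combinatorial explosion of pairwise expansion, and preserves the co-occurrence context of scientific formulations. Hypergraph topology acts as a verifiable guardrail for a "teacherless" agentic reasoning system that accelerates scientific discovery.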

Abstract: Scientific inquiry requires systems-level reasoning that integrates heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence into coherent explanations. While Large Language Models (LLMs) offer inferential capabilities, they often depend on retrieval-augmented contexts that lack structural depth. Traditional Knowledge Graphs (KGs) attempt to bridge this gap, yet their pairwise constraints fail to capture the irreducible higher-order interactions that govern emergent physical behavior. To address this, we introduce a methodology for constructing hypergraph-based knowledge representations that faithfully encode multi-entity relationships. Applied to a corpus of ~1,100 manuscripts on biocomposite scaffolds, our framework constructs a global hypergraph of 161,172 nodes and 320,201 hyperedges, revealing a scale-free topology (power law exponent ~1.23) organized around highly connected conceptual hubs. This representation prevents the combinatorial explosion typical of pairwise expansions and explicitly preserves the co-occurrence context of scientific formulations. We further demonstrate that equipping agentic systems with hypergraph traversal tools, specifically using node-intersection constraints, enables them to bridge semantically distant concepts. By exploiting these higher-order pathways, the system successfully generates grounded mechanistic hypotheses for novel composite materials, such as linking cerium oxide to PCL scaffolds via chitosan intermediates. This work establishes a “teacherless” agentic reasoning system where hypergraph topology acts as a verifiable guardrail, accelerating scientific discovery by uncovering relationships obscured by traditional graph methods.
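
One way to read "hypergraph traversal with node-intersection constraints" is a breadth-first search over hyperedges in which two hyperedges are adjacent only if they share at least a given number of nodes. The sketch below follows that reading. The function name, the `min_shared` parameter, and the three toy hyperedges are assumptions for illustration and are not taken from the paper's data.

```python
from collections import deque

def hyperedge_path(hyperedges, source, target, min_shared=1):
    """Return indices of hyperedges linking `source` to `target`, or None.

    Two hyperedges count as adjacent only when they share at least
    `min_shared` nodes (the assumed node-intersection constraint).
    """
    starts = [i for i, e in enumerate(hyperedges) if source in e]
    parent = {i: None for i in starts}
    queue = deque(starts)
    while queue:
        i = queue.popleft()
        if target in hyperedges[i]:
            path = []
            while i is not None:
                path.append(i)
                i = parent[i]
            return path[::-1]
        for j, e in enumerate(hyperedges):
            if j not in parent and len(hyperedges[i] & e) >= min_shared:
                parent[j] = i
                queue.append(j)
    return None

# Toy hyperedges (invented): each set is one multi-entity statement.
edges = [
    frozenset({"cerium oxide", "chitosan", "antioxidant"}),
    frozenset({"chitosan", "PCL", "scaffold"}),
    frozenset({"PCL", "electrospinning"}),
]
path = hyperedge_path(edges, "cerium oxide", "PCL")
```

Here the path passes through the hyperedge containing chitosan, mirroring the cerium oxide to PCL example in the abstract. Raising `min_shared` to 2 removes the link, since the first two hyperedges share only one node.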

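[105] ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning cs.AI | cs.CL

Minda Hu, Zexuan Qiu, Zenan Xu, Kun Li, Bo Zhou

TL;DR: This paper proposes ConMax, a reinforcement learning framework that automatically compresses the Chain-of-Thought (CoT) reasoning traces of Large Reasoning Models (LRMs). It targets "overthinking" by cutting redundant computation while preserving accuracy.

Details

Motivation: Extensive CoT generation is essential for complex cognitive behaviors such as self-verification and backtracking, but it often leads to "overthinking": redundant reasoning paths that raise computational cost without improving accuracy. Existing compression techniques, when applied to reasoning traces, tend to damage logical coherence or incur prohibitive sampling costs.

Result: Extensive experiments on five reasoning datasets show that ConMax achieves a superior efficiency-performance trade-off. It reduces inference length by 43% relative to strong baselines with only a 0.7% drop in accuracy, demonstrating its value for generating high-quality, efficient training data for LRMs.

Insight: The novelty is the formulation of compression as a reward-driven optimization problem. A policy is trained to prune redundancy by maximizing a weighted combination of answer confidence (for predictive fidelity) and thinking confidence (for reasoning validity), both evaluated by a frozen auxiliary LRM. This offers a way to compress reasoning traces automatically while keeping them logically coherent.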

Abstract: Recent breakthroughs in Large Reasoning Models (LRMs) have demonstrated that extensive Chain-of-Thought (CoT) generation is critical for enabling intricate cognitive behaviors, such as self-verification and backtracking, to solve complex tasks. However, this capability often leads to “overthinking”, where models generate redundant reasoning paths that inflate computational costs without improving accuracy. While Supervised Fine-Tuning (SFT) on reasoning traces is a standard paradigm for the “cold start” phase, applying existing compression techniques to these traces often compromises logical coherence or incurs prohibitive sampling costs. In this paper, we introduce ConMax (Confidence-Maximizing Compression), a novel reinforcement learning framework designed to automatically compress reasoning traces while preserving essential reasoning patterns. ConMax formulates compression as a reward-driven optimization problem, training a policy to prune redundancy by maximizing a weighted combination of answer confidence for predictive fidelity and thinking confidence for reasoning validity through a frozen auxiliary LRM. Extensive experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off. Specifically, it reduces inference length by 43% over strong baselines at the cost of a mere 0.7% dip in accuracy, proving its effectiveness in generating high-quality, efficient training data for LRMs.
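
The weighted confidence reward can be sketched as follows. The abstract does not define how confidence is computed, so this sketch assumes it is the geometric-mean token probability under the frozen auxiliary model, derived from token log-probabilities. The function names and the `weight` parameter are hypothetical.

```python
import math

def mean_confidence(token_logprobs):
    """Geometric-mean token probability, computed from token log-probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def compression_reward(answer_logprobs, thinking_logprobs, weight=0.5):
    """Weighted mix of answer confidence and thinking confidence.

    Both log-probability lists are assumed to come from a frozen auxiliary
    model that scores the compressed trace; `weight` trades one for the other.
    """
    return (weight * mean_confidence(answer_logprobs)
            + (1.0 - weight) * mean_confidence(thinking_logprobs))
```

Under this assumed form, a compressed trace that keeps the frozen model confident in both the answer and the remaining reasoning scores close to 1, and pruning that makes either part less predictable lowers the reward.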

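[106] Reinforced Efficient Reasoning via Semantically Diverse Exploration cs.AI | cs.CL

Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu

TL;DR: This paper proposes ROSE, a reinforcement learning method that strengthens LLM reasoning through semantically diverse exploration. It combines a semantic-entropy-based branching strategy with an ε-exploration mechanism to diversify reasoning exploration, and uses a length-aware segment-level advantage estimator to improve reasoning efficiency.

Details

Motivation: Existing RL methods based on Monte Carlo Tree Search still suffer from limited exploration diversity and inefficient reasoning when used to enhance language model reasoning.

Result: Extensive experiments on multiple mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE.

Insight: The novelty includes using semantic entropy to quantify the semantic uncertainty of reasoning paths and guide branch selection, and a length-aware reward that encourages concise, correct reasoning chains. Together these give a better balance between exploration diversity and reasoning efficiency.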

Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained and segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at https://github.com/ZiqiZhao1/ROSE-rl.
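
Semantic entropy is commonly computed by grouping sampled continuations into meaning-equivalent clusters and taking the entropy of the cluster distribution. The sketch below assumes the cluster labels are already available, since the abstract does not say how ROSE obtains them. The branching-point selection at the end is likewise an illustrative assumption.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the distribution over semantic clusters."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical cluster labels of continuations sampled at three candidate
# branching points; the point with the highest entropy is chosen to branch.
candidates = {
    "step_1": ["a", "a", "a", "a"],   # all continuations agree
    "step_2": ["a", "a", "b", "b"],   # two competing meanings
    "step_3": ["a", "b", "c", "d"],   # maximal divergence
}
branch_at = max(candidates, key=lambda k: semantic_entropy(candidates[k]))
```

A point where all continuations mean the same thing has zero entropy and offers little to gain from branching, whereas high entropy marks a step where the rollouts diverge semantically.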

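[107] Token-Level LLM Collaboration via FusionRoute cs.AI | cs.CL | cs.LG

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang

TL;DR: This paper proposes FusionRoute, a lightweight token-level framework for multi-LLM collaboration. At each decoding step a router selects an expert model and also produces a complementary logit that refines the next-token distribution, addressing the tension between costly general-purpose models and poorly generalizing specialized models.

Details

Motivation: A single general-purpose model is too expensive to train and deploy, while small specialized models generalize poorly. The goal is multi-model collaboration that is both efficient and strong.

Result: With the Llama-3 and Gemma-2 model families, on diverse benchmarks covering mathematical reasoning, code generation, and instruction following, FusionRoute outperforms sequence-level and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

Insight: The novelty is a theoretical proof that pure expert-only routing is limited, together with a trainable complementary generator that expands the policy class and recovers optimal value functions under mild conditions. The result is joint optimization of token-level dynamic routing and logit correction.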

Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
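
A single decoding step of the router-plus-logit-addition scheme can be sketched like this. The argmax expert selection and all names are assumptions for illustration. A real system would operate on tensors over the full vocabulary rather than Python lists.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def fused_next_token_dist(expert_logits, router_scores, complementary_logits):
    """Select one expert, add the router's complementary logits, normalize.

    expert_logits: one logit vector per expert for the current step.
    router_scores: one score per expert (assumed: highest score wins).
    complementary_logits: the router's own correction over the vocabulary.
    """
    k = max(range(len(router_scores)), key=lambda i: router_scores[i])
    fused = [a + b for a, b in zip(expert_logits[k], complementary_logits)]
    return k, softmax(fused)

# Toy 3-token vocabulary with two experts.
chosen, probs = fused_next_token_dist(
    expert_logits=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    router_scores=[0.1, 0.9],
    complementary_logits=[0.0, 0.0, 3.0],
)
```

In this toy case the complementary logit shifts the most likely token away from the selected expert's own top choice, which is the kind of correction that pure expert-only routing cannot express.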

cs.CR

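[108] Decentralized Privacy-Preserving Federal Learning of Computer Vision Models on Edge Devices cs.CR | cs.CV

Damian Harenčák, Lukáš Gajdošech, Martin Madaras

TL;DR: This paper studies decentralized, privacy-preserving federated learning on edge devices. It analyzes how several privacy-enhancing techniques (homomorphic encryption, gradient compression, gradient noising, and others) affect convolutional neural networks and segmentation networks, and runs a proof-of-concept simulation on an NVIDIA Jetson TX2 module.

Details

Motivation: Federated learning protects privacy by sharing model parameters instead of raw data, yet research shows that parameter information can still leak private data. Existing methods focus mainly on server-side privacy risks and ignore malicious behavior by other clients, so privacy protection needs to improve with respect to both the server and other clients.

Result: The paper analyzes the negative effects of gradient compression and gradient noising on the accuracy of convolutional neural networks used for classification, shows that data reconstruction is difficult for segmentation networks, and implements a proof-of-concept simulation of the federated learning process on an NVIDIA Jetson TX2 edge device module.

Insight: The novelty is a combined evaluation of how multiple privacy-preserving methods affect neural networks in federated learning, with emphasis on privacy risks between clients in a decentralized setting. The study offers practical technical analysis and experimental validation for privacy-preserving federated learning on edge devices and supports real-world deployment of secure collaborative training.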

Abstract: Collaborative training of a machine learning model comes with a risk of sharing sensitive or private data. Federated learning offers a way of collectively training a single global model without the need to share client data, by sharing only the updated parameters from each client’s local model. A central server is then used to aggregate parameters from all clients and redistribute the aggregated model back to the clients. Recent findings have shown that even in this scenario, private data can be reconstructed only using information about model parameters. Current efforts to mitigate this are mainly focused on reducing privacy risks on the server side, assuming that other clients will not act maliciously. In this work, we analyzed various methods for improving the privacy of client data concerning both the server and other clients for neural networks. Some of these methods include homomorphic encryption, gradient compression, gradient noising, and discussion on possible usage of modified federated learning systems such as split learning, swarm learning or fully encrypted models. We have analyzed the negative effects of gradient compression and gradient noising on the accuracy of convolutional neural networks used for classification. We have shown the difficulty of data reconstruction in the case of segmentation networks. We have also implemented a proof of concept on the NVIDIA Jetson TX2 module used in edge devices and simulated a federated learning process.
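
Gradient compression and gradient noising are often implemented as top-k sparsification followed by additive Gaussian noise before the update leaves the client. The sketch below shows that common combination. The abstract does not say which variants the authors used, so the top-k rule, the parameter names, and the choice to add noise only to kept entries are assumptions.

```python
import random

def compress_and_noise(grad, keep_ratio=0.1, noise_std=0.01, seed=0):
    """Keep the largest-magnitude entries of `grad` and add Gaussian noise to them.

    Entries outside the top-k are zeroed (compression). Noise is added only
    to the kept entries so the shared update stays sparse.
    """
    rng = random.Random(seed)
    k = max(1, int(len(grad) * keep_ratio))
    top = set(sorted(range(len(grad)), key=lambda i: abs(grad[i]),
                     reverse=True)[:k])
    return [grad[i] + rng.gauss(0.0, noise_std) if i in top else 0.0
            for i in range(len(grad))]

# With the noise switched off, only the two largest-magnitude entries survive.
shared = compress_and_noise([0.1, -5.0, 0.3, 2.0], keep_ratio=0.5, noise_std=0.0)
```

Both knobs trade accuracy for privacy: a smaller `keep_ratio` and a larger `noise_std` reveal less about the local data but distort the aggregated update more, which is the negative effect on accuracy the abstract reports analyzing.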

cs.LG

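[109] ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues cs.LG | cs.CV | cs.HC | cs.IR

Behrad Binaei-Haghighi, Nafiseh Sadat Sajadi, Mehrad Liviyan, Reyhane Akhavan Kharazi, Fatemeh Amirkhani

TL;DR: This paper proposes ArtCognition, a multimodal AI framework that assesses a person's affective and psychological state from visual and kinematic cues in digital drawing. It fuses static visual features of the finished artwork with dynamic behavioral cues from the drawing process, and uses a Retrieval-Augmented Generation (RAG) architecture to ground the analysis in psychological knowledge and improve explainability.

Details

Motivation: Objectively assessing affective and psychological states through non-verbal channels is a major challenge. The paper explores digital drawing as an underexplored modality for affective sensing and automates the analysis of a widely used psychological instrument, the House-Tree-Person (HTP) test.

Result: Fusing visual and behavioral kinematic cues gives a more nuanced assessment than either modality alone. The extracted multimodal features correlate significantly with standardized psychological metrics, supporting the framework's potential as a scalable tool for clinicians.

Insight: The novelty is the fusion of static visual features of digital drawings with behavioral kinematic cues from the drawing process, plus a RAG architecture that links low-level features to high-level psychological interpretation. This improves explainability, reduces hallucination risk, and offers a new approach to non-intrusive affective state assessment and technology-assisted mental healthcare.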

Abstract: The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework’s potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
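
Kinematic cues such as stroke speed and pauses can be derived from timestamped pen samples. The sketch below is a minimal illustration. The `(x, y, t)` sample format, the 0.5-second pause threshold, and the output keys are assumptions, not the paper's feature definitions.

```python
import math

def stroke_kinematics(points, pause_threshold=0.5):
    """Compute mean pen speed and pause count from (x, y, t) samples.

    points: list of (x, y, t) tuples ordered by time t in seconds.
    A gap of at least `pause_threshold` seconds between consecutive
    samples is counted as a pause and excluded from the speed average.
    """
    speeds, pauses = [], 0
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt >= pause_threshold:
            pauses += 1
        elif dt > 0:
            speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    return {"mean_speed": mean_speed, "pauses": pauses}

# Two quick segments separated by a 0.9-second hesitation.
features = stroke_kinematics([(0, 0, 0.0), (3, 4, 0.1), (3, 4, 1.0), (6, 8, 1.1)])
```

Features of this kind would form the dynamic data stream that the framework fuses with visual features of the finished drawing.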

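[110] IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation cs.LG | cs.CV

Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie

TL;DR: This paper presents IGenBench, the first benchmark for evaluating the reliability of text-to-infographic generation, with 600 test cases covering 30 infographic types. The authors design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions drawn from 10 question types and verifies them with multimodal large language models. A comprehensive evaluation of 10 state-of-the-art text-to-image models reveals key reliability problems in generated infographics.

Details

Motivation: Current text-to-image models can produce attractive images, but their reliability when generating infographics (composite visual artifacts that combine data visualizations, text, and illustrations) is unclear. Generated charts may look correct while containing easily overlooked errors such as distorted data encoding or incorrect text.

Result: Ten SOTA T2I models are evaluated on IGenBench. The results show a three-tier performance hierarchy: the best model reaches a question-level accuracy (Q-ACC) of 0.90 but an infographic-level accuracy (I-ACC) of only 0.49. Data-related dimensions are a universal bottleneck (for example, Data Completeness reaches only 0.21), and no model achieves full end-to-end correctness.

Insight: The main contribution is the first benchmark and automated evaluation framework dedicated to the reliability of text-to-infographic generation, which decomposes a complex reliability assessment into automatically verifiable atomic questions. The analysis exposes core limitations of current models on composite, data-dense visual content, especially data accuracy and consistency across multiple elements, and points to directions for future model development.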

Abstract: Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.
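
The gap between Q-ACC (0.90) and I-ACC (0.49) is easier to interpret with a small sketch of how the two metrics could be computed. The abstract does not spell out the definitions. This sketch assumes Q-ACC is the fraction of all yes/no checks that pass and I-ACC is the fraction of infographics for which every check passes.

```python
def q_acc_and_i_acc(results):
    """Compute question-level and infographic-level accuracy.

    results: one list of booleans per infographic, one boolean per
    atomic yes/no check (True means the check passed).
    """
    total_questions = sum(len(r) for r in results)
    q_acc = sum(sum(r) for r in results) / total_questions
    i_acc = sum(all(r) for r in results) / len(results)
    return q_acc, i_acc

# Two infographics with four checks each; a single failed check drops
# I-ACC to 0.5 while Q-ACC stays at 0.875.
q_acc, i_acc = q_acc_and_i_acc([
    [True, True, True, False],
    [True, True, True, True],
])
```

Under this reading, I-ACC is far stricter than Q-ACC: one wrong data label is enough to fail an otherwise correct infographic, which is consistent with the large gap reported in the abstract.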

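[111] A Vision for Multisensory Intelligence: Sensing, Synergy, and Science cs.LG | cs.AI | cs.CL | cs.CV

Paul Pu Liang

TL;DR: This paper sets out a research vision for multisensory artificial intelligence over the next decade. It aims to extend AI beyond digital modalities such as text, vision, and audio to the full multisensory experience of language, sight, sound, touch, taste, and smell, changing how humans and AI interact.

Details

Motivation: AI today is largely confined to digital modalities such as text, vision, and audio, whereas human experience of the world is multisensory. AI therefore needs to move toward richer multisensory perception and interaction.

Result: As a vision paper, it reports no quantitative results or benchmarks. It introduces a series of projects, resources, and demos of recent advances from the Multisensory Intelligence group at the MIT Media Lab.

Insight: The paper proposes advancing multisensory AI through three interrelated themes. Sensing: extend how AI captures the world beyond the digital medium. Science: build a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Synergy: learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience.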

Abstract: Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, the field should develop a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab; see https://mit-mi.github.io/.

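[112] The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs cs.LG | cs.AI | cs.CL

Jiale Zhao, Xing Mou, Jinlin Wu, Hongyuan Yu, Mingrui Sun

TL;DR: This paper addresses the fragile safety alignment of Medical Multimodal Large Language Models (Medical MLLMs). It first builds a multidimensional safety evaluation framework, which reveals pervasive vulnerabilities in existing models across general and medical-specific safety dimensions, especially susceptibility to cross-modality jailbreak attacks. It also finds that medical fine-tuning causes catastrophic forgetting of the model's original safety alignment. To address this, the paper proposes a "Parameter-Space Intervention" approach that extracts intrinsic safety knowledge representations from the original base model and injects them into the target model while medical capabilities are being built, together with a fine-grained parameter search algorithm that optimizes the trade-off between safety and medical performance. Experiments show that the method significantly strengthens the safety guardrails of Medical MLLMs without additional domain-specific safety data, with minimal impact on core medical performance.

Details

Motivation: Medical MLLMs have made remarkable progress on specialized tasks, but safety research has lagged, which poses risks for real-world deployment. Medical fine-tuning often causes catastrophic forgetting of the original safety alignment, and this must be solved for safe and reliable deployment.

Result: The proposed Parameter-Space Intervention significantly strengthens the safety guardrails of Medical MLLMs while minimizing degradation of core medical performance, and it does not depend on additional domain-specific safety data.

Insight: The novelty is a multidimensional safety evaluation framework for systematically benchmarking Medical MLLMs, and a "Parameter-Space Intervention" method that extracts and injects the base model's intrinsic safety knowledge representations. Combined with a fine-grained parameter search algorithm, it achieves an effective trade-off between safety and medical performance and offers an efficient way to restore safety alignment without extra safety data.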

Abstract: Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model’s original safety alignment. To address this challenge, we propose a novel “Parameter-Space Intervention” approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.

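[113] Mitigating Position-Shift Failures in Text-Based Modular Arithmetic via Position Curriculum and Template Diversity cs.LG | cs.CL

Nikolay Yudin

TL;DR: This paper studies the robustness of character-level Transformers on modular addition posed as text, focusing on performance drops caused by input-format changes such as shifted character positions and different natural-language templates. The author proposes a training recipe that combines explicit expression boundary markers, a position curriculum, diverse template mixtures, and consistency training. It substantially improves robustness to position shift and to out-of-distribution templates while keeping in-distribution accuracy high.

Details

Motivation: Building on insights from the grokking literature, the paper studies robustness under input-format variation rather than in-distribution accuracy alone. It finds that existing models fail catastrophically when character positions shift or out-of-distribution natural-language templates are used, and aims to fix this previously overlooked failure mode.

Result: On modular addition with p=97, the baseline model performs well in distribution but collapses under position shift and template OOD tests. Across three random seeds, the proposed training recipe substantially improves robustness while keeping in-distribution accuracy high, whereas an ALiBi-style ablation fails to learn the task in this setup.

Insight: The contributions are: a training recipe that combines boundary markers, a position curriculum, template diversity, and consistency training; the observation that steering procedural generalization under noisy supervision requires explicitly training invariances that are absent from the data distribution; and a reproducible evaluation protocol with artifacts. The method forces the model to learn invariances through a structured curriculum and greater input diversity, making it a simple and effective robustness strategy.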

Abstract: Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, and focus on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions (“position shift”) or presented under out-of-distribution natural-language templates. Using a disjoint-pair split over all ordered pairs for p=97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.
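
The disjoint-pair split and the position-shift probe can be sketched as follows. The `<` and `>` boundary markers, the space padding, and the 50/50 split fraction are hypothetical choices. The abstract states only that pairs are split disjointly for p=97 and that expressions are moved to different absolute character positions.

```python
import random

def make_split(p=97, train_frac=0.5, seed=0):
    """Split all ordered pairs (a, b) with 0 <= a, b < p into disjoint sets."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

def render(a, b, p=97, shift=0, pad=" "):
    """Render one example as text, shifted right by `shift` pad characters.

    Returns the input string and the label (a + b) mod p. The angle
    brackets stand in for explicit expression boundary markers.
    """
    return pad * shift + f"<{a}+{b}>", (a + b) % p

train_pairs, test_pairs = make_split()
text, label = render(96, 5, shift=3)
```

A position-shift evaluation would render the same test pairs at several values of `shift` and compare accuracy across them. A position curriculum would widen the range of `shift` values seen during training.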

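[114] Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards cs.LG | cs.AI | cs.CL

Ali Rad, Khashayar Filom, Darioush Keivan, Peyman Mohajerin Esfahani, Ehsan Kamalinejad

TL;DR: This paper proposes the RLV$^\varepsilon$R framework, which analyzes the dynamics of reinforcement learning with noisy verification through a multi-armed bandit model. It shows that the effect of verification noise depends on the sign of Youden's index J=TPR-FPR, and that in the learning regime noise mainly affects convergence speed rather than the final outcome.

Details

Motivation: In practical RLVR the verifier is noisy (imperfect unit tests, human labeling errors, biased LLM judges). The paper asks whether noise merely slows learning or can cause it to fail.

Result: Controlled experiments and verifiable programming tasks confirm the theoretical predictions: when J>0 incorrect modes are suppressed (learning), when J=0 the process is neutral, and when J<0 incorrect modes dominate (anti-learning). For J>0, noise mainly affects convergence time rather than the final outcome.

Insight: The novelty is modeling RLVR dynamics as a replicator equation over groups of reasoning modes, which reveals J=0 as the phase-transition threshold between learning and failure and provides a general framework for analyzing RLVR stability, convergence, and algorithmic interventions.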

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean: unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited. This problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden’s index J=TPR-FPR. This yields a sharp phase transition: when J>0, the incorrect mass is driven toward extinction (learning); when J=0, the process is neutral; and when J<0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J>0, noise primarily rescales convergence time (“rate, not fate”). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J=0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
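
The three regimes can be reproduced with a toy one-dimensional simulation of the mass q on incorrect modes. The logistic drift `-lr * J * q * (1 - q)` is an illustrative assumption. It matches the abstract's statement that the drift depends only on J, but it is not the paper's exact equation. The function name and the step size are likewise hypothetical.

```python
def incorrect_mass_trajectory(tpr, fpr, q0=0.5, lr=0.5, steps=200):
    """Evolve the probability mass q on incorrect modes under a noisy verifier.

    tpr: probability that a correct completion is accepted.
    fpr: probability that an incorrect completion is accepted.
    Assumed replicator-style update: q <- q - lr * J * q * (1 - q),
    where J = tpr - fpr is Youden's index.
    """
    j = tpr - fpr
    q = q0
    for _ in range(steps):
        q += -lr * j * q * (1.0 - q)
    return q

learning = incorrect_mass_trajectory(tpr=0.9, fpr=0.2)       # J > 0
neutral = incorrect_mass_trajectory(tpr=0.6, fpr=0.6)        # J = 0
anti_learning = incorrect_mass_trajectory(tpr=0.2, fpr=0.9)  # J < 0
```

In this toy model a smaller positive J only slows the decay of q without changing its limit, which is the "rate, not fate" behavior. Once J turns negative, the same dynamics drive q toward 1.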

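[115] Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training cs.LG | cs.CL

Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen

TL;DR: This paper finds that large language models evolve in a strongly linear manner during reinforcement learning with verifiable rewards: both model weights and output log-probabilities correlate strongly and linearly with the number of training steps. Based on this observation, the authors predict future model states through weight extrapolation and logits extrapolation, substantially reducing computation.

Details

Motivation: RLVR training is computationally expensive because of its long exploration phase. The paper explores whether the linear regularity observed during training can be used to accelerate optimization.

Result: Weight extrapolation achieves performance comparable to standard RL training, while logits extrapolation outperforms continued RL training on all four benchmarks.

Insight: The paper reveals the linear nature of model evolution during RLVR training and proposes predicting future model states by extrapolating from intermediate checkpoints, offering a new route to efficient post-training.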

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.
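
The idea of predicting a future model state from intermediate checkpoints can be sketched as follows. This is a minimal illustration, assuming plain per-parameter linear extrapolation from two checkpoints; the abstract does not specify the exact procedure, and the function name, the dictionary-of-lists weight format, and the step values are hypothetical.

```python
def extrapolate_weights(w_early, w_late, step_early, step_late, step_target):
    """Linearly extrapolate every parameter to `step_target`.

    w_early, w_late: dicts mapping a parameter name to a flat list of floats,
    saved at training steps `step_early` and `step_late`.
    ASSUMPTION: a straight line through two checkpoints; not the paper's recipe.
    """
    if step_late == step_early:
        raise ValueError("checkpoints must come from different steps")
    # How many checkpoint intervals lie between the late checkpoint and the target.
    scale = (step_target - step_late) / (step_late - step_early)
    return {
        name: [b + scale * (b - a) for a, b in zip(w_early[name], w_late[name])]
        for name in w_late
    }


# Toy checkpoints at steps 100 and 200, extrapolated to step 400.
ckpt_100 = {"layer.weight": [0.0, 1.0]}
ckpt_200 = {"layer.weight": [0.5, 0.0]}
predicted_400 = extrapolate_weights(ckpt_100, ckpt_200, 100, 200, 400)
```

The same line-fitting idea applies to output log-probabilities instead of weights; in either case the extrapolation only makes sense over ranges where the observed linear trend holds.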

[116] On the Hidden Objective Biases of Group-based Reinforcement Learning cs.LG | cs.AI | cs.CL

Aleksandar Fontana, Marco Simoni, Giulio Rossolini, Andrea Saracino, Paolo Mori

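TL;DR: This paper presents a theoretical analysis of group-based reinforcement learning methods such as GRPO, revealing structural mismatches between reward optimization and the training objective. Using a unified surrogate formulation, the study identifies systematic biases: non-uniform group weighting induces gradient bias, the AdamW optimizer makes training dynamics insensitive to reward scaling, and momentum can push policy updates beyond the intended clipping region.

Details

Motivation: Although group-based reinforcement learning methods such as GRPO are widely used for LLM post-training and are empirically successful, they exhibit structural mismatches between reward optimization and the underlying training objective. The paper aims to analyze the hidden objective biases of these methods theoretically.

Result: The theoretical analysis reveals systematic biases in GRPO-style methods, including gradient bias, insensitivity to reward scaling, and momentum-driven updates that overshoot the clipping region. These findings provide principled guidance for future method design.

Insight: The novelty lies in analyzing GRPO-style methods within a unified surrogate formulation, which exposes fundamental biases arising from non-uniform group weighting, interaction with the AdamW optimizer, and momentum effects, and provides a theoretical basis for improving group-based reinforcement learning algorithms.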

Abstract: Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are now widely used to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
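
Property (ii) rests on a general fact about Adam-family optimizers: the update direction is the ratio of a first-moment estimate to the square root of a second-moment estimate, so multiplying every gradient by a constant leaves the update almost unchanged. The sketch below shows this for a single scalar parameter. It is an illustration under stated assumptions, not the paper's analysis: it assumes the gradient is proportional to the reward scale, omits AdamW's decoupled weight decay (which does not depend on the gradient), and uses made-up gradient values.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam parameter updates for scalar gradients."""
    m = 0.0  # first-moment (momentum) estimate
    v = 0.0  # second-moment estimate
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Hypothetical gradients, and the same gradients with rewards scaled by 10.
grads = [0.5, -0.2, 0.1, 0.4]
base = adam_updates(grads)
scaled = adam_updates([10.0 * g for g in grads])
max_gap = max(abs(a - b) for a, b in zip(base, scaled))
```

The two update sequences differ only through the small `eps` term, so `max_gap` is many orders of magnitude below the learning rate.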

cs.RO [Back]

[117] UNIC: Learning Unified Multimodal Extrinsic Contact Estimation cs.RO | cs.AI | cs.CV

Zhengtong Xu, Yuki Shirai

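TL;DR: This paper proposes UNIC, a unified multimodal framework for extrinsic contact estimation that requires no prior knowledge or camera calibration. By fusing visual, proprioceptive, and tactile modalities, it generalizes to novel objects and unstructured environments.

Details

Motivation: Existing contact estimation methods rely on restrictive assumptions such as predefined contact types, fixed grasp configurations, or camera calibration, which makes it hard to generalize to novel objects and unstructured environments. The paper aims to remove these limitations.

Result: UNIC achieves an average Chamfer distance error of 9.6 mm on unseen contact locations, performs well on unseen objects, and remains robust under missing modalities and dynamic camera viewpoints, supporting extrinsic contact estimation as a practical capability for contact-rich manipulation.

Insight: Contributions include a unified contact representation based on scene affordance maps and a multimodal fusion mechanism with random masking, which together enable prior-free, end-to-end data-driven learning and suggest a new direction for multimodal perception in robotic manipulation.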

Abstract: Contact-rich manipulation requires reliable estimation of extrinsic contacts, the interactions between a grasped object and its environment, which provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
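
For readers unfamiliar with the reported metric, the Chamfer distance between a predicted and a ground-truth set of contact points can be sketched as below. Several conventions exist (sum versus mean, squared versus unsquared distances); this sketch assumes the symmetric mean of unsquared nearest-neighbor distances, which may differ from the variant used in the paper. The point sets are hypothetical and expressed in millimeters.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    ASSUMED convention: average the mean nearest-neighbor distance
    from A to B and from B to A, using unsquared Euclidean distances.
    """
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


# Hypothetical predicted vs. ground-truth contact points (mm).
predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ground_truth = [(0.0, 3.0, 4.0), (10.0, 0.0, 0.0)]
error_mm = chamfer_distance(predicted, ground_truth)
```

This brute-force version is quadratic in the number of points; for large point clouds a spatial index such as a k-d tree is the usual choice.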

[118] Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration cs.RO | cs.CV

Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang

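TL;DR: This paper proposes CorDex, a framework for learning functional grasping with dexterous hands from a single human demonstration. A correspondence-based data engine generates diverse, high-quality training data in simulation, and a multimodal prediction network integrates visual and geometric information. The framework generalizes to novel object instances and significantly outperforms existing baselines.

Details

Motivation: The work targets two main bottlenecks in functional dexterous grasping: the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning in learned models.

Result: Extensive experiments across multiple object categories show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art (SOTA) baselines.

Insight: Contributions include a correspondence-based data engine that generates diverse synthetic data from a single demonstration, and a local-global fusion module and importance-aware sampling mechanism in the multimodal prediction network, which enable robust and computationally efficient functional grasp prediction.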

Abstract: Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

cs.CY [Back]

[119] Generative Teaching via Code cs.CY | cs.AI | cs.CL | cs.HC | cs.MA

Yuheng Wang, Runde Yang, Lin Wu, Jie Zhang, Jingru Fan

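TL;DR: This paper proposes a new paradigm called "Generative Teaching". Through the multi-agent framework TeachMaster, which uses code as an intermediate semantic medium, it automates the generation of clearly structured, editable educational videos, addressing the high cost and long production cycles of high-quality online educational content.

Details

Motivation: The scalability of high-quality online education is hindered by the high cost and slow cycles of labor-intensive manual content creation, and existing video generation methods fall short in ensuring pedagogical structure and precise control.

Result: Experiments show that TeachMaster significantly improves production efficiency without compromising structural coherence or visual fidelity, providing a robust solution for scalable education.

Insight: The novelty lies in shifting educators from manual creators to high-level directors and introducing a code-mediated multi-agent collaboration framework that enables interpretable, editable, automated generation of educational videos, which suggests a new direction for controllable content generation.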

Abstract: The scalability of high-quality online education is hindered by the high costs and slow cycles of labor-intensive manual content creation. Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise control due to their pixel-level, black-box nature. In this paper, we propose Generative Teaching, a novel paradigm that transitions educators from manual creators to high-level directors, allowing them to focus on pedagogical intent while autonomous agents handle the execution. To realize this vision, we introduce TeachMaster, a multi-agent framework that leverages code as an intermediate semantic medium. Unlike traditional video generation methods, TeachMaster orchestrates a collaborative team of agents (spanning planning, design, and rendering) to automate the production of interpretable, editable, and curriculum-ready educational videos. Experiments validate that TeachMaster significantly boosts production efficiency without compromising structural coherence or visual fidelity, providing a robust solution for scalable education.

cs.SE [Back]

[120] Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review cs.SE | cs.CL

Daoan Zhang, Shuo Zhang, Zijian Jin, Jiebo Luo, Shengyu Fu

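TL;DR: Sphinx is a unified LLM-based framework for automating pull request (PR) review. Through three key components (structured data generation, a checklist-based evaluation benchmark, and Checklist Reward Policy Optimization), it addresses the challenges of noisy supervision, limited contextual understanding, and inadequate evaluation metrics in existing approaches.

Details

Motivation: Automated PR review is essential for ensuring software quality, but existing methods suffer from noisy supervision, limited contextual understanding, and inadequate evaluation metrics. Sphinx aims to overcome these limitations with a unified framework.

Result: Experiments show that models trained with Sphinx achieve state-of-the-art review completeness and precision, outperforming proprietary and open-source baselines by up to 40% in checklist coverage.

Insight: Contributions include generating context-rich structured data by comparing pseudo-modified and merged code; introducing a checklist-based evaluation benchmark that goes beyond surface-level metrics such as BLEU; and proposing Checklist Reward Policy Optimization (CRPO), which uses rule-based, interpretable rewards to align model behavior with real-world review practice. Viewed objectively, these methods improve the model's context awareness, technical precision, and practical deployability.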

Abstract: Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40% in checklist coverage. Together, these components enable the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.

Zhao Tian

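TL;DR: Targeting the limitations of existing language models on complex programming tasks, this work proposes a series of improvements along three complementary directions (data quality, model architecture, and reasoning capability), aiming to advance the practical adoption of language models in software development.

Details

Motivation: Existing language models perform poorly in complex programming scenarios, mainly because of limitations in data quality, model architecture, and reasoning capability. This research aims to address these challenges systematically.

Result: The paper proposes a series of techniques (CODA, CodeDenoise, LEAM/LEAM++, muFiX, and Specine), but the abstract does not report specific quantitative results or benchmark performance.

Insight: Contributions include improving data quality through code difference-guided adversarial augmentation and denoising; strengthening code understanding with syntax-guided model architectures; and improving model reasoning through prompting and agent-based techniques. Together these offer a systematic approach to optimizing models for code-related tasks.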

Abstract: Recent advances in language models (LMs) have driven significant progress in various software engineering tasks. However, existing LMs still struggle with complex programming scenarios due to limitations in data quality, model architecture, and reasoning capability. This research systematically addresses these challenges through three complementary directions: (1) improving code data quality with a code difference-guided adversarial augmentation technique (CODA) and a code denoising technique (CodeDenoise); (2) enhancing model architecture via syntax-guided code LMs (LEAM and LEAM++); and (3) advancing model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). These techniques aim to promote the practical adoption of LMs in software development and further advance intelligent software engineering.

eess.IV [Back]

[122] Scalable neural pushbroom architectures for real-time denoising of hyperspectral images onboard satellites eess.IV | cs.CV

Ziyao Yi, Davide Piccinini, Diego Valsesia, Tiziano Bianchi, Enrico Magli

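TL;DR: This paper proposes a scalable neural network architecture for real-time denoising of pushbroom hyperspectral images onboard satellites, designed to meet requirements for low power, real-time operation, and fault tolerance.

Details

Motivation: Next-generation Earth observation satellites need to deploy intelligent models directly onboard to reduce the latency of the ground processing chain, but the unique constraints of onboard hyperspectral imagers (such as low power, real-time operation, and radiation fault tolerance) remain largely unexplored in traditional computer vision research.

Result: The proposed architecture processes images in real time on low-power hardware (that is, processing one line takes no longer than acquiring the next) and delivers denoising quality competitive with more complex SOTA models.

Insight: Contributions include a mixture of denoisers that provides fault tolerance and dynamic power scaling; a causal line-by-line processing architecture that matches the acquisition process of pushbroom sensors and greatly reduces memory requirements; and a design space offering tradeoffs among power, fault tolerance, and denoising quality.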

Abstract: The next generation of Earth observation satellites will seek to deploy intelligent models directly onboard the payload in order to minimize the latency incurred by the transmission and processing chain of the ground segment, for time-critical applications. Designing neural architectures for onboard execution, particularly for satellite-based hyperspectral imagers, poses novel challenges due to the unique constraints of this environment and imaging system that are largely unexplored by the traditional computer vision literature. In this paper, we show that this setting requires addressing three competing objectives, namely high-quality inference with low complexity, dynamic power scalability and fault tolerance. We focus on the problem of hyperspectral image denoising, which is a critical task for enabling effective downstream inference and which highlights the constraints of the onboard processing scenario. We propose a neural network design that addresses the three aforementioned objectives with several novel contributions. In particular, we propose a mixture of denoisers that is resilient to radiation-induced faults and allows for time-varying power scaling. Moreover, each denoiser employs an innovative architecture where an image is processed line-by-line in a causal way, with a memory of past lines, in order to match the acquisition process of pushbroom hyperspectral sensors and greatly limit memory requirements. We show that the proposed architecture can run in real-time, i.e., process one line in the time it takes to acquire the next one, on low-power hardware and provide competitive denoising quality with respect to significantly more complex state-of-the-art models. We also show that the power scalability and fault tolerance objectives provide a design space with multiple tradeoffs between those properties and denoising quality.
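
The causal, line-by-line processing pattern with a bounded memory of past lines can be sketched as a streaming generator. This is only an illustration of the data flow: the per-pixel moving average below is a stand-in chosen for brevity, not the paper's neural denoiser, and the function name and `memory` parameter are hypothetical.

```python
from collections import deque


def denoise_stream(lines, memory=3):
    """Causally filter a stream of sensor lines.

    Each output line depends only on the current line and up to
    `memory - 1` previous lines, so the buffer never holds the full image.
    STAND-IN filter: per-pixel mean over the buffered lines.
    """
    buffer = deque(maxlen=memory)  # oldest line is dropped automatically
    for line in lines:
        buffer.append(line)
        count = len(buffer)
        # zip(*buffer) groups the same pixel position across buffered lines.
        yield [sum(pixel_column) / count for pixel_column in zip(*buffer)]


# Three hypothetical two-pixel lines arriving one at a time.
incoming = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
filtered = list(denoise_stream(incoming, memory=2))
```

Because the generator emits one output per input line, the real-time condition in the abstract reduces to checking that the work inside one loop iteration finishes before the sensor delivers the next line.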

Table of Contents

cs.CL [Back]

[1] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation cs.CL | cs.IR

[2] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems cs.CL | cs.AI | cs.CY | cs.HC

[3] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation cs.CL | cs.AI | cs.LG

[4] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback cs.CL | cs.CV | cs.LG | cs.SE

[5] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis cs.CL | cs.AI | cs.SI

[6] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach cs.CL | cs.AI

[7] Complexity Agnostic Recursive Decomposition of Thoughts cs.CL | cs.AI | cs.IT

[8] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation cs.CL

[9] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking cs.CL | cs.AI

[10] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs cs.CL | cs.AI | cs.CY

[11] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation cs.CL

[12] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence cs.CL

[13] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback cs.CL

[14] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization cs.CL

[15] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions cs.CL | cs.AI | cs.LG

[16] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation cs.CL

[17] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR cs.CL

[18] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset cs.CL | cs.AI

[19] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark cs.CL

[20] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs cs.CL | cs.AI | cs.FL

[21] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation cs.CL | cs.CV

[22] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards cs.CL

[23] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking cs.CL

[24] AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs cs.CL

[25] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval cs.CL

[26] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks cs.CL | cs.AI

[27] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence cs.CL

[28] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems cs.CL | cs.AI

[29] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection cs.CL

[30] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News cs.CL

[31] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs cs.CL

[32] V-FAT: Benchmarking Visual Fidelity Against Text-bias cs.CL | cs.CV | cs.LG | cs.MM

[33] GenProve: Learning to Generate Text with Fine-Grained Provenance cs.CL

[34] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction cs.CL | cs.SD

[35] Text as a Universal Interface for Transferable Personalization cs.CL | cs.AI

[36] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization cs.CL

[37] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models cs.CL | cs.AI | q-bio.NC

[38] Agent-as-a-Judge cs.CL | cs.AI

[39] RelayLLM: Efficient Reasoning via Collaborative Decoding cs.CL | cs.AI | cs.LG

[40] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems cs.CL

[41] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence cs.CL | cs.CY | cs.LG

[42] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization cs.CL | cs.AI | cs.LG

cs.CV [Back]

[43] Unified Text-Image Generation with Weakness-Targeted Post-Training cs.CV | cs.AI

[44] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers cs.CV

[45] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets cs.CV | cs.LG

[46] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache cs.CV

[47] Combining facial videos and biosignals for stress estimation during driving cs.CV

[48] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection cs.CV | cs.AI

[49] 3D-Agent:Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation cs.CV | cs.AI

[50] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization cs.CV | cs.CL

[51] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving cs.CV

[52] Vision-Language Agents for Interactive Forest Change Analysis cs.CV | cs.AI | cs.CL

[53] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction cs.CV

[54] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing cs.CV

[55] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation cs.CV | cs.AI

[56] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment cs.CV

[57] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning cs.CV | cs.CL

[58] DB-MSMUNet:Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation cs.CV

[59] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution cs.CV

[60] WebCryptoAgent: Agentic Crypto Trading with Web Informatics cs.CV

[61] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models cs.CV

[62] On the Holistic Approach for Detecting Human Image Forgery cs.CV

[63] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection cs.CV

[64] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model’s Mathematical Text Recognition cs.CV

[65] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models cs.CV | cs.AI

[66] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models cs.CV | cs.AI | cs.CL | cs.MM

[67] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference cs.CV

[68] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models cs.CV

[69] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform cs.CV | cs.LG

[70] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics cs.CV | cs.AI

[71] Patch-based Representation and Learning for Efficient Deformation Modeling cs.CV

[72] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs) cs.CV | cs.LG

[73] Driving on Registers cs.CV | cs.AI | cs.RO

[74] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition cs.CV

[75] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing cs.CV

[76] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding cs.CV | cs.AI

[77] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control cs.CV