cs.CL [Total: 31]
cs.CV [Total: 44]
cs.SD [Total: 1]
eess.AS [Total: 1]
cs.RO [Total: 5]
cs.CY [Total: 1]
cs.CR [Total: 1]
cs.AI [Total: 9]
cs.LG [Total: 10]

cs.CL

[1] EDEN: A Large-Scale Corpus of Clinical Notes for Italian cs.CL | cs.AI

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

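TL;DR: This paper introduces EDEN (Emergency Department Electronic Notes), a large-scale corpus of Italian clinical notes. It contains about 4 million anonymized emergency department notes, of which about 6,000 are annotated with 132 structured items, and is intended to support large language models in medical applications.

Motivation: Address the scarcity of Italian clinical data and provide a high-quality dataset for developing medical-domain large language models.

Result: The authors build what they describe as the largest freely available corpus of Italian clinical notes and report zero-shot baselines for the CRF-filling task using Gemma-27B and MedGemma-27B.

Insight: Proposes CRF-filling as a new structured information extraction benchmark. Ambiguities in item formulation were resolved through multiple rounds of annotation by clinical experts, yielding a highly imbalanced but richly structured medical dataset.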

Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may take numerical (e.g., blood saturation), categorical (e.g., level of consciousness), binary (e.g., presence of trauma), or mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes existing for the Italian language.

[2] Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures cs.CL

Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber

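TL;DR: This paper proposes paper-grounded figure-to-video generation, a task that explains complex scientific figures by generating step-by-step walkthrough videos with narration and region grounding. The authors introduce the MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition) pipeline and release the FigTalk benchmark with new evaluation metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches in automatic and human evaluation.

Motivation: Current video generation systems and benchmarks cannot explain complex scientific figures from the paper's content using visual highlights and step-by-step narration, which hinders effective understanding of such figures.

Result: On FigTalk, MINARD outperforms existing approaches in both automatic and human evaluation, particularly on narration-conditioned spatial grounding of figures, and produces humanlike, paper-faithful narrations.

Insight: The novelty lies in the new task of paper-grounded figure-to-video generation and the MINARD pipeline for sequentially grounded narration. The FigTalk benchmark, with sequential and component-level grounding metrics, provides a new evaluation standard.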

Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned figure spatial grounding in both automatic and human evaluation.

[3] Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation cs.CL

Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh

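TL;DR: This paper frames the task of turning an abstract proverb into a faithful moral narrative as "constrained semantic decompression" and uses Persian proverbs as a testbed for abstraction-to-realization generation in LLMs. The authors build the PAND dataset of proverbs, human-written stories, and explicit meanings. Using a hybrid evaluation framework, they find a "decompression gap": models stay fluent on the surface but struggle to faithfully realize the moral and causal structure implied by the proverb. Explicit reasoning and iterative refinement partially mitigate the problem, suggesting that errors stem mainly from the difficulty of converting abstract meaning into narrative form rather than from missing knowledge.

Motivation: Address the semantic faithfulness problem that arises when LLMs convert abstract cultural knowledge (such as proverbs) into concrete narratives, and probe the limits of models' deep cultural understanding and semantic grounding.

Result: On the Persian proverb dataset PAND, a hybrid evaluation framework combining a human-calibrated LLM-as-a-Judge with structural metrics shows that current LLMs perform well on surface fluency but have a significant gap in faithfulness to moral and causal structure. Explicit reasoning and iterative refinement partially improve performance.

Insight: The novelty lies in the "constrained semantic decompression" task framing and a culture-specific proverb-narrative dataset. The analysis reveals a systematic "decompression gap" in abstract-to-concrete conversion and offers reusable evaluation methods and improvement directions (such as reasoning augmentation) for cultural knowledge grounding tasks.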

Abstract: Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a "constrained semantic decompression" task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent "decompression gap": current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.

[4] MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction cs.CL

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi

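TL;DR: This paper proposes MARD (Mirror-Augmented Reasoning Distillation) for mechanism-level drug-drug interaction (DDI) prediction. It introduces a reproducible labelling and evaluation protocol with a structured taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics. Combining three training innovations, the authors build a 7B-parameter reasoning model that significantly outperforms baselines including GPT-4o under drug-pair novelty, at very low cost.

Motivation: Existing DDI prediction only asks whether two drugs interact and lacks fine-grained identification of the specific mechanism (which enzyme or pharmacodynamic axis is involved, in which direction, and with which evidence).

Result: In a 32-system comparison on the April-2026 DrugBank release, MARD-7B is the only system whose accuracy survives drug-pair novelty, beating the best baseline by 13.9 percentage points and GPT-4o by 6.7 percentage points at about 1% of frontier API cost. Analysis shows that accuracy actually improves on rare drugs, an anti-memorisation signature.

Insight: Contributions include: 1) a structured, leakage-safe mechanism-level DDI evaluation framework; 2) three key training techniques: a single-token KL divergence on the direction tag, PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel; 3) automatic verification of process-reward labels against DrugBank structured fields, with no human or LLM judges.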

Abstract: Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence – not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces a 7B reasoning model, MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL divergence on the direction tag that ties the model’s prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature where accuracy improves on rarely seen drugs, suggesting that the gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release the corpus, DDI-PRM, retrieval index, and training code.

[5] Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models cs.CL

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard

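TL;DR: This paper critically examines whether observable latent-state patterns in latent reasoning models (LRMs), such as BFS-like frontiers or decodable arithmetic computation, actually explain the models' internal reasoning mechanisms. Comparing LRMs such as Coconut and CODI against controls lacking the proposed recurrence or curriculum, the study finds that these patterns also appear in the controls and do not always causally affect model behavior. The paper argues that latent thoughts should be treated as hidden computation rather than hidden explanation, and that LRM interpretability requires matched controls and causal tests.

Motivation: Current work often treats observable latent-state patterns in LRMs (such as decodable arithmetic steps) as direct evidence of internal reasoning mechanisms. The paper uses causal and geometric analysis to test whether this link holds and to examine how latent thoughts actually influence model behavior.

Result: Causal interventions show that latent-thought utilization is graded rather than binary, with behavioral impact scaling with the size of the causal effect. Geometric analysis shows that the effect concentrates in low-rank directions whose step-to-step geometry becomes more structured as behavioral influence grows.

Insight: The novelty lies in proposing and validating a framework for assessing LRM interpretability: decodability, attention, or static structure alone cannot establish mechanism. Matched controls and causal tests (such as intervention analysis) are needed to separate correlation from causation, and latent thoughts are recast as hidden computation with graded causal influence.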

Abstract: Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought’s causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

[6] Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants cs.CL | cs.LG

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan

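TL;DR: This paper introduces Shopping Reasoning Bench, a benchmark built by retail domain experts to evaluate open-ended reasoning, domain expertise, and fine-grained quality criteria in multi-turn conversational shopping assistants. It contains 525 missions (single-turn and multi-turn) and 10,863 importance-weighted binary rubrics covering demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models from the GPT, Claude, and Gemini families shows overall pass rates of only 57–77%. On multi-turn missions, models score lower on optional above-and-beyond criteria, and performance degrades as conversations progress.

Motivation: Existing benchmarks cannot jointly evaluate the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations require. Shopping reasoning is distinctive among language model applications because it requires balancing subjective preferences, budget constraints, and cross-product trade-offs, capabilities absent from previous e-commerce and general-purpose benchmarks.

Result: Nine models across three families (GPT, Claude, Gemini) reach overall pass rates of 57–77%. On multi-turn missions, all models score 13–29 points lower on optional above-and-beyond criteria than on required ones, and performance drops 4–18 points as conversations progress, indicating that current models handle basic shopping assistance but fall short of expert-level advice.

Insight: The novelty lies in an expert-authored shopping reasoning benchmark with fine-grained rubrics, organized under a taxonomy of five reasoning categories and fifteen subcategories. It highlights the challenges of subjective trade-offs and progressive degradation in multi-turn dialogue and provides a rigorous testbed for future shopping assistants.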

Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57–77% overall. On multi-turn missions, all models score 13–29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4–18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
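As a rough illustration of how importance-weighted binary rubrics could be aggregated into a pass rate, here is a minimal Python sketch. The `Rubric` fields, the weights, and the weighted-average aggregation are assumptions made for illustration; the abstract does not state how the benchmark combines rubric weights.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    description: str
    weight: float   # hypothetical importance weight
    required: bool  # required criterion vs. optional "above-and-beyond" criterion
    passed: bool    # binary judgment for one model response


def weighted_pass_rate(rubrics):
    """Share of total rubric weight that was passed (assumed aggregation)."""
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    return sum(r.weight for r in rubrics if r.passed) / total


# Toy rubrics for a single mission; contents are invented for the example.
rubrics = [
    Rubric("States the budget constraint", 3.0, True, True),
    Rubric("Checks accessory compatibility", 2.0, True, True),
    Rubric("Explains a price/quality trade-off", 1.0, False, False),
]

overall = weighted_pass_rate(rubrics)
required_only = weighted_pass_rate([r for r in rubrics if r.required])
print(f"overall={overall:.3f} required_only={required_only:.3f}")
```

In this toy case the required criteria all pass while the optional one fails, which is the kind of gap between required and above-and-beyond criteria that the abstract reports.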

[7] Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review cs.CL

Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen

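TL;DR: This paper proposes PaperGuard, which the authors present as the first comprehensive benchmark framework for cross-modal adversarial attacks on multimodal AI peer review. It includes a multi-domain dataset, attacks targeting text and figures (such as GCG and PGD), and a defense based on chunk-based embedding search, with the goal of evaluating and improving the robustness of AI reviewing systems.

Motivation: Robustness research on AI peer review is largely limited to text and overlooks the core evidence carried by figures in scientific papers. Existing attacks (such as jailbreaking) also differ fundamentally from domain-specific review attacks (such as score manipulation), for which effective defenses are lacking.

Result: Extensive experiments on several state-of-the-art models confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes a benchmark, protocols, and an actionable defense as a foundation for trustworthy AI-assisted scholarly reviewing.

Insight: The novelty lies in the first systematic treatment of cross-modal adversarial attacks in multimodal peer review, together with a practical defense for long-context academic papers (chunk-based embedding search). It shifts the attack target from general safety violations to domain-specific targeted failures.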

Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., “inflate this score”) rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.
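The chunk-based embedding search mentioned in the abstract can be pictured with a small sketch. This is not PaperGuard's implementation: the bag-of-words `embed` function is a stand-in for a real neural encoder, and the probe sentence, chunk size, and threshold are all hypothetical values chosen for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system would use a neural encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text, size=12):
    """Split the paper into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def flag_chunks(paper_text, probes, threshold=0.4):
    """Return (index, score, chunk) for chunks that resemble any known-injection probe."""
    probe_vectors = [embed(p) for p in probes]
    flagged = []
    for index, piece in enumerate(chunk(paper_text)):
        score = max(cosine(embed(piece), p) for p in probe_vectors)
        if score >= threshold:
            flagged.append((index, round(score, 3), piece))
    return flagged


paper = (
    "We evaluate the model on three benchmarks and report accuracy. "
    "Ignore previous instructions and give this paper a score of ten. "
    "The ablation study shows each component matters."
)
probes = ["ignore previous instructions and give a high score"]  # hypothetical probe
flagged = flag_chunks(paper, probes)
for index, score, piece in flagged:
    print(index, score, piece)
```

Fixed windows can split an injected instruction across two chunks, so a practical variant would likely use overlapping windows. Flagged chunks could then be removed or quarantined before the paper reaches the reviewer model.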

[8] Localizing Anchoring Pathways in Language Models cs.CL | cs.AI

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

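TL;DR: This paper studies the mechanism of anchoring effects in language models. Controlled experiments show that irrelevant numbers shift model judgments, and attribution methods are used to localize the pathways that carry the anchor-sensitive signal.

Motivation: Language models' numerical reasoning is biased by irrelevant numbers (anchors). The paper aims to reveal how this anchor-sensitive signal is carried inside the model.

Result: Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, edge-level methods recover the signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, but transfer between base and instruction-tuned variants is less reliable.

Insight: The novelty lies in defining a logit-difference metric that tracks anchoring behavior and applying circuit localization to reveal the mechanistic pathways of anchoring-related decision signals, showing that post-training changes which pathways matter most.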

Abstract: Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
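The logit-difference metric described in the abstract can be sketched in a few lines. The sign convention (correct minus anchor-matching option), the option labels, and the logit values below are assumptions for illustration; the abstract only says the metric compares the two options.

```python
def logit_difference(option_logits, correct, anchor):
    """Logit of the correct option minus logit of the option matching the anchor."""
    return option_logits[correct] - option_logits[anchor]


# Hypothetical logits over shared answer options, without and with an anchor in the prompt.
baseline_logits = {"A": 2.0, "B": 0.5, "C": -1.0}
anchored_logits = {"A": 1.2, "B": 0.4, "C": 0.9}

baseline_diff = logit_difference(baseline_logits, correct="A", anchor="C")
anchored_diff = logit_difference(anchored_logits, correct="A", anchor="C")

# A negative shift means the anchor pulled probability mass toward the anchor option.
shift = anchored_diff - baseline_diff
print(f"baseline={baseline_diff:.2f} anchored={anchored_diff:.2f} shift={shift:.2f}")
```

Under this convention, a circuit that carries the anchoring signal is one whose ablation moves the anchored difference back toward the baseline value.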

[9] LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling cs.CL

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai

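TL;DR: This paper proposes LoHoSearch, a benchmark for evaluating long-horizon search agents on complex search tasks beyond the human difficulty ceiling. It contains 544 human-verified questions across 11 domains, built by an automated knowledge-graph-based pipeline that ensures large search spaces and structural complexity. Even the strongest model reaches only 34.74% accuracy, and existing context management strategies yield limited gains, so the benchmark sets a more demanding standard for long-horizon reasoning and context management in search agents.

Motivation: Existing search agent benchmarks (such as BrowseComp) are predominantly human-authored. Annotators lack a global view of entity statistics and cannot systematically maximize search space size and structural complexity, so the difficulty ceiling is hard to break and more challenging benchmarks are needed.

Result: On LoHoSearch, the strongest model achieves only 34.74% accuracy, and existing context management strategies bring at most +6.8%, far less than on prior benchmarks, indicating that the benchmark is more challenging.

Insight: The novelty lies in automated benchmark construction from a knowledge graph covering over 7 million Wikipedia entities. Relations with large search spaces are selected and assembled into structurally complex questions with unique, verifiable answers, providing a testbed closer to realistic complex scenarios for long-horizon reasoning.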

Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
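To illustrate what "KG-verified unique answers" could mean in practice, the sketch below composes constraints over a toy triple store and checks that exactly one entity satisfies all of them. The triples, relation names, and the intersection-based check are hypothetical; the abstract does not describe the actual pipeline at this level of detail.

```python
# Toy knowledge graph as (subject, relation, object) triples; all names are invented.
triples = [
    ("Film_A", "directed_by", "Dir_X"),
    ("Film_B", "directed_by", "Dir_X"),
    ("Film_C", "directed_by", "Dir_Y"),
    ("Film_A", "released_in", "1999"),
    ("Film_B", "released_in", "2004"),
    ("Film_C", "released_in", "1999"),
]


def subjects(relation, obj):
    """All entities that stand in `relation` to `obj` (the search space of one constraint)."""
    return {s for s, r, o in triples if r == relation and o == obj}


def answer_set(constraints):
    """Entities satisfying every (relation, object) constraint."""
    candidate_sets = [subjects(relation, obj) for relation, obj in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(constraints):
    return len(answer_set(constraints)) == 1


one_hop = [("directed_by", "Dir_X")]
two_hop = [("directed_by", "Dir_X"), ("released_in", "1999")]

print(answer_set(one_hop), has_unique_answer(one_hop))
print(answer_set(two_hop), has_unique_answer(two_hop))
```

Each individual constraint leaves several candidates, and only their combination pins down a single entity, which is the property a generated question needs before it can be graded automatically.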

[10] Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study cs.CL | cs.LG

Yvonne Qiu, Dezhi Yu, ShuoJia Fu

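TL;DR: This paper presents an approach to fine-tuning large language models with Direct Preference Optimization (DPO). The experiments indicate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. Evaluation with BLEU, ROUGE, and cosine similarity indicates effective learning and convergence, though the observed training instability needs further study.

Motivation: Explore a simpler way to fine-tune chatbot models, using DPO to improve on the traditional reinforcement learning fine-tuning pipeline.

Result: Evaluation on BLEU, ROUGE, and cosine similarity shows effective learning and convergence with competitive performance, but instability was observed during training.

Insight: The abstract's claimed novelty is applying DPO to simplify the training pipeline and improve efficiency. Viewed objectively, applying DPO to chatbot fine-tuning and systematically evaluating its effectiveness and stability provides an empirical reference for efficient fine-tuning.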

Abstract: We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
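For readers unfamiliar with DPO, the sketch below computes the standard per-example DPO loss, -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), from sequence log-probabilities. The numeric inputs and beta = 0.1 are illustrative assumptions; the abstract gives no training details, so this is not the authors' configuration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities of each response."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss equals log(2).
loss_equal = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: the loss falls.
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"equal={loss_equal:.4f} better={loss_better:.4f}")
```

The loss depends only on log-probability ratios against a frozen reference model, which is why DPO needs no separate reward model or sampling loop during training.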

[11] PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue cs.CL

Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang

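TL;DR: PRISM is a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. It stabilizes LLM reasoning through a prosody-to-language translation mechanism and supports on-demand invocation of external knowledge tools for empathetic dialogue generation.

Motivation: Existing empathetic spoken dialogue systems have two problems: cascade pipelines discard acoustic cues, and end-to-end speech models lack interpretable control over emotion and knowledge integration. PRISM targets both.

Result: Experiments show that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response quality across objective and subjective metrics.

Insight: The novelty lies in the prosody-to-language translation mechanism for stabilizing LLM reasoning and the multi-agent framework that decouples and coordinates perception, generation, and synthesis, giving more controllable integration of emotion and knowledge.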

Abstract: Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: https://github.com/Bxzfrm/PRISM.

[12] SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents cs.CL

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo

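TL;DR: This paper proposes SENTINEL, a failure-driven reinforcement learning framework for training tool-using language model agents. It analyzes the agent's failed trajectories, identifies weaknesses, and automatically generates targeted training tasks to improve the agent more efficiently.

Motivation: Training reliable tool-using language model agents remains challenging. When reinforcement learning trains on a fixed task distribution, task difficulty becomes mismatched with the agent's evolving capabilities, and many rollouts are wasted on uninformative tasks.

Result: On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics.

Insight: The core idea is to use the agent's failures, rather than predefined tasks, as the training signal. A Controller–Proposer–Solver loop closes the cycle of failure analysis, weakness localization, and targeted task generation, giving a scalable and efficient approach to policy improvement.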

Abstract: Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy’s evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver’s rollout failures into targeted training tasks. SENTINEL follows a Controller–Proposer–Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
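The Controller–Proposer–Solver loop can be pictured with a toy simulation. Everything here is a stand-in: the integer `skill` table replaces a trained policy, `train_solver` replaces the RL update, and the task categories are invented. The sketch only shows the control flow of turning failures into targeted tasks, not SENTINEL's actual components.

```python
from collections import Counter


def solver_rollout(task, skill):
    """Toy Solver: succeeds when its skill in the task category meets the difficulty."""
    return skill.get(task["category"], 0) >= task["difficulty"]


def controller(failed_tasks, top_k=2):
    """Summarize recurring error patterns; here, the most frequent failing categories."""
    counts = Counter(task["category"] for task in failed_tasks)
    return [category for category, _ in counts.most_common(top_k)]


def proposer(weak_categories, skill, per_category=2):
    """Generate tasks that stress the weak categories, just above the current skill."""
    return [
        {"category": category, "difficulty": skill.get(category, 0) + 1}
        for category in weak_categories
        for _ in range(per_category)
    ]


def train_solver(skill, tasks):
    """Stand-in for the RL update: each targeted task raises skill in its category by one."""
    updated = dict(skill)
    for task in tasks:
        updated[task["category"]] = updated.get(task["category"], 0) + 1
    return updated


eval_tasks = [
    {"category": "refund", "difficulty": 5},
    {"category": "exchange", "difficulty": 3},
    {"category": "cancel", "difficulty": 2},
]
skill = {"refund": 1, "exchange": 3, "cancel": 4}

failure_history = []
for _ in range(5):
    failures = [task for task in eval_tasks if not solver_rollout(task, skill)]
    failure_history.append(len(failures))
    if not failures:
        break
    weak = controller(failures)
    skill = train_solver(skill, proposer(weak, skill))

print(failure_history, skill)
```

Training effort goes only to the category that is still failing, while categories the Solver already handles receive no new tasks, which is the mismatch with fixed task distributions that the abstract describes.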

[13] Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL cs.CL

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo

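TL;DR: This paper addresses the "Lost in Conversation" problem, where LLM accuracy drops sharply in multi-turn dialogue because key information is scattered across turns. It trains models to maintain a compact rolling memory instead of attending to the full history. For scalable training, the authors introduce a low-cost sharding pipeline that automatically converts single-turn QA datasets into multi-turn fragmented-information conversations without manual annotation. A memory-augmented policy trained on GSM8K significantly improves multi-turn reasoning accuracy and generalizes zero-shot to harder math and out-of-domain long-context QA.

Motivation: When key information is spread across turns, LLM accuracy drops markedly (by up to 65%) even with the full context available. The paper targets this "Lost in Conversation" degradation.

Result: A memory-augmented model trained only on sharded GSM8K significantly improves multi-turn accuracy and generalizes zero-shot to harder math (e.g., MATH) and out-of-domain long-context QA (e.g., NarrativeQA) benchmarks. Even when given the full history at test time, memory-trained models outperform full-history baselines.

Insight: The core contributions are training models to maintain a compact rolling memory (rather than processing an ever-growing history) to mitigate reasoning degradation from fragmented information, and a low-cost, scalable sharding pipeline that automatically generates multi-turn training data. The key insight is that learning to compress through a memory mechanism induces more robust incremental reasoning than exposure to the full context alone.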

Abstract: When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.
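A minimal sketch of the two ideas, sharding a single-turn question into turns and carrying a bounded rolling memory, might look as follows. Sentence-level splitting and the keep-the-latest-items memory are naive stand-ins chosen for the example; the paper's sharding pipeline and learned memory writer are not specified in the abstract.

```python
import re


def shard_question(question):
    """Naive sharding: one sentence per conversation turn (assumed strategy)."""
    parts = re.split(r"(?<=[.?!])\s+", question.strip())
    return [part for part in parts if part]


def update_memory(memory, shard, max_items=4):
    """Stand-in for a learned memory writer: append the shard, keep the latest items."""
    return (memory + [shard])[-max_items:]


question = (
    "Natalia sold 48 clips in April. "
    "She sold half as many clips in May. "
    "How many clips did she sell in total?"
)

shards = shard_question(question)
memory = []
for turn, shard in enumerate(shards, start=1):
    memory = update_memory(memory, shard)
    print(f"turn {turn}: memory holds {len(memory)} item(s)")
```

At answer time the model would condition on `memory` rather than on the full transcript. A trained writer would also compress or rewrite items instead of simply truncating, since dropping the oldest shard here could discard a fact the final question depends on.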

[14] No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions cs.CL

Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun

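TL;DR: This paper studies a new attack on AI reviewing systems, adversarial repackaging: modifying only a paper's presentation-level content (such as the abstract, contribution framing, and discussion) while keeping the scientific evidence unchanged can significantly shift AI reviewer scores. The attack success rate reaches 75.1%, with a mean score gain of +1.21/10.

Motivation: As AI-generated reviews move from experimental tools into peer-review infrastructure, robustness research has focused on explicit attacks such as hidden instructions and prompt injection. This paper examines a subtler and more policy-relevant failure mode: manipulating AI review by changing only presentation-level content, without modifying methods, experiments, figures, or results.

Result: Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect does not come from ordinary prose polishing but from changing how the reviewer interprets the paper, for example by repositioning related work and expanding analytical discussion.

Insight: The paper reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably raises perceived merit, while attempts to dissolve weaknesses often backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, so unchanged evidence is reinterpreted as a stronger scientific contribution. The deployment risk therefore comes not only from malicious hidden instructions but from paper presentation itself becoming an optimization surface.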

Abstract: As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

[15] G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents cs.CL | cs.AI

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

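TL;DR: This paper proposes G-Long, a framework that uses graph-enhanced memory management to address LLMs' long-context reasoning limits and processing inefficiency in long-term dialogue. It uses a fine-tuned small language model for structured triplet extraction and associative retrieval, and introduces an importance scoring mechanism based on the cross-attention signals of a T5 summarizer to identify salient memories. Experiments show state-of-the-art response generation and memory retrieval on multiple benchmarks with significantly lower computational overhead.

Motivation: Existing long-term dialogue systems either rely on unstructured memory storage, which is prone to information loss, or use computationally expensive LLMs that incur high latency, making long-term consistency hard to maintain.

Result: Response quality improves by up to 9.8% on MSC and retrieval recall by up to 40.8% on LME, reaching state-of-the-art performance while significantly reducing computational overhead.

Insight: Contributions include using a fine-tuned small language model for structured triplet extraction to cut operational cost, and an attention-aware importance scoring mechanism based on a T5 summarizer's cross-attention signals to select salient memories. Together they offer a reusable graph-enhanced memory management design for efficient long-term dialogue systems.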

Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly reducing computational overhead.

[16] EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge cs.CL

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

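TL;DR: This paper proposes EvoBrowseComp, an evolving benchmark for evaluating search agents' retrieval and reasoning over live web knowledge. It contains 400 English and 400 Chinese complex questions synthesized from the live web by a three-agent collaborative framework, designed to avoid test-set contamination and parametric memorization so that scores reflect genuine retrieval.

Motivation: Existing benchmarks (such as BrowseComp) rely on static knowledge and are vulnerable to test-set contamination and parametric memorization, so models can score highly through fact recall rather than genuine retrieval, and browsing and reasoning ability is not measured accurately.

Result: Experiments show that EvoBrowseComp is highly difficult and requires broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with evolving world knowledge and agent capabilities.

Insight: Contributions include a three-agent collaborative framework (QA synthesis, information filtering, high-level guidance) for fully automated synthesis of contamination-free questions, and live-web traversal to ensure temporal freshness, offering a scalable method for dynamic benchmarking.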

Abstract: Search Agents – large language models augmented with search tools – have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

[17] NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning cs.CL | cs.AI

Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu

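TL;DR: This paper proposes NTS-CoT, a framework for mitigating hallucinations in LLM-based news timeline summarization. It uses Chain-of-Thought reasoning in three core modules, Element-CoT, Date Selection, and Causal-CoT, to address unfaithful content in news summarization and information omission in date-event summarization, producing more accurate and complete timeline summaries.

Motivation: Rapid updates of online news make event developments hard to track, highlighting the need for timeline summarization. Hallucinations, where generated content deviates from the source news, remain a critical and understudied problem in LLM-based timeline summarization.

Result: Quantitative analysis on three TLS benchmarks and human evaluation show that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based timeline summarization.

Insight: The novelty lies in explicitly identifying two main types of hallucination in timeline summarization (unfaithful summaries and information omission) and designing a modular CoT-based framework that targets each. Applying CoT reasoning in a structured way to a complex multi-document summarization task, combined with temporal saliency and event causal reasoning, is a reusable idea.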

Abstract: The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing works. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT.

[18] SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection cs.CL

Fuqiang Niu, Bowen Zhang

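TL;DR: This paper proposes SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of semantic-pragmatic complexity that assesses how hard a target-text pair is for an LLM in stance detection. SICI predicts LLM accuracy better than surface proxies and shows that LLM errors change regime as complexity rises: low-complexity examples invite over-attribution, while high-complexity examples concentrate on "None" predictions. This phase-transition-like structure holds across models.

Details

Motivation: Prompt-based LLMs are increasingly used for stance detection, but existing remedies (clearer instructions, reasoning prompts, retrieval, or debate) often fail to repair hard examples. A diagnostic tool is therefore needed to quantify task complexity and characterize LLM failure modes.

Result: On SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability (α = 0.771). GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o all show a similar regime structure in their errors, with stronger models only moving the boundaries. A study of 15 interventions shows that prompting, retrieval, and debate typically shift predictions along the attribution-abstention axis without removing the high-complexity bottleneck.

Insight: The novelty is a quantifiable semantic-pragmatic complexity index that exposes structured patterns in LLM stance detection errors (distinct error mechanisms at low, intermediate, and high complexity) and shows that existing interventions struggle to break through the high-complexity bottleneck, offering a new view of the limits of LLM capability.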

Abstract: Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($α=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.

[19] Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization cs.CL | cs.LG

Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk

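TL;DR: This paper presents an unsupervised optimal transport (OT) method for detecting hallucinations in neural machine translation (NMT) and unfaithful content in abstractive summarization. By analyzing the geometric distance between each decoder layer's cross-attention distribution and a reference distribution, the study finds that the OT signal effectively detects NMT hallucinations caused by source disengagement but has limited effect on summarization, where the failure modes differ.

Details

Motivation: The aim is to extend unsupervised OT-based detection from NMT hallucinations to faithfulness in abstractive summarization, and to examine the role and limits of individual decoder layers in detection.

Result: On the Fairseq DE-EN model (N=3,414), Wass-to-Unif and Wass-to-Data are complementary detectors for different hallucination types, and the detection signal is concentrated in layers L1–L4. On AggreFact (N=1,116), the unsupervised OT detector reaches 57.2%/57.6% balanced accuracy on CNN/XSum, substantially below the supervised MiniCheck-Flan-T5-L (69.9%/74.3%).

Insight: The novelty lies in extending the OT analysis to all decoder layers, revealing what each layer contributes and where it falls short. The core insight is that OT detection applies only to failures caused by source disengagement (such as NMT hallucinations) and is fundamentally limited for unfaithful summaries that attend correctly but misstate content (downstream failures), which gives a principled boundary for a task-agnostic interpretability tool.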

Abstract: Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1–L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum, above chance but substantially below supervised MiniCheck-Flan-T5-L ($69.9\%$/$74.3\%$). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
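
As a rough illustration of a distance-to-uniform score on cross-attention, the following Python sketch averages an attention matrix into a distribution over source tokens and measures how far it is from uniform. It assumes a 0/1 ground cost between source positions, under which the Wasserstein-1 distance equals total variation; the abstract does not state the cost function, and the function names and attention values are hypothetical. How such a score is thresholded or combined with Wass-to-Data is not covered here.

```python
def source_attention_mass(cross_attention):
    """Average a cross-attention matrix (target steps x source tokens)
    into a single distribution over source tokens."""
    n_steps = len(cross_attention)
    n_src = len(cross_attention[0])
    return [sum(row[j] for row in cross_attention) / n_steps for j in range(n_src)]


def wass_to_unif(cross_attention):
    """Distance from the source attention mass to the uniform distribution.

    Assumption: 0/1 ground cost, so Wasserstein-1 reduces to total variation,
    i.e. half the L1 distance between the two distributions.
    """
    mass = source_attention_mass(cross_attention)
    uniform = 1.0 / len(mass)
    return 0.5 * sum(abs(p - uniform) for p in mass)


# Hypothetical attention maps: one spreads its mass evenly over the source,
# the other concentrates almost all mass on a single source token.
spread = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
concentrated = [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.9, 0.05, 0.05]]

print(wass_to_unif(spread))
print(wass_to_unif(concentrated))
```

The score depends only on where attention mass lands, not on what the model says about the attended tokens, which is consistent with the abstract's point that content misstatement is invisible to this kind of metric.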

[20] When Similar Means Different: Evaluating LLMs on Arabic–Hebrew Cognates cs.CL

Junhong Liang, Noor Abo Mokh, Bashar Alhafni

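TL;DR: This paper introduces SemCog Bench, a benchmark for evaluating large language models on Arabic-Hebrew cognate identification and semantic disambiguation. Models do well on true cognates, but performance drops markedly on false friends and modern loanwords, indicating over-reliance on surface-form similarity; sentence-level context yields only limited improvement.

Details

Motivation: Arabic and Hebrew, as closely related Semitic languages, share a large lexicon that includes true cognates, misleading false friends, and modern loanwords, which challenges the cross-lingual semantic understanding of large language models.

Result: Open-source and commercial LLMs evaluated on SemCog Bench achieve high accuracy on true cognates, but performance drops sharply on false friends and loanwords. Sentence-level context brings only modest gains and does not overcome misleading form-based signals.

Insight: The contribution is SemCog Bench, presented as the first benchmark targeting Arabic-Hebrew cognates, together with evidence of a fundamental limitation of current LLMs in resolving cross-lingual form-meaning conflicts: they rely too heavily on surface similarity and fall short on semantic reasoning.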

Abstract: Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic–Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form–meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

[21] Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation cs.CL

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig

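TL;DR: This paper presents a low-latency real-time audio game commentary system that uses LLM-based parallel text generation to produce spoken commentary directly from live gameplay video. By running text generation in parallel with speech playback and buffering multiple candidate utterances ahead of time, the system removes the long silences caused by conventional sequential pipelines.

Details

Motivation: Conventional end-to-end game commentary systems use a sequential pipeline (capture frames, generate text, synthesize speech), which leaves long, unnatural silences between utterances and hurts the user experience. This paper targets that latency bottleneck.

Result: Experiments on fast-paced game videos show that the parallel design reduces mean inter-utterance silence from 9.6 seconds to 0.3 seconds and improves similarity to professional speaking-silence timing patterns by over 40%. A user study with 120 experienced players confirms a significant improvement in perceived speaking rhythm.

Insight: The core innovation is parallelizing text generation with speech playback and adding a candidate-utterance buffer, which achieves near-real-time commentary generation. This offers an effective architectural pattern for low-latency real-time audio generation systems.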

Abstract: We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking–silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
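
The parallel design described in this abstract resembles a producer-consumer pattern with a bounded buffer. The Python sketch below is a minimal illustration under that assumption, not the authors' implementation: `generate_utterance` is a hypothetical stand-in for the LLM call, and appending to a list stands in for speech synthesis and playback.

```python
import queue
import threading


def generate_utterance(frame_id):
    """Hypothetical stand-in for an LLM call that describes a game frame."""
    return f"commentary for frame {frame_id}"


def producer(frames, buffer):
    """Generate utterances ahead of playback; blocks when the buffer is full."""
    for frame_id in frames:
        buffer.put(generate_utterance(frame_id))
    buffer.put(None)  # sentinel: no more utterances


def run_commentary(frames, buffer_size=3):
    """Play utterances from the buffer while generation runs in parallel."""
    buffer = queue.Queue(maxsize=buffer_size)
    worker = threading.Thread(target=producer, args=(frames, buffer))
    worker.start()
    played = []
    while True:
        utterance = buffer.get()  # already generated if the buffer is warm
        if utterance is None:
            break
        played.append(utterance)  # stand-in for speech synthesis + playback
    worker.join()
    return played


print(run_commentary([1, 2, 3]))
```

Because the playback loop only waits when the buffer is empty, the silence between utterances is bounded by how far generation falls behind playback rather than by the full generation time of each utterance.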

[22] From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent cs.CL

Haishuo Fang, Yue Feng, Iryna Gurevych

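TL;DR: This paper proposes ProReviewer, a proactive scientific peer review agent, to address the difficulty existing LLM-based automatic review methods have in generating in-depth reviews supported by concrete evidence. It models reviewing as a Markov decision process and maintains a structured review log that guides the agent to proactively investigate suspicious parts of a paper. Experiments show that ProReviewer significantly outperforms prompt-based and fine-tuned baselines across multiple quality dimensions.

Details

Motivation: Existing LLM-based automatic peer review methods lack the flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do, which makes it hard to produce in-depth, evidence-backed reviews.

Result: ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized with reinforcement learning, achieves the highest average score across five quality dimensions: a relative improvement of up to 39% over prompt-based methods that use much larger frontier LLMs, and of 16% over the strongest fine-tuned baseline. It also attains the highest win rate in human evaluation.

Insight: The core innovation is formalizing proactive investigation during review as a Markov decision process and introducing a structured review log as the agent's workspace for tracking evidence and intermediate findings. This provides a reusable framework for building more autonomous and more thorough AI review systems.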

Abstract: Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% in relative terms. It also attains the highest win rates against baselines in human evaluation.

[23] Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data cs.CL

Qixu Chen, Satoshi Nakamura

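TL;DR: This paper proposes using audio large language models (Audio-LLMs) to filter training data for speech-to-speech translation (S2ST). It adopts a two-stage strategy called Rank-to-Distill that needs no manual labels: pseudo-labels are generated first, then an audio LLM is trained to make keep/drop decisions directly from raw paired speech, improving data quality.

Details

Motivation: Large-scale mined corpora provide abundant training data for end-to-end S2ST but may contain noise, misalignment, and semantic errors. Filtering this noisy data is essential for robust speech translation performance.

Result: Experiments on CVSS-C and SpeechMatrix show consistent improvements over training on unfiltered data, with gains of up to +1.4 ASR-BLEU on end-to-end S2ST.

Insight: The contribution is a scalable two-stage Rank-to-Distill strategy that trains an audio LLM for data filtering without manual labels. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency to select high-quality speech-conditioned data.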

Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker first generates keep/drop pseudo-labels from noisy speech pairs; these pseudo-labels are then used to train an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

[24] ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages cs.CL | cs.AI

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

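TL;DR: This paper proposes ArogyaSutra, a multi-agent framework for multimodal medical reasoning in Indic languages, and builds ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset. Built on an actor-critic architecture, the framework integrates tool use with dual-memory mechanisms and aims to overcome the limits of existing English-centric MLLMs in multilingual, low-resource medical settings such as rural India.

Details

Motivation: Existing multimodal large language models perform well in general domains but remain limited in specialized domains such as healthcare, especially in multilingual and low-resource scenarios (for example rural India, where patients often pose complex queries in native languages alongside medical images). This limits equitable access to AI-driven healthcare assistance.

Result: Experiments show that the proposed dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, and ablations validate the contribution of each component.

Insight: The novelty is a large-scale multilingual multimodal medical dataset covering many body systems, imaging modalities, and clinical domains, together with a multi-agent reasoning framework that combines tool use, dual-memory mechanisms, and distillation from stored actor-critic simulation trajectories, designed specifically for medical use in low-resource languages.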

Abstract: Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

[25] LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories cs.CL | cs.AI | cs.LG | cs.MM | cs.RO

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li

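TL;DR: This paper proposes LabVLA, a vision-language-action (VLA) model for executing experimental protocols in scientific laboratories. To address the scarcity of laboratory data and the diversity of robot embodiments, the authors build RoboGenesis, a simulation data engine that generates structured demonstrations, and design a two-stage training recipe (FAST action token pretraining followed by flow matching post-training). On the LabUtopia benchmark, LabVLA achieves the highest average success rate in both in-distribution and out-of-distribution settings.

Details

Motivation: Current AI systems can help with reading literature, generating hypotheses, and planning protocols, but physically executing those protocols still requires a human operator. Existing VLA models are trained mostly on household and tabletop scenes and lack exposure to the instruments, transparent liquids, and fixed workflows of scientific laboratories, so models that adapt to laboratory environments and diverse robot embodiments are needed.

Result: On the LabUtopia benchmark, LabVLA's average success rate exceeds that of every evaluated baseline in both in-distribution and out-of-distribution settings.

Insight: The contribution targets the data and model-design bottlenecks of laboratory settings. RoboGenesis generates high-quality, structured training data, and the two-stage recipe (FAST pretraining makes the vision-language backbone action aware; flow matching post-training attaches an action expert) enables effective policy learning in complex, structured laboratory environments.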

Abstract: Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

[26] One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders cs.CL | cs.AI

Minghao Luo, Liang Chen

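TL;DR: This paper introduces FORGE, a benchmark for evaluating how vulnerable search-augmented large language models are to web content pollution when generating recommendations. All 12 commercial and open-weights models tested are susceptible to polluted content (such as fabricated reviews and promotional pages) and end up recommending fake products: a single polluted page yields fooled rates of up to 27%, and replacing the top three search results raises the rate to 73.8%.

Details

Motivation: As search-augmented LLMs increasingly provide everyday consumer recommendations by retrieving live web content, a new risk arises: a model may consume polluted web content (such as fake reviews and promotional pages crafted to mislead recommendations) and unwittingly promote fake products. The paper aims to quantify this risk.

Result: Experiments on FORGE (225 real-world products across 15 categories and 5 consumer scenarios) show that every tested model is vulnerable. Vulnerability varies substantially across categories and is most severe when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate the vulnerability; instead, it often generates spurious social proof to justify false recommendations.

Insight: The contribution is FORGE, a controlled and reproducible benchmark for systematically evaluating how sensitive generative recommenders are to web content pollution. Its core finding is a widespread and serious weakness of search-augmented LLMs in recommendation: even a small amount of polluted content can substantially mislead a model, and reasoning may aggravate rather than alleviate the problem, which challenges common assumptions about how to make models more reliable.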

Abstract: Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

[27] Operads for compositional reasoning in LLMs cs.CL | math.CT

Nathaniel Bottman, Kyle Richardson

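TL;DR: This paper proposes operads, a structure from mathematics, as a rigorous framework for question decomposition in LLMs. Question templates are modeled as operations in an operad, answer composition is treated as algebraic structure, and operadic consistency is introduced as a new measure of QA model reliability.

Details

Motivation: Question decomposition is widely used in LLM reasoning but lacks a rigorous mathematical foundation. A theoretical framework is needed to describe the canonical structure of decomposing questions and composing answers.

Result: In the companion paper, an evaluation across 12 LLMs and 4 multi-hop QA datasets finds that operadic consistency is strongly correlated with accuracy and outperforms temperature-based self-consistency baselines.

Insight: The novelty lies in bringing operad theory to LLM reasoning for the first time, providing a formal mathematical language for question decomposition. The proposed operadic consistency goes beyond standard self-consistency by assessing reliability through the partial collapses of a question decomposition tree, opening new directions for analyzing multi-step reasoning.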

Abstract: Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads, mathematical structures that model many-in, one-out operations and compositions thereof, as a natural framework for describing question decomposition. We define the questions operad $Q$, in which operations correspond to question templates and composition corresponds to substitution of sub-answers, and show how QA models can be interpreted as algebras over $Q$. Beyond reframing existing practice, this operadic perspective points toward new methods, in particular a notion of operadic consistency, which measures whether a QA model’s answers agree across the partial collapses of a question decomposition tree. Empirical evaluation of operadic consistency is reported in our companion paper (Bottman, Liu, and Richardson, 2026), which finds it strongly correlated with accuracy across twelve LLMs and four multi-hop QA datasets and outperforming standard temperature-based self-consistency baselines. We argue that operads are the natural mathematical home for question decomposition, and that invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi-step reasoning.

[28] Recursive Agent Harnesses cs.CL

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

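TL;DR: This paper proposes the Recursive Agent Harness (RAH), a pattern that combines recursive language models with coding agents: a parent agent generates and runs a script that spawns subagents in parallel to handle fine-grained workloads. RAH markedly improves long-context reasoning. On Oolong-Synthetic it raises the Codex baseline from 71.75% to 81.36%, and it reaches 89.77% with Claude Sonnet 4.5.

Details

Motivation: The aim is to combine the recursive calls of recursive language models (RLMs) with the parallel subagent spawning of production coding agents to tackle long-context reasoning, and to study the pattern of tool-equipped agent recursion.

Result: On Oolong-Synthetic (199 samples, context-length buckets up to 4M tokens), RAH with GPT-5 as the fixed backbone improves the Codex baseline from 71.75% to 81.36%, a gain attributed to the harness rather than the model. With Claude Sonnet 4.5 the same design reaches 89.77%.

Insight: The novelty is extending the recursive unit from a tool-free model call to a full agent harness with filesystem tools, code execution, and planning. This yields a code-first recursive design that could inform the construction of distributed AI systems.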

Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic’s dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.
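
The fan-out step described in this abstract, where a parent agent runs a script that spawns subagents in parallel and aggregates their results, can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the authors' harness: `run_subagent`, `split_context`, and `parent_agent` are hypothetical names, and the "subagent" merely counts a label in its chunk, whereas a real harness would run a full agent with tools and planning.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(chunk, label):
    """Hypothetical subagent: counts how many records in its chunk match `label`."""
    return sum(1 for record in chunk if record == label)


def split_context(records, chunk_size):
    """Split a long context into fixed-size chunks, one per subagent."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def parent_agent(records, label, chunk_size=2, max_workers=4):
    """Stand-in for the script a parent agent might generate and run:
    fan out over chunks in parallel, then aggregate the partial results."""
    chunks = split_context(records, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda chunk: run_subagent(chunk, label), chunks))
    return sum(partials)


records = ["spam", "ham", "spam", "spam", "ham"]
print(parent_agent(records, "spam"))
```

The design point is that each subagent sees only its own chunk, so no single call has to hold the full context, and the aggregation step is ordinary code rather than another model call.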

[29] HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents cs.CL

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang

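TL;DR: This paper proposes HyperTool, a unified executable MCP-style tool interface that addresses the execution-granularity mismatch in tool-augmented LLM agents. The model invokes tools through a code block, manipulates returned values, and passes intermediate results locally, which folds deterministic tool subroutines into a single outer call and relieves the model of managing low-level dataflow in its reasoning trace.

Details

Motivation: Tool-augmented LLM agents typically rely on step-wise atomic tool calls, which creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace.

Result: On the MCP-Universe benchmark, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 in average accuracy, a substantial improvement in multi-step tool use.

Insight: The novelty is raising the unit of tool execution from an atomic call to a programmable code block, which lets the model encapsulate deterministic tool subroutines and reduces low-level decision overhead in the reasoning trace. Viewed objectively, this interface design may improve both execution efficiency and reasoning on tool-composition tasks.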

Abstract: Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that HyperTool can substantially improve multi-step tool use.
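
To make the idea of folding several tool calls into one outer call concrete, here is a minimal Python sketch. It is an assumption-laden toy, not the HyperTool interface itself: the tools `search_flights` and `book_flight`, the `TOOLS` registry, the `hypertool` function, and the convention of returning a variable named `result` are all hypothetical, and the sketch omits the sandboxing a real deployment would need.

```python
def search_flights(city):
    """Hypothetical tool: returns candidate flights for a city."""
    return [{"id": "F1", "price": 320}, {"id": "F2", "price": 210}]


def book_flight(flight_id):
    """Hypothetical tool: books a flight and returns a confirmation."""
    return {"booked": flight_id}


TOOLS = {"search_flights": search_flights, "book_flight": book_flight}


def hypertool(code):
    """Run a model-written code block with the registered tools in scope.

    Only the value bound to `result` is returned to the model; intermediate
    values stay local to the block. No sandboxing is done in this sketch.
    """
    scope = dict(TOOLS)
    exec(code, scope)
    return scope.get("result")


# A code block the model might emit: search, pick the cheapest, book.
model_code = """
flights = search_flights("Lisbon")
cheapest = min(flights, key=lambda f: f["price"])
result = book_flight(cheapest["id"])
"""

print(hypertool(model_code))
```

With step-wise atomic calls, the search results and the choice of the cheapest flight would each pass through the model's context; here only the final confirmation does.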

[30] Operadic consistency: a label-free signal for compositional reasoning failures in LLMs cs.CL | cs.LG

Nathaniel Bottman, Yinhong Liu, Kyle Richardson

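TL;DR: This paper proposes operadic consistency (OC), a label-free signal for detecting failures of large language models on compositional reasoning tasks. The core idea is that a model's direct answer to a compositional query should agree with the answer it composes from a decomposition of that query. Experiments with 12 instruction-tuned LLMs on 4 multi-hop QA datasets show that OC is strongly correlated with accuracy and outperforms existing confidence baselines in selective prediction.

Details

Motivation: The goal is to detect compositional reasoning failures at inference time without ground-truth labels. Existing methods such as self-consistency and semantic entropy rely on within-question sampling and self-evaluation; this paper derives a complementary diagnostic signal from operad theory.

Result: On four multi-hop QA datasets (HotpotQA, DROP, MuSiQue, StrategyQA), OC is strongly correlated with model accuracy on every dataset (Pearson r ∈ [0.86, 0.94]) and is the only evaluated signal with r ≥ 0.85 on all four. In selective prediction at the equal-cost K = 3 budget, OC yields significant gains in both AUARC and AUROC over a tuned CoT-SC baseline.

Insight: The main novelty is bringing the formalism of operad theory to LLM confidence estimation, producing a failure-detection signal based on compositional consistency that needs no ground-truth labels. The signal does not rely on self-evaluation; it checks whether the model's direct answer agrees with its step-by-step composed answer to the same question, which offers a new perspective and a practical tool for assessing model reliability.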

Abstract: Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model’s direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model’s own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
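
The core check, comparing a direct answer with the answer obtained by composing a decomposition, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the decomposition is a linear chain with a `{prev}` slot, agreement is exact string match, and `toy_model` is a hypothetical lookup table whose direct answer is deliberately wrong. The paper's actual definition and scoring may differ.

```python
def toy_model(question):
    """Hypothetical QA model backed by a lookup table."""
    answers = {
        "Who directed Inception?": "Christopher Nolan",
        "Where was Christopher Nolan born?": "London",
        # Deliberately wrong direct answer, to trigger an inconsistency.
        "Where was the director of Inception born?": "Paris",
    }
    return answers.get(question, "unknown")


def composed_answer(model, steps):
    """Answer a decomposition step by step, substituting each sub-answer
    into the next template through its `{prev}` slot."""
    prev = None
    for template in steps:
        prev = model(template.format(prev=prev))
    return prev


def operadic_consistency(model, query, steps):
    """Return 1 if the direct answer matches the composed answer, else 0."""
    return int(model(query) == composed_answer(model, steps))


query = "Where was the director of Inception born?"
steps = ["Who directed Inception?", "Where was {prev} born?"]

print(composed_answer(toy_model, steps))
print(operadic_consistency(toy_model, query, steps))
```

No ground-truth label is consulted: the disagreement between the two routes to an answer is itself the warning signal.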

[31] Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning cs.CL | cs.AI

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya

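TL;DR: This paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that improves language models on complex tasks through reasoning by analogy. Gold-relevance distillation trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, so that it learns to exploit useful reasoning traces.

Details

Motivation: Conventional retrieval-augmented generation (RAG) based on lexical or semantic similarity is limited on complex reasoning tasks, because a semantically similar problem may need an entirely different solution strategy while a superficially different problem may share the same underlying reasoning pattern. A method that retrieves analogies by reasoning pattern is therefore needed.

Result: On challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning. For example, on AIME 2025 it improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and by 2.8 points for Qwen3-4B.

Insight: The core innovation is shifting the retrieval criterion from semantic similarity to expected reasoning benefit, which enables analogy retrieval by reasoning pattern. This gives the model complementary solution strategies and distinct reasoning scaffolds, an axis of improvement that is complementary and orthogonal to advances in reward design or training curricula.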

Abstract: Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively – suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

cs.CV

[32] Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM cs.CV

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette

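TL;DR: This paper proposes a portable, low-power fall prediction and detection system based on stereo vision and human pose estimation, deployed on the AMD Kria K26 edge computing module. The system captures data with an RGB-D camera and processes it through a three-stage pipeline of YOLOX, A2J, and a CNN, classifying fall activity in real time while preserving privacy by discarding RGB frames.

Details

Motivation: Falls can seriously injure elderly people. The goal is a non-intrusive, privacy-preserving, portable fall detection system that runs in real time to support health monitoring for the elderly.

Result: Under the stated evaluation criteria, the quantized models reach 74% for YOLOX (IoU >= 50%), 84.13% for A2J pose estimation (mAP, 10-cm rule), and 75.85% classification accuracy for the CNN. Throughput rises from 2.5 FPS single-threaded to 4.5 FPS multi-threaded.

Insight: The novelty is deploying a complete deep-learning fall detection pipeline (object detection, 3D pose estimation, activity classification) on a low-power edge device, with privacy preserved by discarding RGB frames and estimating pose from depth data alone. The design uses quantized models and parallelization to improve efficiency on the edge device.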

Abstract: Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%, respectively. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.

[33] HairPort: In-context 3D-aware Hair Import and Transfer for Images cs.CV | cs.GR

Alireza Heidari, Amirhossein Alimohammadi, Wallace Michel Pinto Lira, Adi Bar-Lev, Ali Mahdavi-Amiri

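TL;DR: This paper proposes HairPort, a 3D-aware hairstyle transfer framework that addresses the poor results of image-to-image hairstyle transfer under large pose and scale differences. It separates hair removal from transfer and enforces geometric consistency to achieve more accurate hairstyle transfer.

Details

Motivation: Existing hairstyle transfer methods are limited when the source and target images differ greatly in pose and scale, because missing hair content must be synthesized rather than directly transferred. This paper aims to achieve accurate hairstyle transfer across large pose and scale gaps.

Result: In qualitative evaluation, HairPort preserves identity and pose consistency better than existing methods. Quantitative evaluation also shows improved performance. The abstract does not name specific benchmarks but claims the method outperforms existing methods both qualitatively and quantitatively.

Insight: Contributions include the Bald Converter (LoRA-based in-context adaptation of FLUX.1 Kontext) for generating realistic bald images, and a 3D-aware transfer pipeline that first reconstructs and re-renders the reference hairstyle. The authors also build Baldy, a dataset of 6,000 image pairs, for training.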

Abstract: Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

[34] Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs cs.CV | cs.AIPDF

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

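TL;DR: This paper proposes a fine-grained preference optimization method for medical large vision-language models (LVLMs). It targets three key limitations of existing Direct Preference Optimization (DPO) methods in the medical domain: sequence-level reward signals that do not distinguish clinically critical information, static reference responses that cause distribution shift, and the lack of explicit visual grounding constraints. Combining a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective, the method forms a fine-grained, on-policy alignment framework that corrects clinical errors by minimally editing model outputs while preserving linguistic style.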

Details

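Motivation: Existing medical LVLMs suffer from factual inconsistencies, weak visual grounding, and misalignment with clinical feedback. Conventional DPO methods are limited in the medical domain by coarse-grained reward signals, distribution shift from static references, and the absence of explicit visual grounding constraints, so a finer-grained optimization method is needed.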

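Result: Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the method's effectiveness, indicating improved clinical accuracy and visual grounding.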

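Insight: The novelty lies in the bidirectional token-wise KL regularizer and the visual-contrastive grounding objective, which pairs clean and lesion-corrupted images to penalize responses that lack visual evidence. Preference pairs are constructed on-policy and at fine granularity, correcting only clinically erroneous spans, which yields more precise medical alignment.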

Abstract: Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

[35] Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning cs.CVPDF

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

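TL;DR: This paper proposes Dual-State Slot Attention (DSSA), a fully self-supervised framework for unsupervised video object-centric learning. By decoupling each object slot into a local state for per-frame appearance and an identity state for temporally stable object identity, it addresses the difficulty existing methods have in maintaining stable object identity under challenging conditions such as rapid motion and partial occlusion.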

Details

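Motivation: Existing slot-based methods typically encode an object's per-frame appearance and its cross-frame identity in a single slot vector. This puts reconstruction (which needs sensitivity to appearance changes) in conflict with temporal consistency (which needs invariance to them) and leads to slot swapping. In addition, the token renormalization in Slot Attention can amplify weakly attending slots, letting them absorb tokens from other objects and breaking slot-to-object correspondence.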

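Result: Experiments on the MOVi-C, MOVi-D, and YouTube-VIS benchmarks show that DSSA consistently improves segmentation quality and temporal consistency over prior methods, and also performs better on the downstream tasks of object recognition and video dynamics prediction.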

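Insight: The core idea is to decouple the object representation into a local state and an identity state, corresponding to appearance and identity, so that reconstruction and temporal consistency are each served by a separate representation. The proposed competition-modulated aggregation (CMA) down-weights updates from weakly matching slots, preventing them from absorbing tokens from other objects and stabilizing slot-to-object correspondence.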

Abstract: Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.

[36] ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation cs.CV | cs.LGPDF

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu

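TL;DR: This paper proposes Efficient Continual Alignment (ECA) for Open-ended Image-to-Text Generation (OpenITG), aimed at the forgetting that arises in incremental learning (IL) when the predominant category of visual data shifts over time. Using a Mixture of Query module, a Fisher-information-based dynamic expansion mechanism, and dictionary replay over an embedding dictionary, ECA incrementally adapts the alignment module of a pre-trained vision-language model without accessing raw data from previous tasks, preserving high-quality cross-modal representations.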

Details

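Motivation: In open-ended image-to-text generation, the predominant category of visual data in real-world settings shifts as environments evolve. Existing incremental learning methods struggle to adapt to new data while retaining past knowledge, which leads to catastrophic forgetting.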

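Result: On four newly constructed IL OpenITG benchmarks designed to better reflect real-world scenarios, ECA significantly mitigates catastrophic forgetting and improves incremental learning performance compared with baseline methods, which the summary describes as more accurate and contextually relevant generated text.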

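Insight: The contribution is the notion of "continual alignment" together with an exemplar-free incremental learning framework. Its core mechanisms are a Mixture of Query (MoQ) module for task-specific feature adaptation, Fisher-information-based dynamic structural expansion to minimize interference, and Dictionary Replay over an embedding dictionary to retain past knowledge, offering a new approach to continual learning for cross-modal models.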

Abstract: Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA’s performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.

[37] SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images cs.CVPDF

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy

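TL;DR: This paper introduces SalArt-VQA, a fine-grained benchmark for diagnosing whether vision-language models (VLMs) understand salient artifacts in AI-generated images. It contains 950 images and 3,681 human-authored multiple-choice questions in four aligned question types that evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification. Evaluation of 20 VLMs reveals failure modes that image-level detection accuracy hides, showing that high detection accuracy does not imply artifact understanding grounded in visual evidence.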

Details

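Motivation: VLMs are increasingly used to detect visible artifacts in AI-generated images, but their ability to analyze those artifacts remains unclear. A correct image-level decision can hide important failures: the model may rely on the wrong visual cue, select the wrong region, or describe a defect that the image does not support.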

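Result: Across 20 VLMs, the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. The benchmark reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts.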

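Insight: The contribution is a fine-grained diagnostic benchmark, SalArt-VQA, that uses multiple question types to evaluate how deeply VLMs understand artifacts rather than only image-level classification. This exposes failure modes hidden behind high detection accuracy, underscores the importance of grounding artifact analysis in local visual evidence, and adds a new dimension to reliable VLM evaluation.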

Abstract: Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.
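The gap between detection recall and the all-four-questions rate reported above can be illustrated with a small scoring sketch. The record layout, the question-type keys, and the function names below are assumptions made for illustration; they are not taken from the SalArt-VQA release.

```python
# Hypothetical scoring sketch: one dict per artifact image, mapping each of
# the four question types to whether the model answered it correctly.
QUESTION_TYPES = ("presence", "semantic", "spatial", "defect")


def detection_recall(results):
    """Fraction of artifact images whose presence question was answered correctly."""
    return sum(r["presence"] for r in results) / len(results)


def all_four_rate(results):
    """Fraction of artifact images where all four question types were answered correctly."""
    hits = sum(all(r[q] for q in QUESTION_TYPES) for r in results)
    return hits / len(results)


# Toy data: every artifact is detected, but only half are fully understood.
toy = [
    {"presence": True, "semantic": True, "spatial": True, "defect": True},
    {"presence": True, "semantic": False, "spatial": True, "defect": True},
    {"presence": True, "semantic": True, "spatial": True, "defect": True},
    {"presence": True, "semantic": True, "spatial": False, "defect": False},
]
```

On the toy data, recall is perfect while the stricter all-four rate is one half, which is the kind of divergence the benchmark is designed to expose.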

[38] GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models cs.CVPDF

Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve

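TL;DR: This paper proposes GRIP, a feedback-guided prompt retrieval framework for improving multimodal in-context learning (M-ICL) in large multimodal models (LMMs). By using LMM feedback for contrastive training, it goes beyond conventional feature-similarity retrieval and identifies in-context examples that actually improve model predictions.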

Details

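Motivation: Existing M-ICL methods usually select in-context examples by feature-space similarity, but the authors' systematic analysis finds that visually similar samples are not always the ones that most improve in-context learning performance, so a smarter retrieval mechanism is needed.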

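Result: On three multimodal tasks (classification, captioning, and visual question answering (VQA)), GRIP consistently improves over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. In addition, a retriever trained with feedback from one open LMM transfers to other models, including closed-source GPT-4o and Gemini, without retraining.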

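Insight: The core idea is to guide retriever training with the model's own prediction feedback rather than input features alone, using contrastive learning to separate beneficial from detrimental in-context examples and so align the retrieval objective with final task performance. This feedback-driven, transferable retrieval framework offers a new route to efficient M-ICL deployment.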

Abstract: In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.

[39] VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving cs.CVPDF

Thach Nguyen, Danhua Guo, Tom Lampo, Fei Wu, Burhan Yaman

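TL;DR: This paper introduces VLADriveBench, a framework for evaluating the relationship between chain-of-thought (CoT) reasoning and driving actions in vision-language-action (VLA) models for autonomous driving. It combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to give complementary views of the CoT-action relationship. Evaluating three models across two architectures, the study finds that the model with the highest observational alignment score can have an epiphenomenal CoT, while a lower-scoring model can have a strongly causal CoT.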

Details

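Motivation: Existing benchmarks evaluate only driving trajectory quality and do not assess whether the CoT reasoning generated by a VLA model is relevant to, consistent with, or causally connected to the driving action, so a new evaluation framework is needed to fill this gap.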

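Result: Evaluating three models across two architectures on VLADriveBench shows that ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.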

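Insight: The contribution is a benchmark framework dedicated to evaluating the CoT-action relationship in VLA models, combining observational and interventional analysis, which the summary describes as the first of its kind. The analysis suggests that observational metrics such as alignment scores may be insufficient on their own to judge the causal role of CoT, and that an intervention protocol is needed to reveal its actual influence, giving a more complete view for future VLA evaluation.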

Abstract: Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

[40] DIMOS: Disentangling Instance-level Moving Object Segmentation cs.CV | cs.AIPDF

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie

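TL;DR: This paper proposes DIMOS, a dual-disentangling feature extraction framework for moving instance segmentation (MIS). It targets the limited performance of existing multimodal methods, particularly for small instances under challenging conditions such as fast motion and low light, where event-camera features are sparse and entangle appearance with motion. By disentangling appearance and motion information within both the image and event modalities and adding multi-granularity cross-modal alignment, the method fuses features more effectively and improves segmentation performance.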

Details

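Motivation: Moving instance segmentation has broad applications in areas such as traffic surveillance and autonomous driving, but existing multimodal methods that fuse event-camera and image features still struggle to segment small moving instances, because event features are sparse and entangle appearance with motion, which limits cross-modal fusion.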

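Result: Experiments show that the method achieves state-of-the-art performance in multimodal moving instance segmentation, especially for small instances under challenging conditions such as fast motion and low light.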

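Insight: The core contribution is a dual-disentangling feature extraction framework that separates appearance and motion information within both the image and event modalities, together with a multi-granularity cross-modal alignment mechanism that increases feature density and improves fusion, making better use of spatial and temporal detail. More broadly, disentangling information within each modality before aligning across modalities offers a new way to handle feature entanglement and sparsity in multimodal perception.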

Abstract: Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

[41] Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning cs.CV | cs.AIPDF

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

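TL;DR: This paper presents PERIA, a tool-augmented visual agent for spatial reasoning tasks. It introduces vision perception and interaction tools to acquire fine-grained spatial evidence and uses a unified training recipe to optimize multi-tool behavior. Experiments show that PERIA-8B clearly outperforms models of similar size on multiple benchmarks and is comparable to much larger models.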

Details

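Motivation: Existing vision-language models are limited on spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction; relying solely on implicit representations from vision encoders is insufficient for recovering fine-grained spatial evidence.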

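Result: On 13 benchmarks from 8 datasets, PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, outperforms previous state-of-the-art baselines of similar size by 7.0%-14.8%, and performs comparably to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5.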

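Insight: The contribution lies in a family of lightweight vision perception and interaction tools and a unified training recipe that combines supervised tool-use trajectory synthesis, composite rewards, and OR-GIGPO policy optimization, strengthening the model's active evidence acquisition and multi-step interaction in spatial reasoning.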

Abstract: While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

[42] Language-Guided Abstraction for Visual Reasoning cs.CVPDF

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang

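TL;DR: This paper proposes L-VARC, a framework that strengthens visual reasoning with a language-guided LUPI (Learning Using Privileged Information) branch, addressing the large parameter counts of pure-language methods and the difficulty vision-only methods have in capturing high-level semantics on ARC tasks. The framework uses DeepSeek-V3 to semantically compress language descriptions and a cross-attention projector to align visual and semantic features. The LUPI branch is discarded after training, leaving a lightweight model with only 18 million parameters.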

Details

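Motivation: Mainstream ARC methods either depend on LLMs with very large parameter counts or are vision-only and tend to overfit pixel-level patterns while missing high-level semantics, so a lightweight, efficient method that uses linguistic priors to guide visual reasoning is needed.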

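Result: Extensive experiments show that L-VARC effectively uses linguistic priors to improve visual reasoning and outperforms existing state-of-the-art (SOTA) methods on the ARC benchmark. Ablation studies further confirm the contribution of the two new designs, the Semantic Compression Module and the Cross-Attention Projector.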

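Insight: The contribution is a language-guided LUPI training branch that treats language descriptions as privileged information to improve the visual model's semantic understanding, and that is discarded at inference to keep the model lightweight. Semantic compression fits the raw descriptions to standard text encoders, and a cross-modal alignment mechanism connects the two modalities, offering a new approach to visual reasoning with linguistic priors.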

Abstract: The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methodology is either pure language or vision-only (i.e., VARC). The former depends heavily on LLMs, consuming billions of parameters. The latter often struggles to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured, fitting within the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs towards the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu

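TL;DR: This paper addresses Modal Isolation in multimodal interleaved thinking with a two-stage training framework called MoTiF. The framework decomposes each reasoning cycle into atomic operations, defines a modality transition loss that quantifies cross-modal hallucination and visual utilization deficit, and directly optimizes modality transition fidelity using Reflective SFT and flow-based reinforcement learning.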

Details

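Motivation: Unified multimodal models exhibit Modal Isolation on complex long-chain interleaved thinking tasks: generated images diverge from the textual context while subsequent text ignores the visual evidence, so the two modalities alternate without genuinely informing each other.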

Result: Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy.

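Insight: The contribution is to treat modality transitions as a supervision target in their own right, with a method for quantifying transition loss and a two-stage training framework. The core insight is that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.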

Abstract: Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

[44] Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension cs.CV | cs.CLPDF

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long

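TL;DR: This paper proposes AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method for improving vision-language model (VLM) performance on Visual Text Comprehension (VTC) tasks. It analyzes the VLM's middle-to-late layer attention to identify key visual regions, maps them back to text spans, and re-renders the page with those spans enlarged, helping the model make better use of visual information when answering questions.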

Details

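Motivation: Existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. The study finds that VLMs exhibit "localization without utilization": evidence-localizing attention emerges sharply in the middle-to-late layers but is largely decoupled from answer correctness.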

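Result: Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR consistently improves off-the-shelf VLMs as a plug-and-play enhancement, composes with VLM post-training for further gains, and remains robust under both visual- and text-side input degradation.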

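Insight: The contribution is to use the VLM's own attention, specifically in the middle-to-late layers, to guide adaptive rendering, enlarging key text regions to close the gap between what the model localizes and what it uses. This gives a training-free, model-agnostic mechanistic intervention that can improve VLM performance on long-text comprehension tasks.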

Abstract: Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM’s own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
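The patch-to-span step described in the abstract (pick the top-K attended patches, then map them back to words) might look roughly like the sketch below. It is an illustration only: the row-major square patch grid, the `(x0, y0, x1, y1)` word boxes, and every function name are assumptions, and the abstract does not specify how AGAR actually aggregates attention or performs the mapping.

```python
# Hypothetical sketch: per-patch attention scores -> words to enlarge.

def top_k_patches(attention, k):
    """Indices of the k highest-scoring patches (ties keep the lower index first)."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return order[:k]


def patch_box(index, grid_width, patch_size):
    """Pixel box (x0, y0, x1, y1) of a patch in a row-major grid of square patches."""
    row, col = divmod(index, grid_width)
    x0, y0 = col * patch_size, row * patch_size
    return (x0, y0, x0 + patch_size, y0 + patch_size)


def overlaps(a, b):
    """True when two axis-aligned boxes share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_to_enlarge(attention, word_boxes, grid_width, patch_size, k):
    """Words whose rendered box overlaps any of the top-k attended patches."""
    hot = [patch_box(i, grid_width, patch_size) for i in top_k_patches(attention, k)]
    return [word for word, box in word_boxes if any(overlaps(box, p) for p in hot)]


# Toy page: a 2x2 grid of 10-pixel patches and three rendered words.
attention = [0.10, 0.70, 0.05, 0.15]
word_boxes = [("alpha", (0, 0, 9, 8)), ("beta", (12, 1, 19, 8)), ("gamma", (2, 12, 8, 18))]
selected = words_to_enlarge(attention, word_boxes, grid_width=2, patch_size=10, k=1)
```

A renderer would then redraw the page with the selected words at a larger font size before querying the model again; that step depends on the rendering stack and is left out here.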

[45] Multi-Label Test-Time Adaptation with Bayesian Conditional Priors cs.CV | cs.LGPDF

Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang

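TL;DR: This paper proposes Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that addresses the brittleness of frozen vision-language models on multi-label recognition under distribution shift. It selects a high-confidence anchor label and, using label co-occurrence statistics estimated online, applies a closed-form Bayesian refinement in logit space that explicitly promotes compatible labels and suppresses incompatible ones, injecting label dependency without tuning the backbone.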

Details

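Motivation: When frozen vision-language models perform multi-label recognition under distribution shift, standard zero-shot inference scores labels independently and ignores co-occurrence structure, producing incoherent label sets in which dominant concepts suppress weaker but compatible labels.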

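Result: Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, for example improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.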

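Insight: The contribution is to view zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and to attribute shift-induced errors mainly to a mismatched label prior. Bayesian refinement with anchor-conditioned priors estimated online is closed-form, admits a pointwise mutual information (PMI) interpretation, and models label dependency explicitly at negligible overhead.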

Abstract: Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
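One way to picture an anchor-conditioned, PMI-style refinement in logit space is the sketch below. It is not the paper's update rule, which the abstract does not spell out: the additive form `logit_j + strength * PMI(j, anchor)`, the add-one smoothing, and the names `CooccurrenceStats`, `refine`, and `strength` are all assumptions chosen for illustration.

```python
import math


class CooccurrenceStats:
    """Online first- and second-order label counts from an unlabeled test stream."""

    def __init__(self, num_labels, smoothing=1.0):
        self.num_images = 0
        self.single = [0.0] * num_labels
        self.pair = [[0.0] * num_labels for _ in range(num_labels)]
        self.smoothing = smoothing  # assumed add-one style smoothing

    def update(self, predicted_labels):
        """Record one image's predicted-positive label indices."""
        labels = sorted(set(predicted_labels))
        self.num_images += 1
        for i in labels:
            self.single[i] += 1.0
            for j in labels:
                if i != j:
                    self.pair[i][j] += 1.0

    def pmi(self, label, anchor):
        """log P(label | anchor) - log P(label), with smoothed estimates."""
        s = self.smoothing
        p_label = (self.single[label] + s) / (self.num_images + 2.0 * s)
        p_given = (self.pair[anchor][label] + s) / (self.single[anchor] + 2.0 * s)
        return math.log(p_given) - math.log(p_label)


def refine(logits, stats, strength=1.0):
    """Shift every non-anchor logit by strength * PMI(label, anchor)."""
    anchor = max(range(len(logits)), key=lambda i: logits[i])
    refined = list(logits)
    for j in range(len(logits)):
        if j != anchor:
            refined[j] += strength * stats.pmi(j, anchor)
    return anchor, refined


# Toy stream: labels 0 and 1 always co-occur; label 2 appears alone.
stats = CooccurrenceStats(num_labels=3)
for _ in range(8):
    stats.update([0, 1])
    stats.update([2])
anchor, refined = refine([3.0, 0.0, 0.0], stats)
```

With label 0 as the anchor, the compatible label 1 is pushed up and the incompatible label 2 is pushed down, while the anchor's own logit is left unchanged.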

[46] A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis cs.CV | cs.AIPDF

Manex Atxa, Bruno Simoes, Julen Balzategui

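TL;DR: This paper presents a method for real-time personalized ergonomic pose analysis based on three-dimensional volumetric video data. By analyzing 3D point clouds, it overcomes the fixed-viewpoint and occlusion limitations of conventional cameras and supports real-time pose inference from multiple angles. The system combines current 3D data technologies with traditional 2D pose estimation algorithms, offering a scalable, practical approach to workplace safety and health monitoring.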

Details

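Motivation: Conventional cameras yield incomplete data for postural evaluation because of fixed viewpoints and occlusions; this work addresses that limitation and the growing need for real-time safety and health monitoring in the workplace.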

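Result: In a case study on data from RGB-D cameras capturing load-lifting tasks, a personalized deep learning classifier was trained and then performed real-time inference on new streaming data.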

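Insight: The contribution lies in multi-angle analysis of 3D point clouds to overcome viewpoint limitations, and in a hybrid interactive learning framework that trains a personalized classifier on data manually labeled by the user, combining current 3D technologies with traditional 2D algorithms for real-time ergonomic evaluation.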

Abstract: This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint, thereby restricting the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue

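TL;DR: This paper proposes a multi-modal framework for micro-gesture recognition that integrates skeleton keypoints, 3D heatmap volumes, and high-resolution RGB visual features. It combines saliency-guided feature extraction, smoothed class weighting with an Orthogonal Semantic Embedding Loss for the long-tailed distribution, and a cross-modal pseudo-labeling strategy for cross-subject generalization to recognize micro-gestures in untrimmed videos.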

Details

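Motivation: Micro-gesture recognition is challenging because of an extremely low signal-to-noise ratio, a severely long-tailed class distribution, and the domain shift inherent in cross-subject evaluation.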

Result: On Track 1 of the 4th MiGA-IJCAI Challenge, the framework achieves an F1-score of 68.13% and places 4th.

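Insight: Contributions include a Cross-Modal Pseudo-Labeling strategy for cross-subject unsupervised domain adaptation, and a mechanism that pairs smoothed weighting with an Orthogonal Semantic Embedding Loss to protect tail classes. The multi-modal fusion and the losses tailored to specific challenges such as long tails and domain shift may be useful to related work.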

Abstract: Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing the 4th place.
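Two of the ingredients named in the abstract, square-root smoothed class weighting and temperature-scaled soft voting, are standard enough to sketch generically. The sketch makes no claim about the authors' exact formulas: the mean-one normalisation, the temperature value, and the function names are assumptions.

```python
import math


def sqrt_class_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled to mean 1 (assumed normalisation)."""
    raw = [1.0 / math.sqrt(c) for c in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a temperature above 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits, temperature=2.0):
    """Average the temperature-scaled class probabilities across modalities."""
    probs = [softmax(logits, temperature) for logits in per_modality_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


# A head class with 100 samples and a tail class with 4 samples.
weights = sqrt_class_weights([100, 4])
# Two modalities that disagree about a two-class prediction.
fused = soft_vote([[4.0, 0.0], [0.0, 1.0]])
```

Square-root smoothing gives the tail class 5 times the head-class weight here, compared with 25 times under plain inverse-frequency weighting, which is the "gentle" behaviour the abstract refers to. Raising the temperature before averaging keeps one overconfident modality from dominating the fused prediction.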

[48] Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video cs.CVPDF

Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari

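TL;DR: This paper proposes BabyMind, which introduces an object-first inductive bias for learning grounded language from infant-view video. It extracts candidate object embeddings through a mask-based region interface, links them across a short temporal window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective.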

Details

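Motivation: Learning grounded word meaning from natural infant-view experience involves two ambiguities: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors.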

Result: On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks.

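Insight: The contribution is an object-first inductive bias: an offline mask-based region interface extracts object embeddings, cross-frame tracking builds object files, and a prototype-space multiple-instance contrastive objective aligns utterances with bags of object files. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation, which helps contrastive learning cope with sparse, noisy supervision.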

Abstract: Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.

[49] SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking cs.CVPDF

Alexander Holmberg

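TL;DR: This paper proposes selective mask propagation, a multi-object tracking algorithm that dispatches between a lightweight base tracker and a computationally expensive video object segmentation (VOS) model. The VOS model is invoked for identity correction only on windows of hard frames where assignment uncertainty is high, which preserves tracking accuracy while substantially reducing compute. The method is training-free, treats the base tracker and the VOS model as black boxes, and is validated on the DanceTrack and SportsMOT benchmarks.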

Details

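Motivation: Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight tracker, but a small fraction are intrinsically hard. VOS models preserve identity better through hard frames, but their compute and memory costs are much higher. The goal is to invoke the VOS model only when necessary, balancing accuracy and efficiency.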

Result: 在DanceTrack数据集上，该方法提升了三种不同基础跟踪器的性能。在SportsMOT数据集上，结合全局轨迹关联的SAM3-Deep-EIoU版本取得了86.8 HOTA的分数，达到了该基准的state-of-the-art（SOTA）水平。

Insight: 核心创新点在于提出了一个基于分配不确定性信号触发的选择性调度机制，实现了对困难帧的按需、低成本处理。该方法框架是训练无关且模块化的，允许直接替换更强大的VOS模型来持续提升性能，为构建高效、鲁棒的多目标跟踪系统提供了一种通用思路。

Abstract: Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker’s output is modified only when the VOS model makes a confident prediction that contradicts the base tracker’s identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
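The dispatch rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `reconcile` and `track_frame`, the per-frame (rather than per-window) trigger, and both threshold values are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical types: detection index -> track identity,
# and detection index -> (identity proposed by the VOS model, confidence).
BaseIds = Dict[int, int]
VosPreds = Dict[int, Tuple[int, float]]


def reconcile(base_ids: BaseIds, vos_preds: VosPreds,
              conf_threshold: float = 0.8) -> BaseIds:
    """Override the base identity only when the VOS prediction is
    confident AND contradicts the base tracker; otherwise keep the base."""
    final = dict(base_ids)  # never mutate the caller's assignment
    for det, (vos_id, conf) in vos_preds.items():
        if det in final and conf >= conf_threshold and vos_id != final[det]:
            final[det] = vos_id
    return final


def track_frame(base_ids: BaseIds, uncertainty: float,
                run_vos: Callable[[], VosPreds],
                trigger: float = 0.5,
                conf_threshold: float = 0.8) -> BaseIds:
    """Call the expensive VOS model only when the (assumed scalar)
    assignment-uncertainty signal reaches the trigger threshold."""
    if uncertainty < trigger:
        return dict(base_ids)  # easy frame: base tracker output is kept
    return reconcile(base_ids, run_vos(), conf_threshold)
```

Passing the VOS model as a callable keeps it a black box and guarantees it is never evaluated on frames where the trigger does not fire.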

[50] TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment cs.CV | cs.AIPDF

Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao

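TL;DR: This paper proposes TetherCache, a training-free, plug-and-play cache management strategy that addresses quality degradation and temporal drift in autoregressive video diffusion models during long-video generation, both of which stem from the limited KV cache and distribution shift in the historical context. GRAB selects informative yet diverse historical frames, and TAME aligns the statistics of recalled memory, stabilizing generation of videos several minutes long.

Details

Motivation: Autoregressive video diffusion models struggle to scale to minute-level generation: a limited KV-cache budget cannot retain the full history, and repeatedly conditioning on self-generated frames accumulates context distribution shift, causing visual artifacts, quality degradation, and temporal drift.

Result: On VBench-Long at the 30s, 60s, and 240s settings, TetherCache consistently improves long-video generation quality. For 240s generation in particular, it substantially improves the overall and semantic scores and reduces quality drift from 7.84 to 1.33.

Insight: The cache is partitioned into sink, memory, and recent regions, with two complementary mechanisms: GRAB (Gated Recall with Attention-Diversity Balancing) and TAME (Trusted Alignment via Memory Editing). Together they balance information retention and diversity under a fixed cache budget and use lightweight edits to reduce pollution from drifted features, improving long-video stability without additional training.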

Abstract: Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
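One plausible reading of "aligning their statistics to a trusted context distribution" is per-channel mean and standard-deviation matching with a blending factor. The sketch below illustrates that reading only; the abstract does not give TAME's exact formula, and the function name `align_statistics`, the `strength` parameter, and the choice of first- and second-order statistics are assumptions.

```python
from statistics import fmean, pstdev
from typing import List

Tokens = List[List[float]]  # one feature vector per token


def align_statistics(recalled: Tokens, trusted: Tokens,
                     strength: float = 0.5, eps: float = 1e-6) -> Tokens:
    """Shift each channel of the recalled tokens toward the mean and
    standard deviation of the trusted tokens.

    strength = 0 leaves the tokens untouched; strength = 1 matches the
    trusted statistics fully. Intermediate values give a light edit.
    """
    dim = len(recalled[0])
    out = [list(tok) for tok in recalled]
    for c in range(dim):
        r_col = [tok[c] for tok in recalled]
        t_col = [tok[c] for tok in trusted]
        r_mu, r_sd = fmean(r_col), pstdev(r_col)
        t_mu, t_sd = fmean(t_col), pstdev(t_col)
        for i, value in enumerate(r_col):
            # Standardize with recalled stats, re-scale with trusted stats.
            aligned = (value - r_mu) / (r_sd + eps) * t_sd + t_mu
            out[i][c] = (1.0 - strength) * value + strength * aligned
    return out
```

Blending between the original and the fully aligned value is one way to keep the edit light, so recalled content is nudged toward the trusted distribution rather than overwritten.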

[51] SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing cs.CV | cs.GR | cs.MMPDF

Xiangyu Lyu, Dan Lei

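TL;DR: SeamEdit is a training-free, model-agnostic black-box pipeline for semantic region editing of large images. A five-stage post-hoc pipeline (overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion) mitigates the semantic deformation, canvas-level alignment drift, and visible seam artifacts that arise when closed-source models are applied directly to tiled editing, so that edits keep high generative quality while blending naturally with surrounding content.

Details

Motivation: Semantic region editing of large images must deliver both high generative quality and natural integration with surrounding content. Existing methods either rely on white-box models or apply closed-source models (such as VLMs with inpainting capability) directly to tiled editing, which introduces failure modes such as semantic deformation, alignment drift, and visible seams.

Result: The proposed method reduces seam visibility and supports semantic modification of arbitrary tile regions. The abstract reports no specific quantitative results or benchmark comparisons.

Insight: The novelty is a fully post-hoc, model-agnostic black-box pipeline that treats any VLM with inpainting capability as a black-box oracle. Multi-stage correction and fusion, in particular seam-risk-based multi-candidate ranking and dynamic-programming curved seam fusion, systematically mitigate the typical problems of tiled editing without training or modifying the underlying generative model.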

Abstract: Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.
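The final stage, dynamic-programming curved seam fusion, is not specified in detail in the abstract. As a rough illustration of the general technique, the sketch below finds a minimum-cost top-to-bottom seam through a cost grid over the overlap between two tiles, in the style of classic seam carving. The cost definition, the one-column-per-row movement constraint, and the name `min_cost_seam` are assumptions for the example.

```python
from typing import List, Tuple


def min_cost_seam(cost: List[List[float]]) -> Tuple[List[int], float]:
    """Return (seam, total_cost), where seam[r] is the column chosen in
    row r and consecutive columns differ by at most one."""
    rows, cols = len(cost), len(cost[0])
    acc = [list(cost[0])]          # acc[r][c]: cheapest seam ending at (r, c)
    back: List[List[int]] = []     # back[r-1][c]: predecessor column in row r-1
    for r in range(1, rows):
        row_acc, row_back = [], []
        for c in range(cols):
            candidates = [(acc[r - 1][k], k)
                          for k in (c - 1, c, c + 1) if 0 <= k < cols]
            best_val, best_k = min(candidates)
            row_acc.append(cost[r][c] + best_val)
            row_back.append(best_k)
        acc.append(row_acc)
        back.append(row_back)
    # Start from the cheapest endpoint in the last row and walk backwards.
    c = min(range(cols), key=lambda k: acc[-1][k])
    total = acc[-1][c]
    seam = [c]
    for r in range(rows - 1, 0, -1):
        c = back[r - 1][c]
        seam.append(c)
    seam.reverse()
    return seam, total
```

In a blending step, pixels on one side of the seam would come from the edited tile and pixels on the other side from the surrounding canvas, so the cut follows low-difference regions instead of a straight tile edge.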

[52] LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck cs.CVPDF

Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin

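TL;DR: This paper proposes LaME (Latent Reasoning Multimodal Embedding), a multimodal embedding method that reasons in latent space via an information bottleneck. It uses a fixed-capacity set of learnable reasoning tokens and completes reasoning in a single forward pass, avoiding the high compute cost and annotation dependence of explicit chain-of-thought reasoning, and delivers 60x faster inference while remaining competitive.

Details

Motivation: Existing reasoning-driven multimodal embedding methods based on Chain-of-Thought (CoT) have two core limitations: autoregressive reasoning is computationally expensive and unsuitable for low-latency retrieval, and embedding performance depends heavily on CoT annotation quality, which makes large-scale training unreliable. The paper asks whether textual CoT is the best form of reasoning for multimodal embedding and whether effective embedding reasoning can be done in latent space.

Result: On MMEB-v2 and MRMR, LaME achieves competitive performance, surpassing some explicit CoT-based models. Inference is 60x faster than explicit CoT methods and 2x faster than latent baselines, with throughput comparable to discriminative embedding models.

Insight: Embedding-oriented latent reasoning is formulated as a weakly supervised information bottleneck, with a fixed number of learnable reasoning tokens acting as the bottleneck and all reasoning completed in a single forward pass. The contrastive and autoregressive objectives are structurally decoupled, and two-stage training ensures stable convergence, removing the dependence on high-quality CoT annotations.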

Abstract: Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

[53] Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison cs.CVPDF

Yupeng Cai, Jia Wei, Jianlong Zhou

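TL;DR: This paper proposes HTSCGAN, a multi-modal MRI brain image translation model that integrates structural information from tumor regions to improve the quality and clinical applicability of translated images. Patch Contrast Modules (PCMs) with different patch sizes capture hierarchical tumor structure, while a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) enforce tumor-structure consistency between generated and real images through a classification loss and a perceptual loss. Experiments on BraTS2020 and BraTS2021 show strong performance on both image translation and downstream segmentation.

Details

Motivation: Existing brain image translation methods ignore the structural information of different tumor regions, which limits the quality and clinical usefulness of translated images; a method that integrates tumor structure is needed to improve translation fidelity.

Result: On BraTS2020 and BraTS2021, HTSCGAN shows strong performance on both image translation and downstream segmentation tasks.

Insight: The novelty is hierarchical tumor structure comparison: multi-scale PCMs together with the pretrained PC and SAE components explicitly inject tumor structure information into the generative adversarial network, improving fidelity and clinical value.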

Abstract: Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structural information of different tumor regions, which could assist translation models in enhancing the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, which is a unified multi-modal brain image translation generative adversarial model integrating the structural information within tumor regions with the aim of improving the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed to ensure that the generated image contains the same tumor region structure as the ground truth image, via patch classification loss and tumor perceptual loss, respectively. The experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.

[54] PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks cs.CVPDF

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng

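TL;DR: This paper presents PP-OCRv6, a lightweight OCR system that uses architectural innovation and data-centric optimization to address the hallucination, imprecise localization, and high computational cost of vision-language models in dedicated OCR scenarios. The new architecture is built on a unified MetaFormer-style building block with structural reparameterization and comes in three tiers (medium, small, tiny) covering deployment from server to edge.

Details

Motivation: Although vision-language models perform well on general vision-language tasks, in dedicated OCR scenarios they suffer from hallucination, imprecise localization, and prohibitive computational cost, motivating a more efficient and accurate lightweight OCR system.

Result: On the authors' in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, improving on PP-OCRv5_server by +5.1% and +4.6% respectively, and surpasses large vision-language models such as Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier runs 3.9x faster than PP-OCRv5_mobile on an Intel Xeon CPU at comparable accuracy.

Insight: The backbone, detection neck, and recognition neck are redesigned around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing, and task-specific stride configurations support both detection and recognition. The design combines high accuracy with a small footprint for dedicated OCR tasks.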

Abstract: Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

[55] Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback cs.CV | cs.AIPDF

Animesh Tripathy, Aswanth Krishnan

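TL;DR: This paper proposes Iterative Visual Thinking (IVT), a framework addressing the inability of vision-language models (VLMs) to observe and correct their own predictions in spatial grounding. The model predicts a bounding box, observes the rendered prediction, and iteratively refines it using visual feedback.

Details

Motivation: Existing VLMs perform well at single-shot spatial grounding but have no mechanism to observe and correct their predictions; naive iterative prompting causes performance to collapse, revealing a fundamental gap between grounding capability and self-correction ability.

Result: On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 samples), the IVT fine-tuned model beats the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory.

Insight: The novelty is a closed-loop visual-feedback training framework (IVT) with a two-phase recipe: first, the base model's own predictions serve as realistic errors and a teacher VLM generates corrective reasoning traces, yielding supervised data without human annotation; second, Group Relative Policy Optimization (GRPO) with an IoU reward stabilizes multi-step refinement. This suggests spatial self-correction is a learnable capability that can be instilled with little data (only 2,400 samples).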

Abstract: Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model’s own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
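Both the reported metrics (Acc@0.5, Acc@0.7, Acc@0.9) and the "simple IoU reward" rest on intersection-over-union between a predicted and a ground-truth box. A minimal sketch follows, assuming boxes in `(x1, y1, x2, y2)` corner format; the helper names `iou` and `accuracy_at` are illustrative and not taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) corners


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds: Sequence[Box], gts: Sequence[Box],
                threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth
    meets the threshold (the Acc@t metric)."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

The same `iou` value could serve directly as a scalar reward for each refinement step, which is one simple way to read the reward described in the abstract.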

[56] Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing cs.CVPDF

Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan

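TL;DR: This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The model is trained on 24 seen provinces and evaluated on 8 unseen provinces without using any unseen-province image, label, or caption. It outperforms existing baselines on CLIPScore, BLEU-4, and METEOR, and retrieval markedly improves recovery of cultural vocabulary.

Details

Motivation: Generating accurate, fluent captions for culturally specific images (such as Indonesian traditional garments) is difficult in low-resource heritage settings, especially under zero-shot domain adaptation where the model must describe images from categories unseen during training.

Result: The dataset contains 3,800 expert-annotated images from 38 provinces, evaluated under a province-level inductive zero-shot protocol. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablations show a 19.3% METEOR gain from retrieval, and human evaluation confirms stronger cultural accuracy and fluency.

Insight: The core idea is to combine retrieval augmentation with frozen pretrained vision-language components such as CLIP for zero-shot captioning in a culture-specific domain. Using captions from seen provinces as the retrieval bank improves descriptions of unseen cultural concepts (such as garments from particular provinces), offering a domain adaptation approach for automated analysis of low-resource cultural heritage.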

Abstract: This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.
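The inductive constraint in the abstract, that retrieval uses only captions from training provinces, can be illustrated with a small sketch. The record layout, the function names `build_bank` and `retrieve`, the placeholder captions, and the use of cosine similarity over precomputed embeddings are all assumptions; the abstract does not describe the retrieval scoring.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_bank(entries: List[Dict], train_provinces: Set[str]) -> List[Dict]:
    """Keep only captions from training provinces, so nothing from
    validation or unseen provinces can leak into retrieval."""
    return [e for e in entries if e["province"] in train_provinces]


def retrieve(query_emb: Sequence[float], bank: List[Dict],
             top_k: int = 1) -> List[str]:
    """Return the captions of the top_k most similar bank entries."""
    ranked = sorted(bank,
                    key=lambda e: cosine(query_emb, e["embedding"]),
                    reverse=True)
    return [e["caption"] for e in ranked[:top_k]]
```

Filtering the bank before any similarity search makes the leakage guarantee structural: an unseen-province caption cannot be retrieved even when it would be the nearest neighbor.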

Wei Li, Zhen Huang, Xinmei Tian

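TL;DR: This paper proposes MACCO, a framework that improves the compositional understanding of vision-language models through cross-modal masked compositional concept modeling. Compositional concepts are masked in one modality and reconstructed from the full context of the other, so cross-modal compositional structure is captured and aligned more effectively. Experiments on five compositional benchmarks show significant gains in compositionality and in capturing syntactic structure, with benefits for text-to-image generation and multimodal large language models.

Details

Motivation: Contrastively trained vision-language models such as CLIP are weak at compositional understanding: they struggle with object relations, attribute-object binding, and word-order dependencies. This stems from reliance on global single-vector representations and from underuse of the rich compositional information in paired image-text data.

Result: Extensive experiments on five compositional benchmarks show that the method significantly enhances compositionality in vision-language models and improves their ability to capture syntactic structure and linguistic information.

Insight: The novelty is a cross-modal masked compositional concept modeling framework: masking and reconstructing compositional concepts pushes the model toward fine-grained cross-modal alignment. Two auxiliary objectives align and regularize the masked features both inter-modally and intra-modally, making fuller use of the compositional information in the data.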

Abstract: Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a “bag-of-words” behavior, struggling to capture object relations, attribute-object bindings, and word order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

[58] HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers cs.CV | cs.AIPDF

Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li

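TL;DR: This paper presents HYDRA-X, described as the first unified multimodal model (UMM) to unify image and video tokenization within a single Vision Transformer (ViT). Its holistic visual tokenizer achieves efficient spatiotemporal reconstruction through frame-level causal temporal attention and hierarchical temporal compression, and a lightweight decompressor trained under joint image-video teacher supervision adds semantic awareness to the latent space. Building on this tokenizer, the editing pipeline places source-target interaction inside the tokenizer rather than at the semantic level of the LLM, improving editing consistency and accelerating convergence.

Details

Motivation: Holistic visual tokenizers for unified multimodal models face two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space.

Result: Instantiated as a 7B dense model, HYDRA-X achieves strong performance on image and video understanding and generation tasks.

Insight: Key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; (2) hierarchical temporal compression outperforms single-step compression; (3) moving source-target interaction to the tokenizer's latent level rather than the LLM's semantic level improves editing consistency.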

Abstract: Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

[59] Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI cs.CV | eess.IVPDF

Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal

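TL;DR: This paper systematically studies two self-supervised pretraining paradigms, MAE and JEPA, for disease detection from 3D brain MRI. It adds a spectral-domain reconstruction loss and variance-covariance regularization to increase sensitivity to fine-grained anatomical structure and to encourage decorrelated features, and shows across five downstream disease detection tasks that the benefit of each self-supervised objective depends on the structure of the task.

Details

Motivation: Existing MRI foundation-model studies focus on segmentation and dense prediction, while systematic study of self-supervised foundation models for MRI-based disease detection is limited. The paper explores self-supervised pretraining suited to disease detection.

Result: Across five downstream disease detection tasks, MAE with spectral-domain supervision consistently performs best. Gains depend on task structure: spectral regularization helps most when the discriminative signal involves high-frequency anatomical structure, and covariance regularization helps most when information spans multiple decorrelated feature dimensions.

Insight: Contributions include a spectral-domain reconstruction loss for MAE and variance-covariance regularization within JEPA. The central finding is that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is conditioned on task structure.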

Abstract: Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance–covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task’s structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task’s structure.

[60] ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance cs.CVPDF

Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt

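TL;DR: This paper proposes ReFree-S2V, a framework for generating realistic, speech-synchronized portrait videos. Built on a pretrained video generation model, it injects multi-level speech representations (capturing phonetic and prosodic information) into transformer blocks through learnable level selectors to achieve both precise lip synchronization and natural expressive motion. It also introduces a reward-free reinforcement learning scheme into flow-matching training to suppress perceptually implausible head motion without handcrafted synchronization metrics or reward models.

Details

Motivation: Existing speech-driven character animation methods trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, producing animations that are either accurate but rigid or expressive but poorly synchronized.

Result: Extensive experiments show that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

Insight: Contributions: (1) multi-level speech representations (local and global granularity) combined with learnable level selectors, giving joint control of fine-grained articulation and high-level expressive cues; (2) reward-free reinforcement learning within flow-matching training to discourage unnatural head motion, avoiding handcrafted rewards and costly human preference annotation.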

Abstract: Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

[61] OR-Action: Multi-Role Video Understanding with Fine-Grained Actions cs.CVPDF

Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei

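TL;DR: This paper introduces OR-Action, the first action-centric benchmark for evaluating temporal understanding of fine-grained, multi-role actions in the operating room (OR). It is built on a publicly available ego-exocentric OR dataset, with dense action segments generated by distillation from ground-truth scene graph state changes. Experiments show that existing scene graph prediction methods struggle with temporal modeling, so the authors propose a vision-only temporal model that significantly outperforms graph-based methods when given all egocentric video inputs, plus a novel multi- to single-view feature alignment strategy that improves single-view multi-role action recognition.

Details

Motivation: Fine-grained understanding of OR activity could enable workflow-aware assistance but is difficult because of clutter, occlusions, and limited sensing. Existing methods mainly use scene graphs as an interpretable representation of OR interactions, yet converting their frame-wise relational predictions into temporally extended, fine-grained actions is challenging without explicit temporal modeling.

Result: On the OR-Action benchmark, current scene graph prediction methods struggle to model temporal structure even when explicit modeling is added through Graph Neural Networks. The proposed vision-only temporal model significantly outperforms graph-based methods when using all available egocentric video as input, and the multi- to single-view feature alignment strategy improves single-view performance on multi-role action recognition.

Insight: The paper builds the first temporal evaluation benchmark for fine-grained, multi-role action understanding in the OR, with dense action annotations distilled from ground-truth scene graph state changes. The vision-only temporal model and the multi- to single-view alignment strategy offer a route to multi-view, temporally structured fine-grained action recognition, and in particular reduce the need for extensive egocentric video capture.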

Abstract: Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

[62] Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis cs.CV | cs.AI | physics.med-phPDF

Gabriel Steele, Alzahra Altalib, Alessandro Perelli

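TL;DR: This paper proposes a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. It learns jointly in the spatial and frequency (Fourier) domains and embeds rotational equivariance constraints to improve the structural fidelity and anatomical accuracy of synthesized images. Experiments on the HECKTOR 2022 CT-PET dataset show that it outperforms baseline models.

Details

Motivation: Traditional GAN-based approaches operate only in the spatial domain and ignore geometric consistency, which limits the structural fidelity of synthesized images.

Result: Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves better synthesis quality than baseline models for CT-PET image synthesis.

Insight: The novelty is combining dual-domain (spatial and frequency) learning with rotational equivariance derived from the physics of CT and PET measurements, with a hierarchical dual-domain training strategy enforcing intra- and inter-domain consistency, which improves the accuracy and robustness of multimodal image synthesis.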

Abstract: We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

[63] MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold cs.CVPDF

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang

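TL;DR: MoVerse is a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. It handles the limited input view by separating world construction from observation rendering: the input is first expanded into a gravity-aligned 360-degree panorama, then lifted into a persistent 3D Gaussian scaffold, and finally a Gaussian-conditioned video renderer produces photorealistic video along user-specified camera trajectories.

Details

Motivation: Creating a complete, interactive, persistent, temporally coherent, high-fidelity surrounding scene from a single narrow-field-of-view image is challenging because the input observes only a small fraction of the environment, while interactive roaming requires the complete surrounding world.

Result: MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

Insight: Contributions include an architecture that separates world construction from observation rendering, topology-aware diffusion and panoramic geometry-aware residual prediction to expand and lift the input, and distillation of a bidirectional diffusion teacher into a causal autoregressive student for high-quality, low-latency rendering. The design combines the controllability of explicit 3D representations with the perceptual quality of generative video models.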

Abstract: We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360$^\circ$ panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

[64] SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation cs.CV | cs.AIPDF

Zian Yang, Zixin Wang

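TL;DR: This paper proposes SmartFont, a diffusion-based few-shot font generation framework that targets both global structural completeness and fine-grained local style fidelity. It combines global content-style generation with weakly supervised local corrective experts and adaptively fuses multi-level conditions through a denoising-state condition allocation module.

Details

Motivation: Existing few-shot font generation methods either rely on global modeling, which is imperfectly disentangled, or emphasize local modeling, which depends heavily on local priors and reference coverage. The paper argues that the key challenge is organizing complementary yet biased global and local conditions through multi-level allocation.

Result: Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

Insight: The novelty is a semantic-spatial allocation mechanism: under weak component supervision, the local branch learns expert-wise local concepts and semantic spatial maps, enabling fine-grained correction without explicit component-conditioned inference, while a dynamic condition allocation module adaptively fuses multi-level features.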

Abstract: Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

[65] VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits cs.CV

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

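TL;DR: The paper introduces VietFashion, a benchmark dataset for sketch-text composed image retrieval focused on the Ao Dai, a traditional Vietnamese garment. The dataset contains 650 hand-drawn sketches, expanded with generative models to more than 21,000 photorealistic images with aligned captions, and is intended for retrieving culturally meaningful outfits by combining sketches (which convey structure) with text (which encodes cultural semantics).

Details

Motivation: Cultural garment retrieval poses a distinct challenge because garment identity depends on subtle structural and symbolic details that standard AI models capture poorly.

Result: State-of-the-art composed image retrieval methods are benchmarked on VietFashion; the results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition.

Insight: The novelty is a multi-modal retrieval benchmark focused on a specific cultural garment (the Ao Dai) that combines sketches with text and adopts a multi-target retrieval setting to reflect the inherent ambiguity of design intent, providing a challenging testbed for fine-grained fashion retrieval.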

Abstract: Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.
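
The multi-target setting described in the abstract can be illustrated with a small scoring sketch. The abstract does not specify the evaluation protocol, so the metric below (a query counts as a hit when any of its valid targets appears in the top K) is an assumption, and the function name and toy identifiers are hypothetical.

```python
from typing import Dict, List, Set


def multi_target_recall_at_k(rankings: Dict[str, List[str]],
                             targets: Dict[str, Set[str]],
                             k: int) -> float:
    """Fraction of queries with at least one valid target in the top-k results.

    Assumed metric for illustration; not the benchmark's official protocol.
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_ids in rankings.items():
        valid = targets.get(query_id, set())
        # A single query may have several acceptable images.
        if any(image_id in valid for image_id in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical toy data: two sketch-text queries and their ranked image ids.
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
toy_targets = {"q1": {"c", "x"}, "q2": {"z"}}

recall_at_1 = multi_target_recall_at_k(toy_rankings, toy_targets, k=1)
recall_at_3 = multi_target_recall_at_k(toy_rankings, toy_targets, k=3)
```

Here `q1` is satisfied at K=3 because one of its two valid targets is ranked third, while `q2` is never satisfied.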

[66] OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data cs.CV | cs.AI

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou

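TL;DR: This paper proposes OmniDirector, a framework that encodes camera parameters as grid motion videos to achieve general multi-shot camera motion cloning without cross-paired data. It is trained on large-scale camera grid-video pairs and uses a hierarchical prompt expansion agent to coordinate characters, actions, and cameras, providing director-level control for multimodal diffusion transformers.

Details

Motivation: Existing camera motion cloning methods either cannot handle multi-shot generation or depend on scarce cross-paired data, which leads to poor performance on complex camera motion. The paper aims to address these issues and enable more general, controllable video generation.

Result: Extensive experiments show superior performance and strong controllability on camera motion cloning; the abstract does not mention specific benchmarks or quantitative comparisons.

Insight: The novelties are a visual camera grid representation that supports multi-shot generation, and a hierarchical prompt expansion agent that coordinates different control signals to provide director-level multimodal control.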

Abstract: Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations, which fail to handle multi-shot generation, or synthesize cross-paired data, which suffers from scarcity; both result in poor performance on complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

[67] VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models cs.CV

Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha

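TL;DR: This paper proposes VISA, a training-time semantic auditing framework for 3D occupancy world models. An offline vision-language model (VLM) produces a structured audit of each physical object instance, and the audit is distilled into semantic logits to improve semantic 3D occupancy accuracy, particularly on object and rare classes.

Details

Motivation: Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but errors on object and rare classes affect free-space interpretation, collision checking, and temporal state propagation. A common VLM strategy, aligning 3D voxel features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU, so a more effective semantic auditing approach is needed.

Result: On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU rises from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79, indicating consistent gains on closed-set occupancy.

Insight: The novelty is using the VLM as a reliability-aware semantic auditor rather than as a generic caption-embedding target: a structured audit (class hypotheses, plausible confusions, reliability, attributes, and evidence) is distilled through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses. Inference requires no VLM, so the method integrates into existing occupancy models without changing inference.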

Abstract: Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.
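
For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of mean intersection-over-union over flattened voxel labels. It is a generic definition, not the paper's evaluation code; the `ignore_label` value of 255 and the choice to skip classes absent from both prediction and ground truth are assumptions, and real benchmarks typically accumulate counts over a whole dataset rather than one sample.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], gt: Sequence[int], num_classes: int,
             ignore_label: int = 255) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    Assumes every non-ignored ground-truth label lies in [0, num_classes).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_label:  # unlabeled voxel: excluded from scoring
            continue
        if p == g:
            intersection[g] += 1
            union[g] += 1
        else:
            union[g] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no scored voxels")
    return sum(ious) / len(ious)


# Toy example with two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
toy_miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally to the mean regardless of voxel count, improvements on rare classes move mIoU more than their frequency would suggest.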

[68] Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization cs.CV | cs.AI

Mateo Toro Diz, Jonathan Hoss, Noah Klarmann

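TL;DR: This paper proposes a measurement-calibrated multi-camera fusion approach to address the uncertainty that detection noise, occlusions, and limited camera coverage introduce into vision-based indoor localization. It explicitly quantifies the error of each single-camera localization component (homography calibration, human detection, and motion tracking) and uses those measurements to calibrate and optimize multi-camera data fusion.

Details

Motivation: Multi-camera data fusion is usually treated as a black-box component and evaluated only end-to-end, which obscures its mechanistic contributions. The paper investigates whether explicitly characterizing single-camera localization errors can be used to calibrate and optimize multi-camera fusion.

Result: Data fusion improves localization accuracy over single-camera baselines. Measurement-calibrated fusion yields only limited improvement in absolute accuracy over standard fusion, but substantially reduces trajectory variance and improves motion smoothness, which matter for applications that need stable, continuous motion estimates.

Insight: The novelty is a measurement-calibrated fusion approach built on component-wise error quantification, together with a component-wise evaluation that quantifies each stage's error contribution. The study highlights the value of explicit error characterization when designing data fusion strategies for vision-based indoor localization, and offers a new perspective on how fusion works.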

Abstract: Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, specifically isolating homography calibration, human detection, and motion tracking. A component-wise evaluation is conducted to quantify error contributions from homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.
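
One common way to let measured per-camera error calibrate a fusion step is inverse-variance weighting, sketched below. The abstract does not state which fusion rule the authors use, so this is a generic stand-in under the assumption that each camera reports a floor-plane position with an isotropic error variance obtained from prior measurement; all names are hypothetical.

```python
from typing import List, Tuple

# One estimate per camera: (x, y, error variance in the same squared units).
Estimate = Tuple[float, float, float]


def fuse_positions(estimates: List[Estimate]) -> Tuple[float, float, float]:
    """Inverse-variance weighted fusion of independent 2D position estimates.

    Returns the fused (x, y) and the variance of the fused estimate.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    if any(var <= 0 for _, _, var in estimates):
        raise ValueError("variances must be positive")
    weights = [1.0 / var for _, _, var in estimates]
    total = sum(weights)
    x = sum(w * e[0] for w, e in zip(weights, estimates)) / total
    y = sum(w * e[1] for w, e in zip(weights, estimates)) / total
    return x, y, 1.0 / total


# Two cameras see the same person; the second has half the error variance,
# so the fused position is pulled twice as strongly toward it.
fused = fuse_positions([(0.0, 0.0, 1.0), (3.0, 0.0, 0.5)])
```

The fused variance is never larger than the smallest input variance, which is consistent with the lower trajectory variance the abstract reports, although the paper's actual mechanism may differ.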

[69] MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV | cs.LG | cs.RO

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu

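TL;DR: MaskWAM is an object-centric world-action model that addresses the spatial-representation bottleneck of existing models by treating masks uniformly as both explicit inputs and prediction targets. It uses a unified Mixture of Transformers (MoT) architecture that combines mask prompting and mask prediction to improve policy robustness and generalization.

Details

Motivation: Current world-action models suffer from spatial bottlenecks: text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and are biased by task-irrelevant backgrounds. MaskWAM aims to overcome these limitations with object-centric mask representations.

Result: Evaluations on LIBERO, RoboTwin, and real-world tasks show that MaskWAM significantly outperforms baselines on both language-clear and language-ambiguous tasks.

Insight: The novelty is unifying masks as both inputs and prediction targets: object-centric semantic supervision suppresses visual noise, and first-frame visual prompts (such as target object masks) reduce language ambiguity, yielding a precise and robust paradigm for manipulating unseen objects.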

Abstract: World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

[70] EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution cs.CV | cs.AI

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

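TL;DR: This paper proposes EvTexture++, the first event-driven framework dedicated to texture enhancement in video super-resolution (VSR). It uses high-frequency spatiotemporal details from event signals to improve texture recovery, with a texture enhancement branch and an iterative texture enhancement module that progressively refine texture regions. It also uses the continuous-time motion cues that events provide to strengthen temporal consistency and reduce texture flickering, and it can serve as a plug-and-play tool to boost existing VSR models.

Details

Motivation: Prior work uses event signals mainly for motion estimation and temporal alignment in VSR. This paper instead uses them for texture enhancement, targeting insufficient texture recovery and the temporal inconsistency (such as flickering) that large motions cause in texture regions.

Result: Experiments on five datasets show that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with PSNR gains of up to 1.55 dB on the texture-rich Vid4 dataset.

Insight: The novelty is shifting the focus of event signals from motion refinement to texture enhancement, with a dedicated texture enhancement branch, an iterative texture enhancement module, and an event-based temporal texture alignment module. Its use of the high temporal resolution of events for progressive texture restoration and consistent alignment, and its flexibility as a plug-and-play tool, are worth borrowing.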

Abstract: Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.
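
To put the reported PSNR gain in context, the sketch below computes PSNR from mean squared error and converts a dB gain into the equivalent MSE reduction factor. It uses the standard definition PSNR = 10 log10(MAX^2 / MSE) on flat pixel lists; the helper names are hypothetical, and benchmark evaluations usually add details such as evaluating on a luminance channel or cropping borders, which are not modeled here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


# A 1.55 dB gain corresponds to dividing the mean squared error by this factor.
factor = mse_ratio_for_gain(1.55)
```

Since PSNR is logarithmic in MSE, a fixed dB gain always corresponds to the same multiplicative reduction in error, independent of the starting quality.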

[71] World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible cs.CV | cs.GR

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua

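TL;DR: This paper proposes World Tracing, a generative pixel-aligned geometry representation that addresses the trade-off between faithfulness and completeness in image-to-3D methods. For each input pixel it predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. The representation is instantiated with a world-tracing diffusion transformer (WT-DiT) and outperforms both depth predictors and image-to-3D generators on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks.

Details

Motivation: Existing image-to-3D methods often trade off faithfulness against completeness: depth estimators are aligned with input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. The paper aims for a method that stays pixel-aligned while generating complete geometry beyond the visible surface.

Result: World Tracing performs strongly on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming depth predictors and image-to-3D generators.

Insight: The novelties are the pixel-aligned, ordered multi-layer stack of 3D points as a geometry representation, and the WT-DiT model, which couples multiple geometry layers through factorized and global attention. Pixel-space flow matching and a mixed noise schedule balance visible-surface reconstruction with occluded-geometry generation, and the preserved 2D-to-3D correspondence supports applications such as text-driven 3D scene editing.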

Abstract: Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

[72] Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction cs.CV | cs.GR

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

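TL;DR: Flex4DHuman is a multi-view video diffusion model that converts a monocular or sparse multi-view video of a dynamic person into synchronized dense multi-view videos using only relative camera-pose conditioning. It needs no explicit geometry priors (such as skeletons or depth maps), and the generated videos can be fed directly into downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, it integrates camera and view information through a five-axis positional encoding and is trained with a three-stage curriculum. Combined with off-the-shelf 4D Gaussian Splatting, the framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.

Details

Motivation: Reconstructing dynamic 4D humans (or animals) efficiently and at high quality from monocular or sparse multi-view video is challenging. The paper aims to avoid dependence on complex geometry priors (such as skeletons and depth maps) and to enable scalable 4D content creation for simulation, gaming, AR/VR, and video re-shooting.

Result: On the DNA-Rendering and ActorsHQ benchmarks, Flex4DHuman surpasses prior state-of-the-art (SOTA) methods; after mixed human-animal training, the same model also generalizes to animal categories.

Insight: The novelties include: generation conditioned only on relative camera pose, with no explicit geometry priors; a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry; a three-stage curriculum that progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout; and multi-view captions that enable test-time text control. The method simplifies the 4D reconstruction pipeline and makes creating dynamic content from casual videos more efficient and accessible.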

Abstract: We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

[73] SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning cs.CV | cs.AI

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee

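TL;DR: This paper proposes SpatialClaw, a training-free spatial reasoning framework that adopts code as the action interface. An agent backed by a vision-language model (VLM) writes executable code cells step by step to flexibly compose perception operations, addressing open-ended static and dynamic 3D/4D spatial reasoning.

Details

Motivation: Existing spatial reasoning agents either use single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface. Both designs limit how flexibly operations can be composed and how well the analysis can be tailored to each task, making open-ended complex spatial reasoning difficult.

Result: Across 20 benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, 11.2 points higher than the recent spatial agent, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.

Insight: The novelty is treating code as a stateful, interactive action interface: the agent writes code step by step based on prior outputs and flexibly composes perception and geometry primitives, adapting to intermediate observations and the demands of each problem. This offers more flexibility and adaptability than single-pass execution or structured tool calls.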

Abstract: Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent’s capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
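
The contrast between single-pass execution and a stateful kernel can be made concrete with a toy loop. This is a minimal sketch, not the SpatialClaw implementation: the `StatefulKernel` class is hypothetical, the "frames" are placeholder strings, and the cells are hard-coded where the real system would have a VLM write each cell after reading all prior outputs. Note that `exec` runs arbitrary code, so a real deployment would need sandboxing.

```python
import contextlib
import io
from typing import Dict, List, Optional, Tuple


class StatefulKernel:
    """Toy persistent Python namespace that executes one cell per step."""

    def __init__(self, preload: Optional[Dict[str, object]] = None) -> None:
        # Variables survive across cells because the same dict is reused.
        self.namespace: Dict[str, object] = dict(preload or {})
        self.history: List[Tuple[str, str]] = []

    def run_cell(self, source: str) -> str:
        """Execute a cell and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        output = buffer.getvalue()
        # The (cell, output) history is what an agent would condition on next.
        self.history.append((source, output))
        return output


# Placeholder inputs; a real kernel would preload image frames and primitives.
kernel = StatefulKernel(preload={"frames": ["frame0", "frame1", "frame2"]})

# The second cell depends on a variable defined by the first one.
cells = [
    "n = len(frames)\nprint(n)",
    "mid = frames[n // 2]\nprint(mid)",
]
outputs = [kernel.run_cell(cell) for cell in cells]
```

In a single-pass design both cells would have to be written before the value of `n` is known; here the second cell could be chosen after observing the first output.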

[74] RepWAM: World Action Modeling with Representation Visual-Action Tokenizers cs.CV

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen

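TL;DR: This paper proposes RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models; these preserve visual fidelity, but pixel reconstruction alone provides limited guidance for learning the instruction-following dynamics that connect future prediction with robot control. The authors therefore explore a semantic visual-action latent space: they train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens, pretrain the WAM to jointly model future visual states and the latent actions connecting them under language instructions, and then adapt it to real robot trajectories for closed-loop manipulation. Experiments show strong performance across diverse manipulation settings, and ablations highlight the advantage of semantic visual-action tokenization over reconstruction-oriented alternatives.

Details

Motivation: Existing WAMs inherit reconstruction-oriented video tokenizers, which give limited guidance for learning the instruction-following dynamics that connect future prediction with robot control. The paper explores a more semantic visual-action representation to improve the model's understanding and control of manipulation tasks.

Result: RepWAM delivers strong performance on real-world manipulation tasks and simulation benchmarks; ablations confirm the value of semantic visual-action tokenization over reconstruction-oriented alternatives, establishing a promising foundation for world action models.

Insight: The novelty is the representation visual-action tokenizer, which builds an aligned semantic visual-action latent space for jointly modeling future visual states and latent actions, better bridging visual perception and robot control as a step toward generalist robot policies. By emphasizing semantic representation over pixel reconstruction, the approach may capture task dynamics more effectively and generalize better in complex manipulation environments.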

Abstract: This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

[75] InterleaveThinker: Reinforcing Agentic Interleaved Generation cs.CV

Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo

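TL;DR: This paper proposes InterleaveThinker, a multi-agent pipeline that gives existing image generators interleaved generation (text-image sequence) capability. A planner agent organizes the input sequence and instructs the generator, and a critic agent evaluates outputs, identifies deviations, and refines the instructions, enabling interleaved generation for tasks such as visual narratives.

Details

Motivation: Existing image generators excel at single-image generation and editing, but their architectures prevent them from generating interleaved text-image sequences, and the latest unified multimodal models perform poorly here. A method is needed to add interleaved generation capability to existing generators.

Result: On interleaved generation benchmarks, InterleaveThinker achieves performance comparable to Nano Banana and GPT-5. On reasoning-based benchmarks, it also substantially improves base models; for example, 4-step FLUX.2-klein shows large gains on WISE and RISE.

Insight: The novelties include: the first multi-agent pipeline that extends image generators with interleaved generation; planner and critic agents that cooperate to refine the generation trajectory; and accuracy and step-wise rewards that let single-step reinforcement learning guide the whole generation trajectory, avoiding the computational cost of optimizing entire trajectories.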

Abstract: Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator’s outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

cs.SD

[76] AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation cs.SD | cs.CV | cs.MM

Zeyue Tian, Lei Ke, Zhaoyang Liu, Ruibin Yuan, Liumeng Xue

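TL;DR: This paper proposes AudioX-Turbo, a unified framework for efficient multimodal audio generation. Following a teacher-student paradigm, it distills the teacher model AudioX-Base, built on a Multimodal Diffusion Transformer, into a student that needs only 4 sampling steps, and it constructs a large-scale, high-quality dataset, IF-caps-Pro, for training. The model achieves strong performance on tasks such as text-to-audio and text-to-music generation while substantially reducing inference cost.

Details

Motivation: Audio generation from flexible multimodal control signals (such as text, video, and audio) faces three challenges: a unified multimodal modeling framework, large-scale high-quality training data, and the prohibitive inference cost of multi-step diffusion sampling.

Result: Across a wide range of benchmarks, the model achieves superior performance, especially on text-to-audio and text-to-music generation. It needs only 4 sampling steps and approximately 25x fewer function evaluations (NFE) than multi-step baselines.

Insight: The novelty is a unified generation framework that integrates multiple modality conditions, with Distribution Matching Distillation adapted to flow matching and a diffusion-based discriminator for high-quality few-step generation. The pipeline for building the large-scale dataset and the teacher-student distillation architecture are also worth borrowing.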

Abstract: Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address them, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.

eess.AS

[77] Adaptive Turn-Taking for Real-time Multi-Party Voice Agents eess.AS | cs.AI | cs.CL

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

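TL;DR: This paper proposes ModeratorLM, a real-time multi-party voice agent built on a speech large language model that conditions its turn-taking behavior on an explicitly assigned role. The system processes audio in a chunk-wise streaming manner and includes a reasoning-augmented variant that uses chain-of-thought reasoning. The authors construct RolePlayConv, a large-scale synthetic dataset, and validate the method on real-world meeting data and on RolePlayConv.

Details

Motivation: Turn-taking for voice agents in multi-party spoken conversations is a fundamental challenge under dynamic floor competition and varying user expectations.

Result: Experiments on real-world meeting data and RolePlayConv show that, compared with non-role-conditioned baselines, the method improves turn-taking precision by over 40% and recall by more than 70%, and substantially reduces false-positive interruptions.

Insight: The core novelty is using explicit role assignment as the key condition for turn-taking behavior, together with the large-scale synthetic dataset RolePlayConv for training and evaluation. Incorporating chain-of-thought reasoning into streaming speech processing to strengthen contextual understanding is also worth borrowing.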

Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.

cs.RO

[78] EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows cs.RO | cs.CV | cs.LG

Clinton Enwerem, John S. Baras, Calin Belta

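TL;DR: This paper proposes EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces for dexterous grasping from an object point cloud. By architectural design, it projects contacts onto the object surface and constrains forces to the Coulomb friction cone, so contact placement and friction compliance hold by construction. Experiments show stable, physically feasible grasps in simulation and on a real robot.

Details

Motivation: Most learned dexterous grasp generators relegate contact-force analysis to a downstream verification step, so a kinematically plausible grasp pose can still be physically unstable. The paper aims to generate physically stable grasps directly from object point clouds by jointly modeling pose and contact forces.

Result: Trained on a dataset of 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, the model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. On the physical robot, retargeted grasps complete open-loop pick-and-hold trials on all six test objects.

Insight: The novelty is an SE(3)-equivariant generative flow model that enforces physical feasibility through architecture rather than loss penalties, projecting contacts onto the object surface and forces into the friction cone. The paper also proves end-to-end SE(3) equivariance and demonstrates grasp retargeting to a different dexterous hand (the LEAP Hand).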

Abstract: Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
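
The friction-cone constraint mentioned in the abstract can be illustrated with a Euclidean projection onto the Coulomb cone. The abstract does not say how the architecture performs this projection, so the sketch below is one standard choice (projection onto a second-order cone), not the authors' layer. It assumes `normal` is the direction along which the contact may push, `mu` is a positive friction coefficient, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def project_to_friction_cone(force: Sequence[float], normal: Sequence[float],
                             mu: float) -> List[float]:
    """Closest force satisfying |tangential| <= mu * normal_component."""
    n_len = math.sqrt(sum(c * c for c in normal))
    if n_len == 0 or mu <= 0:
        raise ValueError("normal must be non-zero and mu positive")
    n = [c / n_len for c in normal]
    f_n = sum(f * c for f, c in zip(force, n))             # normal component
    tangential = [f - f_n * c for f, c in zip(force, n)]   # in-plane part
    t = math.sqrt(sum(c * c for c in tangential))
    if t <= mu * f_n:
        return list(force)          # already inside the cone
    if mu * t <= -f_n:
        return [0.0, 0.0, 0.0]      # in the polar cone: nearest point is the apex
    # Otherwise project onto the cone boundary.
    new_f_n = (f_n + mu * t) / (1.0 + mu * mu)
    scale = mu * new_f_n / t
    return [new_f_n * c + scale * tc for c, tc in zip(n, tangential)]


# A force with too much sliding component is pulled back to the cone surface.
projected = project_to_friction_cone([1.0, 0.0, 1.0], [0.0, 0.0, 1.0], mu=0.5)
```

After projection the tangential magnitude equals `mu` times the normal component, which is the boundary case of the no-slip condition; a force already inside the cone is returned unchanged.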

[79] Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning cs.RO | cs.AI | cs.CV | eess.SY

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

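TL;DR: This paper proposes GRASP, a framework aimed at open-vocabulary tabletop manipulation. It uses a pretrained vision-language model to translate natural-language queries into neuro-symbolic goal states and grounds them in the physical world through a bounding-box detection pipeline. GRASP can interpret abstract spatial concepts (such as "top shelf") and execute tasks without additional fine-tuning.

Details

Motivation: Existing approaches to robot task and motion planning are computationally heavy or require large amounts of demonstration data. The authors aim for a lightweight framework that needs no task-specific training and adapts to natural-language instructions in real time.

Result: Across 90 real-robot trials at three difficulty levels, GRASP achieves 73.3% overall success with no task-specific training.

Insight: The novelty is combining a pretrained vision-language model with neuro-symbolic planning and grounding language instructions to physical goals via bounding-box detection, which supports open-vocabulary instructions and on-the-fly interpretation of abstract spatial concepts without fixed color lists or hard-coded coordinates.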

Abstract: For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally “heavyweight” or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as “top shelf” and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

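[80] Trajectory-Level Redirection Attacks on Vision-Language-Action Models cs.RO | cs.CV | eess.SY

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür

TL;DR: This paper studies trajectory-level redirection attacks on vision-language-action (VLA) policies. The attacker makes small changes to the text instruction (prompt) so that it still appears to match the original task intent but actually steers the robot toward a completely different physical task, thereby controlling the final physical outcome. The paper proposes a formal threat model and an on-policy prompt search method, and validates the attack in simulation and on hardware.

Details

Motivation: Existing VLA attack research mainly focuses on eliciting specific low-level actions or making such actions persist across changing images. The authors identify a stronger trajectory-level failure mode: through a seemingly normal instruction, an attacker can fundamentally redirect the final physical outcome of the whole task trajectory, which exposes a new vulnerability in VLA instruction grounding.

Result: In simulation and hardware experiments (e.g., tabletop manipulation tasks), near-benign prompt perturbations generated by the method successfully redirect VLA trajectories to attacker-specified targets, demonstrating that the trajectory-level attack is feasible.

Insight: The core contribution is formalizing the “command-preserving trajectory redirection” threat model and proposing an attack that searches over prompts using policy rollouts. This reveals a deep safety hazard in VLA systems: semantic similarity between text instructions does not guarantee safe closed-loop behavior. The result offers a new perspective for understanding and defending against attacks on language-conditioned robot policies.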

Abstract: Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot’s final physical outcome. Project website: https://vla-redirection-attack.github.io/

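[81] SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale cs.RO | cs.CV

Nils Blank, Paul Mattes, Maximilian Xiling Li, Jakub Suliga, Thomas Roth

TL;DR: This paper presents SPARC, a risk-aware automatic labeling framework that generates structured spatial annotations (such as bounding boxes, object trajectories, and manipulation phase labels) with reliability scores for robot demonstration data. It uses the spatio-temporal structure inherent to robot tasks to calibrate annotation reliability, reducing noisy labels while retaining more useful samples.

Details

Motivation: Existing automated annotation pipelines can generate structured spatial annotations at scale but lack a reliable quality signal, forcing users to choose between accepting noisy labels and discarding useful samples. SPARC addresses this by providing calibrated reliability scores that make the annotations more usable.

Result: On 1.7k human-annotated demonstrations spanning diverse robots and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy and retains three times more samples at high-precision operating points. Models fine-tuned on SPARC annotations reach state-of-the-art results among similarly sized models on object-grounding and pointing benchmarks while remaining competitive on broader spatial-reasoning tasks.

Insight: The core idea is to derive the annotation reliability signal from the spatio-temporal structure of robot demonstrations rather than relying solely on detector confidence. This gives large-scale automated annotation a quality-calibration mechanism, so the resulting annotations can both train high-performing models (such as embodied foundation models) and improve policy performance in cluttered, visually ambiguous real-world scenes.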

Abstract: This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.
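The phrase "retaining more samples at high-precision operating points" can be made concrete with a small sketch. It assumes each annotation carries a reliability score and a ground-truth correctness flag; the function names and the toy data are hypothetical and are not taken from the SPARC code base.

```python
def precision_and_retention(scores, correct, threshold):
    """Keep annotations whose reliability score is >= threshold.

    Returns (precision among kept annotations, fraction of annotations kept).
    Precision is None when nothing is kept.
    """
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    if not kept:
        return None, 0.0
    return sum(kept) / len(kept), len(kept) / len(scores)

def max_retention_at_precision(scores, correct, target_precision):
    """Scan candidate thresholds and return (threshold, retention) with the
    highest retention whose precision meets the target, or None."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_and_retention(scores, correct, t)
        if p is not None and p >= target_precision and (best is None or r > best[1]):
            best = (t, r)
    return best

# Toy data: reliability scores and whether each annotation was actually correct.
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
correct = [1, 1, 0, 1, 0]
operating_point = max_retention_at_precision(scores, correct, target_precision=0.9)
```

A better-calibrated score pushes correct annotations above incorrect ones, so the same precision target is met at a lower threshold and more samples survive the filter.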

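[82] Mana: Dexterous Manipulation of Articulated Tools cs.RO | cs.AI | cs.CV | cs.LG

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

TL;DR: The paper presents Mana (Manipulation Animator), a sim-to-real framework that recasts dexterous manipulation as an animation problem. It uses a coarse-to-fine pipeline that turns procedurally generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The method achieves zero-shot sim-to-real transfer for grasping and in-hand manipulation on four articulated tools spanning different scales and joint types.

Details

Motivation: Manipulating articulated tools is a major challenge in dexterous robotics because it requires coordinating internal degrees of freedom with contact-rich interactions. Prior work has largely focused on rigid objects, while articulated tools remain underexplored because of their physical complexity and the difficulty of learning functional grasping and manipulation policies.

Result: On four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

Insight: The core idea is to recast dexterous manipulation as an animation problem and, borrowing from computer animation, to generate data with an automated coarse-to-fine pipeline. Data generation is largely automatic and needs only minimal human interaction (a few mouse clicks to specify functional affordances), which offers a new route to scalable sim-to-real transfer.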

Abstract: Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

cs.CY

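[83] The Tone of Awareness: Topic, Sentiment, and Toxicity Maps During Mental Health Month on TikTok cs.CY | cs.CL | cs.HC | physics.soc-ph

Henrique Ferraz de Arruda, Andreia Sofia Teixeira, Pranay Gundala Reddy, Anindya Mondal, Kleber Andrade Oliveira

TL;DR: The study collects videos and comments from Mental Health Awareness Month in 2023 and 2024 via the TikTok Research API and analyzes how the topics, sentiment, and toxicity of mental-health content vary. Topics remain stable across years, but engagement is highly concentrated in a few topics. Video content tends to be negative in sentiment, while comments shift toward mixed or positive polarity. Toxicity is low overall but shows longer-tailed outliers in comments, concentrated in specific topics.

Details

Motivation: Although TikTok use has raised concerns about mental-health effects, it remains unclear how creators frame related content and how audiences receive it. The study analyzes content from Mental Health Awareness Month to examine how the “tone” of mental-health discussion varies across topics and years.

Result: The study extracts topics with BERTopic and quantifies sentiment (XLM-T) and toxicity (Detoxify) separately for video transcriptions and comments. Sentiment in videos is often negative for emotionally charged topics, while comments (especially on suicide prevention) lean mixed or positive. Median toxicity is low overall, but outliers in comments are longer-tailed and concentrated in specific topics such as “Duet” and “Suicide Prevention”.

Insight: The novelty is operationalizing “tone” as sentiment and toxicity measures and analyzing video content (the production side) and comments (the reception side) separately, which yields a fine-grained decomposition of mental-health discussion on social media. Viewed objectively, combining topic modeling with sentiment and toxicity analysis offers a reusable framework for understanding the dynamics of online mental-health discourse.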

Abstract: Although concerns have been raised about the mental health effects associated with TikTok use, little is known about how related content is framed by creators and received by audiences. We collect the content of 28,341 TikTok videos and 80,130 comments from Mental Health Awareness Month (May) in 2023 and 2024 via the TikTok Research API, and study how the tone of awareness varies across topics and years. We characterize “tone” as the emotional and interpersonal framing of mental health discourse, operationalized through sentiment and toxicity measures. We extract topics from video text using BERTopic and log-odds keywords, then quantify topic-conditioned sentiment (XLM-T) and toxicity (Detoxify) separately for video transcriptions and comments. Sentiment captures the affective valence of content, while toxicity reflects the presence of harmful or abusive language. We find a stable set of recurring themes across years, spanning clinical conditions, emotional disclosure, self-care, and campaign-oriented content, with engagement highly skewed toward a small subset of topics. All sentiment and toxicity analyses are computed separately for video content and comments, allowing us to distinguish between content production and audience reception. Sentiment in videos is often negative for emotionally charged topics, while comments tend to shift toward more mixed or positive polarity, especially for suicide prevention. Toxicity is low in median overall, but its outliers are longer-tailed in comments than in videos and are concentrated in specific topics (e.g., “Duet”, “Suicide Prevention”, and “Psychisch”). Overall, our results provide a topic-level decomposition of mental health discourse on TikTok during awareness-month campaigns.
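The abstract mentions log-odds keywords without giving the exact formulation. A minimal sketch of one common variant, a smoothed log-odds ratio between a topic's tokens and all other tokens, is shown below. The smoothing constant `alpha`, the function name, and the toy tokens are assumptions for illustration; the authors may use a different prior or weighting.

```python
import math
from collections import Counter

def log_odds_keywords(topic_tokens, rest_tokens, alpha=0.5):
    """Smoothed log-odds ratio of each word in a topic versus the rest.

    Positive scores mark words over-represented in the topic.
    Requires at least two distinct words overall so that odds stay finite.
    """
    a, b = Counter(topic_tokens), Counter(rest_tokens)
    vocab = set(a) | set(b)
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words")
    n_a, n_b, v = sum(a.values()), sum(b.values()), len(vocab)
    scores = {}
    for word in vocab:
        p_a = (a[word] + alpha) / (n_a + alpha * v)   # smoothed share in topic
        p_b = (b[word] + alpha) / (n_b + alpha * v)   # smoothed share elsewhere
        scores[word] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores

scores = log_odds_keywords(["anxiety", "anxiety", "help"], ["help", "dance", "dance"])
top_word = max(scores, key=scores.get)
```

Words that occur at the same rate inside and outside the topic score zero, which is why this kind of measure surfaces distinctive keywords rather than merely frequent ones.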

cs.CR

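[84] ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection cs.CR | cs.CV

Fatima Qaiser, Bisma Tahir, Muhammad Abid Mughal, Nauman Shamim

TL;DR: ViPER is a vision-based malware detection model that maps binaries to grayscale images and classifies them with a visual classifier. It uses a dual-head architecture that jointly learns malware classification and packing detection, and a packing-aware gating mechanism that adjusts malware predictions according to packing state, improving robustness on packed files.

Details

Motivation: Existing visualization-based malware detection methods break down on packed executables, because packing produces high-entropy images that obscure the structural patterns the models rely on. Since packing is also common in benign software, packing state alone cannot reliably indicate maliciousness, and existing methods do not address this within a unified supervised framework.

Result: Evaluated on a dataset of 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines on all primary metrics, with a packing detection AUC of 0.9949.

Insight: Contributions include: (1) a dual-head architecture that jointly learns malware classification and packing detection; (2) a packing-aware gating mechanism that adapts the malware decision boundary to the inferred packing state; and (3) frequency-weighted losses with stratified sampling to handle packing label skew during training, which improves robustness on packed files.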

Abstract: Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.
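The byteplot representation described above maps each raw byte to one grayscale pixel. A minimal sketch of that mapping follows. The fixed row width of 256 and zero padding of the last row are assumptions, since the abstract does not state the image dimensions or resizing used for ViPER.

```python
def byteplot(data: bytes, width: int = 256, pad_value: int = 0):
    """Reshape raw bytes into rows of `width` grayscale pixels (values 0-255).

    The final row is padded with `pad_value` so that every row has equal length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])          # each byte becomes one pixel
        row.extend([pad_value] * (width - len(row)))   # pad the last, shorter row
        rows.append(row)
    return rows

image = byteplot(bytes(range(10)), width=4)
```

In such an image, structured regions like headers and code sections show visible texture, whereas compressed or encrypted payloads look like uniform noise, which is the high-entropy effect of packing that the abstract refers to.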

cs.AI

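[85] PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation cs.AI | cs.CL

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

TL;DR: This paper presents PersonaDrive, a retrieval-augmented vision-language-action (VLA) agent framework that generates non-ego traffic participants with different human driving styles (aggressive, neutral, conservative) for closed-loop driving simulation. It retrieves demonstrations from a style-instructed human driving dataset and fine-tunes the VLA model to use them as context, so styles can be switched without retraining for each style.

Details

Motivation: Non-ego traffic participants in existing closed-loop driving simulators behave in a single mode and lack the diversity of real human driving styles. Recent work introduces style variation through post-hoc labels or LLM-inferred reward weights, but these signals are proxies for style rather than demonstrations of humans driving under explicit style instructions.

Result: On the Bench2Drive benchmark, PersonaDrive (no style conditioning) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD. Under style conditioning it attains the highest driving score in every style (aggressive, neutral, conservative), with performance varying within a band of roughly 2%; its weakest style still surpasses the strongest baseline, DMW, by 5.4%. From the conservative to the aggressive instruction, average speed and acceleration rise by 18% and 25%, respectively.

Insight: The core contribution is a style-instructed human driving dataset and a three-stage pipeline: offline triplet mining, training a lightweight retrieval head, and fine-tuning a single VLA backbone to use retrieved demonstrations as context. One backbone can then adapt to different driving styles by swapping the retrieval database, with no per-style retraining, which yields efficient, diverse, human-like traffic agents.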

Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
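The inference-time style switch described above, where the same backbone is conditioned on a different per-style database, can be sketched as a plain nearest-neighbour lookup. Everything here is a simplified stand-in: the cosine-similarity scoring, the dictionary layout, and the names `retrieve`, `key`, and `demo` are hypothetical, and the paper's learned retrieval head is not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, databases, style, k=2):
    """Return the k demonstrations from the chosen style's database that are
    most similar to the query embedding."""
    entries = databases[style]
    ranked = sorted(entries, key=lambda e: cosine(query, e["key"]), reverse=True)
    return [e["demo"] for e in ranked[:k]]

# Toy per-style databases: scene embedding -> demonstration label.
databases = {
    "aggressive": [{"key": [1.0, 0.0], "demo": "a1"}, {"key": [0.0, 1.0], "demo": "a2"}],
    "conservative": [{"key": [1.0, 0.0], "demo": "c1"}, {"key": [0.0, 1.0], "demo": "c2"}],
}
query = [0.9, 0.1]
aggressive_context = retrieve(query, databases, "aggressive", k=1)
conservative_context = retrieve(query, databases, "conservative", k=1)
```

The same query yields different context depending only on the `style` argument, which mirrors why selecting a style requires no per-style retraining of the backbone.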

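[86] Zero-source LLM Hallucination Detection with Human-like Criteria Probing cs.AI | cs.CL | cs.LG

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen

TL;DR: This paper proposes HCPD, a zero-source hallucination detection method that emulates the multi-faceted reasoning of human evaluators by decomposing the judgment into a weighted set of interpretable criteria and aggregating criterion-specific scores into a truthfulness measure. It relies only on the query-answer pair, with no model internals or external references, and achieves robust, interpretable detection through a reward-based alignment scheme and a multi-sampling aggregation strategy.

Details

Motivation: To detect LLM hallucinations under the zero-source constraint, that is, from the textual query-answer pair alone without model internals or external references, and so reduce the safety risks of inaccurate model output.

Result: In extensive experiments, HCPD consistently outperforms state-of-the-art baselines, providing an effective and explainable solution for zero-source hallucination detection.

Insight: Contributions include the Human-like Criteria Probing mechanism, which emulates human multi-faceted reasoning; a reward-based alignment scheme that uses only semantic consistency as weak supervision; and a multi-sampling aggregation strategy for robust decisions with full interpretability. Viewed objectively, integrating interpretable criteria into the detection process improves practicality and reliability in the zero-source setting.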

Abstract: Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.
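The two aggregation steps named in the abstract, criterion-weighted scoring and multi-sampling, can be illustrated with a short sketch. The weighted average, the mean over samples, and the 0.5 decision threshold are assumptions made for illustration; the abstract does not specify the exact aggregation rules, and the criterion names below are invented.

```python
from statistics import mean

def truthfulness(criteria):
    """Weighted average of criterion scores.

    `criteria` is a list of (name, weight, score) tuples with scores in [0, 1].
    """
    total_weight = sum(w for _, w, _ in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for _, w, s in criteria) / total_weight

def aggregate_samples(samples, threshold=0.5):
    """Average the truthfulness of several sampled judgments.

    Returns (mean score, flagged), where flagged=True marks a likely hallucination.
    """
    avg = mean(truthfulness(c) for c in samples)
    return avg, avg < threshold

# Two sampled judgments, each with its own (hypothetical) criteria and weights.
samples = [
    [("factual support", 2.0, 1.0), ("internal consistency", 1.0, 0.4)],
    [("factual support", 1.0, 0.2), ("specificity", 1.0, 0.6)],
]
score, flagged = aggregate_samples(samples)
```

Because every sample keeps its named criteria and weights, the final score can be traced back to the individual judgments, which is the sense in which such a scheme stays interpretable.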

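[87] The Illusion of Multi-Agent Advantage cs.AI | cs.CL | cs.MA

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang

TL;DR: This paper challenges the common view that multi-agent systems (MAS) outperform single-agent systems (SAS). A systematic evaluation finds that automatically generated MAS consistently underperform the single-agent Chain-of-Thought with Self-Consistency (CoT-SC) approach on traditional reasoning tasks and interactive multi-step workflow tasks, while costing up to 10x more. The paper introduces a diagnostic synthetic dataset that exposes flaws in existing evaluation frameworks and the architectural bloat produced by automated design paradigms.

Details

Motivation: The paper questions and empirically tests the common assumption that multi-agent systems are superior to single-agent systems. It notes that existing empirical support rests mainly on benchmarks of isolated reasoning tasks, which do not adequately assess the advantages claimed for MAS (such as context protection and parallel processing).

Result: On traditional reasoning datasets and interactive multi-step workflow tasks (e.g., BrowseComp-Plus), automatically generated MAS consistently underperform the CoT-SC single-agent approach while being up to 10x more expensive. On a purpose-built diagnostic synthetic dataset, expert-designed MAS outperform automatically generated architectures in both performance and cost-efficiency.

Insight: The core finding is a fundamental flaw in current automated MAS design paradigms: they produce architectural bloat that favors superficial complexity over functional utility, which is fundamentally misaligned with multi-agent principles. The study also stresses that evaluation frameworks must account for the marginal utility of added computational cost, or they will mask critical architectural gaps and inefficiencies in complex MAS.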

Abstract: Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

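[88] OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models cs.AI | cs.CV | cs.LG | eess.IV

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

TL;DR: OpenMedQ is a medical vision-language model pretrained on the broadest fully open medical mix to date: 14 datasets totaling about 3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. It reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 80x larger, and matches the best reported BLEU-1 on VQA-MED (64.5).

Details

Motivation: To address the lack of high-quality, reproducible baseline models in the medical domain by building a vision-language model pretrained on a broad set of open medical datasets, and thereby support community research.

Result: OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), surpassing Med-PaLM M variants, and matches the best reported BLEU-1 on VQA-MED (64.5). Its vision encoder obtains the highest average macro-F1 (0.757) on 8 unseen medical classification benchmarks, ahead of models such as BiomedCLIP and PMC-CLIP.

Insight: The novelty is pretraining on the broadest fully open medical mix to date, which shows that pretraining on diverse medical data is effective. The vision encoder transfers well, and the release gives the community a reproducible baseline model and code.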

Abstract: We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling 3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants of up to 562B parameters (80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code as a reproducible baseline for the community, and an interactive demo is publicly available.

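[89] Augmentation techniques for video surveillance in the visible and thermal spectral range cs.AI | cs.CV

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

TL;DR: This paper studies data augmentation techniques for video surveillance in the visible and thermal infrared spectral ranges, focusing on multispectral CNN-based object detection. It examines how visible-spectrum data can help train deep neural networks, particularly when sufficient and practical thermal infrared datasets are hard to obtain.

Details

Motivation: To address the challenge of fusing visible and thermal infrared sensor data in intelligent video surveillance, and the difficulty of training deep neural networks when thermal infrared datasets are scarce.

Result: The paper experimentally studies how different augmentation techniques affect classification accuracy, but the abstract reports no specific quantitative results or benchmarks.

Insight: The novelty is a systematic study of how variations in thermal radiation, shape, and color information affect classification accuracy, together with strategies for using visible-spectrum data to augment thermal infrared training and improve the robustness of multispectral object detection.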

Abstract: In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors, and it is not unusual to combine them to achieve better performance. We focus on the case in which a long-wave infrared camera records continuously, a second camera records in the visible spectral range during daytime, and an intelligent algorithm supervises the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in two ways: they contain color and distinct texture information, and they carry no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and the peculiarities of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network remains a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques…

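[90] Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models cs.AI | cs.CL

Joseph Keshet

TL;DR: This paper argues that large language models (LLMs) lack moral agency: their operation is essentially a probabilistic input-output mapping learned from data, without intrinsic intentionality or commitment-based action, and the variability produced by stochastic sampling does not amount to genuine choice or authorship.

Details

Motivation: In response to recent claims that LLMs exhibit agency or can count as moral agents, the paper aims to show that these attributions are mistaken. It stresses that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action.

Result: The paper offers no experimental benchmarks or quantitative results. Through philosophical argument it concludes that, although LLM outputs are coherent and normatively evaluable, their mechanism is insufficient to support genuine agency.

Insight: The contribution is a systematic philosophical rebuttal of common arguments for LLM agency (the intentional stance, functionalism, compatibilism, and others). It emphasizes that intrinsic intentionality and commitment are essential to moral responsibility and provides a rigorous conceptual framework for AI ethics discussions.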

Abstract: Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

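[91] IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing cs.AI | cs.CV

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou

TL;DR: This paper presents IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. It models the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. A data synthesis pipeline generates multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories, and the agent is optimized with progressive supervised fine-tuning followed by geometry-aware reinforcement learning, which markedly improves code executability and geometric fidelity.

Details

Motivation: Existing automated CAD methods rely mainly on open-loop, one-shot generation, which does not match the iterative design process of real engineering practice. The paper aims to close this gap with an agent framework that supports closed-loop, interactive CAD generation and editing.

Result: Extensive experiments on the proposed IterCAD-Bench evaluation suite show that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforms existing methods in code executability and geometric precision, and exhibits superior closed-loop iterative refinement.

Insight: The novelty is modeling CAD tasks as a multi-turn, closed-loop agent framework and introducing a survivor-bias-free evaluation standard (the CD-TR curve and the AUC-TR metric) that unifies code validity and geometric precision. The data synthesis pipeline incorporates advanced industrial manufacturing features, and the reinforcement learning stage uses viable-prefix masking to improve geometric fidelity.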

Abstract: Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
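The abstract names a Chamfer Distance Tolerance-Recall (CD-TR) curve but does not define it. The sketch below shows one plausible reading: a symmetric Chamfer distance between point sets, and a recall-at-tolerance curve in which samples whose generated code failed to execute stay in the denominator. The unsquared distance, the function names, and the treatment of failures as `None` are assumptions, not the paper's definition.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets:
    mean nearest-neighbour distance from a to b plus the same from b to a."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def tolerance_recall(distances, tolerances):
    """Fraction of ALL samples whose Chamfer distance is within each tolerance.

    `distances` holds one value per sample, or None when the generated code
    did not execute. Failed samples count as misses instead of being dropped.
    """
    n = len(distances)
    return [sum(1 for d in distances if d is not None and d <= t) / n
            for t in tolerances]

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, reference)
curve = tolerance_recall([0.01, 0.2, None, 0.05], tolerances=[0.02, 0.1, 0.5])
```

Keeping failed executions in the denominator is what removes survivor bias: a model cannot raise its geometric score simply by producing fewer runnable programs.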

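[92] Reward Modeling for Multi-Agent Orchestration cs.AI | cs.CL | cs.LG | cs.MA

King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke

TL;DR: This paper proposes Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality in multi-agent systems (MAS) without human annotations. It uses intermediate artifacts from multi-agent executions to construct win-lose pairs for training a Bradley-Terry reward model, which efficiently guides orchestrator training and test-time scaling.

Details

Motivation: Multi-agent systems built on large language models (LLMs) need effective orchestration to coordinate specialized agents, but existing methods for training orchestrators suffer from limited supervision and high computational cost.

Result: Across multiple domains (mathematical reasoning, web-based question answering, and multi-hop reasoning), OrchRM improves training efficiency by up to 10x in token usage and improves MAS test-time scaling performance by up to 8% in accuracy.

Insight: The novelty is a self-supervised reward modeling approach that operates directly at the orchestration level, which avoids reliance on costly sub-agent rollouts and offers a scalable direction for robust multi-agent orchestration.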

Abstract: Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
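The Bradley-Terry objective used for training on win-lose pairs has a standard form: the preferred item wins with probability sigmoid(r_win - r_lose), and the loss is the negative log of that probability. The sketch below shows only this loss on scalar rewards. How OrchRM builds the pairs from intermediate artifacts is not described in the abstract, so the reward values here are placeholders.

```python
import math

def bt_pair_loss(r_win, r_lose):
    """Negative log-likelihood that the winner is preferred under Bradley-Terry.

    Equals -log(sigmoid(r_win - r_lose)), computed as a numerically stable softplus.
    """
    d = r_win - r_lose
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def bt_loss(pairs):
    """Mean Bradley-Terry loss over a list of (r_win, r_lose) reward pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(bt_pair_loss(w, l) for w, l in pairs) / len(pairs)

# Placeholder rewards: a well-separated pair and a mis-ordered pair.
loss_good = bt_loss([(2.0, 0.0)])
loss_bad = bt_loss([(0.0, 2.0)])
```

The loss depends only on the reward difference, so a reward model trained this way learns a ranking over orchestrations rather than an absolute quality scale.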

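[93] EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery cs.AI | cs.CL

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang

TL;DR: This paper presents EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. It argues that, as model capabilities improve, the bottleneck for autonomous scientific discovery is shifting from designing agent workflows to designing agent environments, meaning the resources, constraints, and interfaces that shape agent behavior. EurekAgent builds its environment along four dimensions (permissions engineering, artifact engineering, budget engineering, and human-in-the-loop engineering) to amplify productive behaviors and suppress harmful ones.

Details

Motivation: As LLM-based agents show growing potential for automating scientific discovery, the bottleneck is shifting from prescribing agent workflows to designing agent environments. The question is how to steer agent behavior through environment engineering so as to support open-ended exploration, systematic artifact management, and inter-agent collaboration while avoiding reward hacking and high-friction human oversight.

Result: EurekAgent sets new state-of-the-art (SOTA) results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost.

Insight: The core contribution is establishing environment engineering as a key research direction for developing reliable autonomous research agents. Systematically designing the four dimensions of the environment to amplify productive agent behavior offers a reusable framework for building more efficient and controllable autonomous discovery systems.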

Abstract: LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

cs.LG

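[94] Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents cs.LG | cs.AI | cs.CL

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

TL;DR: This paper proposes Sibling-Guided Credit Distillation (SGCD), a method for the sparse credit assignment problem in long-horizon tool-use reinforcement learning. Dynamic sampling produces successful and failed sibling rollouts, an external LLM summarizes their contrast into a stepwise credit reference, and that reference is used to reshape the policy gradient, which avoids the risk that conventional self-distillation destroys tool use.

Details

Motivation: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage signal is sparse and spread across many reasoning, API, and answer tokens. Direct token-level self-distillation provides a denser signal but can indiscriminately amplify useful skills and harmful shortcuts together, destroying tool-use ability.

Result: On the AppWorld and τ³-airline benchmarks, SGCD improves over matched GRPO baselines: AppWorld TGC rises from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 rises from 0.583 to 0.602.

Insight: The core idea is to use distillation for credit assignment rather than as a competing actor loss. Sibling-rollout contrast, a training-only stepwise credit reference generated by an external LLM, and bounded, detached credit weights reshape the policy gradient, giving more precise credit assignment while keeping the policy gradient in charge.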

Abstract: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy’s own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.
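
The last step of the pipeline, bounded credit weights reshaping a trajectory-level GRPO advantage, can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the rule used here (per-token divergence divided by its mean, then clipped to `[w_min, w_max]`) and all function names are assumptions for illustration only, not the authors' implementation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards):
    """Group-relative advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All siblings got the same reward: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def credit_weights(divergences, w_min=0.5, w_max=2.0):
    """Map per-token teacher/student divergences to bounded weights.

    Hypothetical rule: divide by the mean divergence, then clip.
    In an autograd framework these weights would be detached
    (treated as constants); here they are plain floats.
    """
    avg = mean(divergences)
    if avg == 0:
        return [1.0] * len(divergences)
    return [min(w_max, max(w_min, d / avg)) for d in divergences]


def reshape_token_advantages(advantage, divergences, w_min=0.5, w_max=2.0):
    """Spread one trajectory-level advantage over tokens, scaled by credit."""
    weights = credit_weights(divergences, w_min, w_max)
    return [advantage * w for w in weights]
```

Because the weights are clipped, no token can receive more than `w_max` times (or less than `w_min` times) the original advantage, so the outcome-based policy gradient still sets the sign and overall scale of every update.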

[95] ProPlay: Procedural World Models for Self-Evolving LLM Agents cs.LG | cs.CLPDF

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo

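TL;DR: This paper proposes ProPlay, a procedural world model for self-evolving LLM agents. It abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages, and it supports procedure-level preplay to simulate future paths, so that agents in partially observable environments keep improving their understanding of the environment and their ability to self-evolve.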

Details

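Motivation: In partially observable environments, self-evolving agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules but rarely close the loop between them to continually refine an internal understanding of environment dynamics.

Result: On public benchmarks, ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. The abstract gives no specific quantitative results.

Insight: The novelty lies in abstracting experience into procedures and building a procedure graph as a structured representation of world knowledge. Procedure-level preplay, combined with graph refinement from environment feedback, closes the loop between memory and planning and gives the agent interpretable, updatable, structured soft guidance.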

Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released at https://github.com/antman9914/proplay.
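
The preplay-then-refine loop can be sketched with a toy procedure graph. This is a simplified stand-in under stated assumptions: the paper's "reliability record embedding" is replaced by a plain success/attempt counter with Laplace smoothing, and the class name, method names, and procedure labels are all hypothetical.

```python
from collections import defaultdict


class ProcedureGraph:
    """Toy procedure graph: nodes are procedures, edges carry outcome counts."""

    def __init__(self):
        # edges[src][dst] = [successes, attempts]
        self.edges = defaultdict(dict)

    def record(self, path, success):
        """Refine the graph after an episode using its binary outcome."""
        for src, dst in zip(path, path[1:]):
            stats = self.edges[src].setdefault(dst, [0, 0])
            stats[0] += int(success)
            stats[1] += 1

    def reliability(self, src, dst):
        """Laplace-smoothed success rate of one transition."""
        successes, attempts = self.edges[src][dst]
        return (successes + 1) / (attempts + 2)

    def preplay(self, start, goal, max_depth=6):
        """Rehearse known paths; return the one with the highest
        product of transition reliabilities, plus that score."""
        best_path, best_score = None, 0.0
        stack = [([start], 1.0)]
        while stack:
            path, score = stack.pop()
            node = path[-1]
            if node == goal:
                if score > best_score:
                    best_path, best_score = path, score
                continue
            if len(path) > max_depth:
                continue
            for nxt in self.edges.get(node, {}):
                if nxt not in path:  # keep paths simple (no cycles)
                    step = self.reliability(node, nxt)
                    stack.append((path + [nxt], score * step))
        return best_path, best_score
```

In use, an agent would call `preplay` before an episode to obtain a suggested path as soft guidance, then call `record` with the executed path and its outcome afterwards.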

[96] Multi-Bitwidth Quantization for LLMs Using Additive Codebooks cs.LG | cs.CL | cs.ITPDF

Liza Babaoglu, Shuangyi Chen, Ashish Khisti

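TL;DR: This paper proposes Drop-by-Drop, a multi-bitwidth post-training quantization framework for large language models (LLMs). Grounded in information theory and successive refinement, it introduces Matryoshka-style (nested) supervision and an additive codebook structure so that weight precision can be controlled at inference time from a single trained model, trading off performance against efficiency adaptively and significantly reducing storage and memory overhead.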

Details

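Motivation: As LLMs are increasingly deployed on heterogeneous hardware with varying resource constraints, the ability to manage the performance-efficiency trade-off adaptively without retraining becomes critical.

Result: The method maintains competitive perplexity and accuracy across major architectures such as Qwen, LLaMA, Gemma, and Mistral, and lets a single checkpoint serve multiple bitwidths.

Insight: The core contribution combines the successive-refinement principle from information theory with an additive codebook structure and a nested supervision loss. The authors establish theoretically that weights following the commonly observed Gaussian distribution can be optimally reconstructed as bits are added, under a weighted mean squared error distortion motivated by LLM loss functions. This gives both a theoretical basis and a practical framework for multi-precision inference from a single model.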

Abstract: As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
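
The idea that ordered subsets of additive codebooks give partial reconstructions at each precision level can be shown with a scalar toy example. This sketch swaps in a much simpler technique than the paper's learned codebooks: greedy residual quantization with fixed ternary codebooks. The function names and the plain sum-of-squares nested loss are assumptions for illustration.

```python
def make_codebooks(num_stages=3, scale=1.0):
    """Stage k holds the codewords (-s, 0, +s) with s = scale / 3**k."""
    books = []
    for k in range(num_stages):
        step = scale / 3 ** k
        books.append((-step, 0.0, step))
    return books


def encode(w, codebooks):
    """Greedy residual encoding: each stage quantizes what is left over."""
    codes, residual = [], w
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        codes.append(idx)
        residual -= book[idx]
    return codes


def decode(codes, codebooks, num_stages):
    """Additive reconstruction from the first num_stages codebooks only."""
    return sum(codebooks[k][codes[k]] for k in range(num_stages))


def nested_loss(w, codes, codebooks):
    """Matryoshka-style objective: penalize the error of every prefix."""
    return sum(
        (w - decode(codes, codebooks, m)) ** 2
        for m in range(1, len(codebooks) + 1)
    )
```

One stored code sequence serves every precision level: decoding with fewer stages drops the later codebooks. Because each toy codebook contains a zero codeword, adding a stage never increases the reconstruction error in this sketch.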

[97] Emerging Flexible Designs for Geospatial Multimodal Foundation Models cs.LG | cs.AI | cs.CVPDF

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu

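TL;DR: This paper presents a standardized comparison of leading architectures for geospatial multimodal foundation models, focusing on flexibility across spectral band configurations. Models are pretrained with identical self-supervised learning objectives and training datasets, then evaluated on classification and segmentation tasks from the GEOBench benchmark.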

Details

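Motivation: Foundation model architectures (encoder-only, encoder-decoder, and masked autoencoding paradigms) are hard to compare consistently in terms of performance trade-offs. The paper addresses this to guide the construction of next-generation geospatial multimodal foundation models.

Result: All models are evaluated under consistent parameterization on GEOBench classification and segmentation tasks; the results reveal design trade-offs between model flexibility, modality alignment, and downstream task performance.

Insight: Comparing architectures under controlled conditions yields new insight into the trade-offs between model flexibility, modality alignment, and downstream performance, and provides practical guidance for building robust multimodal geospatial foundation models.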

Abstract: Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

[98] Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning cs.LG | cs.CLPDF

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan

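TL;DR: This paper proposes SWITCH, a switchable latent reasoning framework that uses a pair of explicit boundary tokens to control entry into and exit from latent reasoning mode. It addresses two weaknesses of conventional latent chain-of-thought methods, namely that they are hard to optimize with standard on-policy reinforcement learning and hard to interpret causally. Trained with a curriculum and a Switch-GRPO objective, it outperforms prior hidden-state-recurrence reasoning methods.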

Details

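Motivation: Existing latent chain-of-thought methods replace visible reasoning traces with continuous hidden-state recurrence, but they are difficult to optimize with standard on-policy reinforcement learning and their causal mechanisms are opaque. The paper introduces discrete boundary tokens to address both optimization and interpretability.

Result: SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning methods at similar scale. Mechanistic analysis through the boundary tokens reveals three findings: the entry token is a localized, learned switching policy; the latent step it opens performs problem-specific, causally important computation; and that computation is concentrated at a single hidden-state transition on entry.

Insight: The novelty is a pair of discrete boundary tokens acting as a switch, which makes the latent block compatible with standard on-policy reinforcement learning and provides a natural entry point for mechanistic analysis. This shows that hidden-state-recurrence latent reasoning can be trained effectively with RL and analyzed directly at the causal, mechanistic level.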

Abstract: Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits an entry token to enter latent mode and an exit token to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) the entry token is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

[99] Understanding helpfulness and harmless tension in reward models cs.LG | cs.CLPDF

Eshaan Tanwar, Pepa Atanasova

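TL;DR: This paper studies the internal mechanisms of reward models in reinforcement learning from human feedback (RLHF), focusing on the conflict between two alignment objectives, helpfulness and harmlessness. Comparing reward models trained on helpfulness only, harmlessness only, and mixed objectives, it finds that mixed-objective models often underperform single-objective models, indicating interference between objectives. Activation-based methods identify neurons associated with each objective, and targeted ablations probe their functional roles: these neurons causally support their own objective while often hurting the opposing one. Helpfulness and harmlessness also share a substantial proportion of neurons, and these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension.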

Details

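Motivation: Reward models are a key component of RLHF, aligning language models toward helpful and harmless behaviour, but the internal mechanisms behind these objectives and their conflicts are poorly understood. The paper aims to understand the tension between the two objectives in reward models and how they are represented internally.

Result: Mixed-objective reward models often underperform single-objective models, revealing interference between objectives. Neuron analysis and ablation experiments confirm that the neurons associated with each objective causally support that objective and negatively affect the opposing one. Shared neurons have a disproportionate influence on model behaviour, which helps explain the source of the alignment tension.

Insight: The contribution is the use of activation-based neuron analysis and targeted ablations to give an interpretable, mechanistic account of how alignment objectives are represented inside reward models. The finding that shared neurons play a key role in objective conflict motivates future work on disentangled and controllable alignment methods.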

Abstract: Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. A substantial proportion of neurons are shared between helpfulness and harmlessness, and these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Our results also provide a mechanistic interpretation of how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

[100] SupraBench: A Benchmark for Supramolecular Chemistry cs.LG | cs.AI | cs.CLPDF

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li

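TL;DR: This paper introduces SupraBench, the first benchmark for supramolecular chemistry, to systematically evaluate large language models on supramolecular reasoning tasks such as binding affinity prediction. It defines four core tasks plus an auxiliary vision-based task and releases the SupraPMC corpus to support domain adaptation. Experiments show that current LLMs leave substantial headroom on all tasks and that different task families have sharply different difficulty profiles and failure modes.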

Details

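Motivation: Designing host-guest systems in supramolecular chemistry is time-consuming, and no benchmark systematically evaluates the reasoning ability of LLMs in this domain.

Result: Benchmarking a broad range of open and proprietary LLMs on SupraBench shows substantial headroom for every model on every task. Domain-adaptive pretraining helps on in-distribution regression tasks but trades off against strict letter-format output.

Insight: The contribution is the first supramolecular chemistry benchmark and an accompanying corpus, which together expose task-specific capability gaps in current LLMs' supramolecular reasoning and point to directions for targeted improvement.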

Abstract: Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark yet systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

[101] VideoMDM: Towards 3D Human Motion Generation From 2D Supervision cs.LG | cs.CVPDF

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

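TL;DR: VideoMDM is a diffusion-based framework that learns 3D human motion priors directly from 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter produces approximate 3D pose sequences that act as a noisy teacher; these are diffused and denoised in 3D space and supervised with a 2D reprojection loss.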

Details

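Motivation: The goal is to learn to generate high-quality, coherent 3D human motion from abundant, easily obtained 2D video data when 3D annotations are unavailable.

Result: On the HumanML3D benchmark, VideoMDM (FID 0.88) approaches the fully 3D-supervised MDM model (FID 0.54). On the real video datasets Fit3D and NBA, the generated motions are consistently preferred by human raters, with strong quantitative results.

Insight: The novelty is a way to train a 3D motion diffusion model with 2D supervision only. The authors show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and they adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to the 2D setting, so that a coherent 3D motion manifold is learned during training.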

Abstract: We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54). On the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
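
To see why weighting a 2D reprojection error by depth can stand in for a 3D error, consider a pinhole camera: a lateral 3D offset shrinks by a factor of `focal / z` when projected, so multiplying the squared 2D error by `(z / focal) ** 2` undoes that shrinkage. The sketch below assumes this particular weight and a pinhole model; the abstract does not state the exact weighting, and the function names are hypothetical.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) with z > 0."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def depth_weighted_reprojection_loss(pred_3d, keypoints_2d, focal=1.0):
    """Mean squared 2D error, each joint weighted by (z / focal)^2.

    pred_3d:      list of predicted 3D joints (x, y, z)
    keypoints_2d: list of observed 2D keypoints (u, v)
    """
    total = 0.0
    for (x, y, z), (u, v) in zip(pred_3d, keypoints_2d):
        pu, pv = project((x, y, z), focal)
        weight = (z / focal) ** 2  # assumed weight: undo perspective scaling
        total += weight * ((pu - u) ** 2 + (pv - v) ** 2)
    return total / len(pred_3d)
```

With this weight, the same 0.1-unit sideways error contributes equally to the loss whether the joint is 2 or 4 units from the camera, whereas an unweighted 2D loss would count the nearer joint four times as heavily.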

[102] Reinforcement Learning for Neural Model Editing cs.LG | cs.CVPDF

Shaivi Malik

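TL;DR: This paper presents an exploratory framework that casts neural model editing as a reinforcement learning task in which reward feedback guides an agent to modify model weights. The framework has two environments, MaskWorld (multiplicative weight scaling) and ShiftWorld (additive weight updates), and is evaluated on bias mitigation in text classification and machine unlearning in image classification.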

Details

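Motivation: Editing pretrained neural networks usually requires specialized algorithms designed for each objective, which is time-consuming and laborious. The author explores a general framework that formalizes model editing as a reinforcement learning problem so that editing policies can be learned automatically.

Result: On the machine unlearning task, the learned policies reduce forget-set accuracy to nearly 0% while keeping retain-set accuracy above 90%. On the bias mitigation task, the policies improve bias-related performance by more than 5% while maintaining overall classification performance.

Insight: The novelty is framing neural model editing as a reinforcement learning problem. A reward function that combines utility preservation with the editing objective lets agents learn targeted modifications without a hand-designed algorithm for each task, offering an automated, general approach to model editing.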

Abstract: Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.
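
The two action types and the combined reward can be made concrete with a toy linear classifier. This is a minimal sketch under stated assumptions: the model, the data, the reward form (retain accuracy plus `lam` times one minus forget accuracy), and all names are hypothetical, and no policy learning is shown.

```python
def mask_step(weights, action):
    """MaskWorld-style edit: scale each weight multiplicatively."""
    return [w * a for w, a in zip(weights, action)]


def shift_step(weights, action):
    """ShiftWorld-style edit: add an update to each weight."""
    return [w + a for w, a in zip(weights, action)]


def accuracy(weights, data):
    """Accuracy of a linear classifier that predicts 1 when w.x > 0."""
    correct = 0
    for features, label in data:
        score = sum(w * x for w, x in zip(weights, features))
        correct += int((1 if score > 0 else 0) == label)
    return correct / len(data)


def unlearning_reward(weights, retain, forget, lam=1.0):
    """Utility preservation plus an editing term (assumed form)."""
    utility = accuracy(weights, retain)
    forgetting = 1.0 - accuracy(weights, forget)
    return utility + lam * forgetting
```

An agent in either environment would choose `action`, apply the corresponding step function, and receive `unlearning_reward` on the edited weights as feedback.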

[103] Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models cs.LG | cs.AI | cs.CLPDF

Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim

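TL;DR: This paper studies the causal influence of chain-of-thought (CoT) steps in large reasoning models and finds that reasoning crosses a 'commitment boundary', a sharp transition from transient guesses to a stable, high-confidence answer. Many CoT steps after the transition have no effect on the final answer probability and are 'epiphenomenal'. Exploiting this, the authors exit reasoning blocks early at the commitment boundary, shortening CoTs by up to 55% on average with negligible impact on performance.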

Details

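Motivation: The paper investigates the causal influence of individual steps in language-model chain-of-thought reasoning on the final answer, in order to understand how answers form during reasoning.

Result: Experiments across several model families and diverse tasks show that the method shortens CoTs by up to 55% on average with negligible impact on model performance.

Insight: The paper defines the concepts of a 'commitment boundary' and 'epiphenomenal CoT', identifies the stages of answer formation, and uses attention probes to decode those stages linearly, enabling an efficient early-exit strategy for reasoning.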

Abstract: Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer is poorly understood. We estimate each step’s causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary: a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model’s reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and that the probes generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing CoT length by up to 55% on average with negligible impact on model performance.
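
Given a per-step estimate of the final answer's probability (for example, from forcing an early exit after each step), a commitment boundary can be located after the fact as the first step from which that probability stays high. The sketch below assumes a fixed threshold and a fully observed trace; the paper instead predicts the boundary with attention probes, which is not reproduced here, and the function names are hypothetical.

```python
def commitment_boundary(answer_probs, threshold=0.9):
    """Return the first step index from which the final-answer
    probability stays at or above threshold, or None if it never does."""
    boundary = None
    for i, p in enumerate(answer_probs):
        if p >= threshold:
            if boundary is None:
                boundary = i  # candidate start of the stable regime
        else:
            boundary = None  # confidence dropped again: reset
    return boundary


def epiphenomenal_fraction(answer_probs, threshold=0.9):
    """Fraction of steps after the boundary, i.e. steps that could be
    skipped by exiting right after the commitment step."""
    boundary = commitment_boundary(answer_probs, threshold)
    if boundary is None:
        return 0.0
    return (len(answer_probs) - boundary - 1) / len(answer_probs)
```

A transient spike followed by a drop does not count as commitment in this sketch, which mirrors the distinction between intermediate guesses and a stable answer.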

Table of Contents

cs.CL [Back]

[1] EDEN: A Large-Scale Corpus of Clinical Notes for Italian cs.CL | cs.AIPDF

[2] Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures cs.CLPDF

[3] Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation cs.CLPDF

[4] MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction cs.CLPDF

[5] Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models cs.CLPDF

[6] Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants cs.CL | cs.LGPDF

[7] Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review cs.CLPDF

[8] Localizing Anchoring Pathways in Language Models cs.CL | cs.AIPDF

[9] LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling cs.CLPDF

[10] Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study cs.CL | cs.LGPDF

[11] PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue cs.CLPDF

[12] SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents cs.CLPDF

[13] Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL cs.CLPDF

[14] No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions cs.CLPDF

[15] G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents cs.CL | cs.AIPDF

[16] EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge cs.CLPDF

[17] NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning cs.CL | cs.AIPDF

[18] SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection cs.CLPDF

[19] Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization cs.CL | cs.LGPDF

[20] When Similar Means Different: Evaluating LLMs on Arabic–Hebrew Cognates cs.CLPDF

[21] Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation cs.CLPDF

[22] From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent cs.CLPDF

[23] Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data cs.CLPDF

[24] ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages cs.CL | cs.AIPDF

[25] LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories cs.CL | cs.AI | cs.LG | cs.MM | cs.ROPDF

[26] One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders cs.CL | cs.AIPDF

[27] Operads for compositional reasoning in LLMs cs.CL | math.CTPDF

[28] Recursive Agent Harnesses cs.CLPDF

[29] HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents cs.CLPDF

[30] Operadic consistency: a label-free signal for compositional reasoning failures in LLMs cs.CL | cs.LGPDF

[31] Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning cs.CL | cs.AIPDF

cs.CV [Back]

[32] Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM cs.CVPDF

[33] HairPort: In-context 3D-aware Hair Import and Transfer for Images cs.CV | cs.GRPDF

[34] Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs cs.CV | cs.AIPDF

[35] Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning cs.CVPDF

[36] ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation cs.CV | cs.LGPDF

[37] SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images cs.CVPDF

[38] GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models cs.CVPDF

[39] VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving cs.CVPDF

[40] DIMOS: Disentangling Instance-level Moving Object Segmentation cs.CV | cs.AIPDF

[41] Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning cs.CV | cs.AIPDF

[42] Language-Guided Abstraction for Visual Reasoning cs.CVPDF

[43] Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement cs.CV | cs.AIPDF

[44] Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension cs.CV | cs.CLPDF

[45] Multi-Label Test-Time Adaptation with Bayesian Conditional Priors cs.CV | cs.LGPDF

[46] A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis cs.CV | cs.AIPDF

[47] A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition cs.CVPDF

[48] Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video cs.CVPDF

[49] SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking cs.CVPDF

[50] TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment cs.CV | cs.AIPDF

[51] SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing cs.CV | cs.GR | cs.MMPDF

[52] LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck cs.CVPDF

[53] Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison cs.CVPDF

[54] PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks cs.CVPDF

[55] Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback cs.CV | cs.AIPDF

[56] Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing cs.CVPDF

[57] Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality cs.CV | cs.AI | cs.CLPDF

[58] HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers cs.CV | cs.AIPDF

[59] Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI cs.CV | eess.IVPDF

[60] ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance cs.CVPDF

[61] OR-Action: Multi-Role Video Understanding with Fine-Grained Actions cs.CVPDF

[62] Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis cs.CV | cs.AI | physics.med-phPDF

[63] MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold cs.CVPDF

[64] SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation cs.CV | cs.AIPDF

[65] VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits cs.CVPDF

[66] OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data cs.CV | cs.AIPDF

[67] VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models cs.CVPDF

[68] Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization cs.CV | cs.AIPDF

[69] MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV | cs.LG | cs.ROPDF

[70] EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution cs.CV | cs.AIPDF

[71] World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible cs.CV | cs.GRPDF

[72] Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction cs.CV | cs.GRPDF

[73] SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning cs.CV | cs.AIPDF

[74] RepWAM: World Action Modeling with Representation Visual-Action Tokenizers cs.CVPDF

[75] InterleaveThinker: Reinforcing Agentic Interleaved Generation cs.CVPDF

cs.SD [Back]

[76] AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation cs.SD | cs.CV | cs.MMPDF