Table of Contents
- cs.CL [Total: 46]
- cs.CV [Total: 68]
- cs.HC [Total: 1]
- cs.CY [Total: 1]
- cs.AI [Total: 9]
- cs.RO [Total: 6]
- cs.IR [Total: 1]
- cs.MM [Total: 2]
- cs.LG [Total: 8]
- cs.GR [Total: 1]
cs.CL
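[1] Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO cs.CL | cs.AI
Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert
TL;DR: This paper proposes a variance-aware rubric reward framework for improving large language models on heart-related medical question answering. It combines Group Relative Policy Optimization (GRPO) with rubric-based supervision derived from RaR-Medicine, replacing conventional weighted binary criterion aggregation and single overall scoring with continuous analytical reward functions. This gives a richer optimization signal for feedback that is sparse, multi-criteria, and hard to verify automatically.
Details
Motivation: Deploying general-purpose LLMs in real medical settings is hampered by data privacy constraints, high inference costs, and limited suitability for edge or on-device use. This calls for smaller, more efficient models, backed by robust post-training strategies that ensure reliable medical reasoning.
Result: On the heart-related subset of HealthBench, the best GRPO variant raises the accuracy of the Qwen3-14B base model from 0.362 to 0.502 and its F1 from 0.532 to 0.668. This is comparable to GPT-OSS-120B (accuracy 0.508, F1 0.674) and close to SOTA.
Insight: The novelty is a variance-aware reward framework that turns rubrics into continuous analytical reward functions, giving a more stable policy-optimization signal for sparse, multi-criteria feedback. The method improves heart-focused medical QA and may extend to other rubric-based tasks.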
Abstract: Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
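A minimal sketch of why a continuous rubric reward gives GRPO more to work with than an all-or-nothing score. GRPO standardizes each rollout's reward against the other rollouts sampled for the same prompt, so a group whose rewards are all identical produces zero advantage and no learning signal. The `rubric_reward` function below (a weighted fraction of satisfied criteria) is a hypothetical stand-in, not the paper's Variance-Aware reward, whose exact form the abstract does not give. The population standard deviation is also an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev


def rubric_reward(criteria_met, weights):
    """Hypothetical continuous reward: weighted fraction of rubric criteria met."""
    total = sum(weights)
    return sum(w for met, w in zip(criteria_met, weights) if met) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, each judged against three weighted criteria.
weights = [3.0, 2.0, 1.0]
rollouts = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [False, False, False],
]
rewards = [rubric_reward(r, weights) for r in rollouts]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no signal: every advantage is zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

With graded rewards, partially correct rollouts are ranked against each other instead of collapsing into one "fail" bucket.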
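[2] MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models cs.CL | cs.AI | eess.AS
Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari
TL;DR: This paper introduces MCBench, a new benchmark for evaluating the multimodal safety of Omni Large Language Models (Omni LLMs). It contains 1196 scenarios across four safety categories that require a model to integrate visual, audio, and text information to assess safety accurately. Each unsafe scenario is paired with a minimally different safe counterpart to test model sensitivity.
Details
Motivation: Existing multimodal safety benchmarks focus only on visual inputs and cannot evaluate Omni LLMs that process vision, audio, and text, so a new, comprehensive evaluation framework is needed.
Result: Evaluations of several state-of-the-art (SOTA) models reveal significant challenges: Omni LLMs struggle with subtle or non-physical risks but do better when salient visual or acoustic cues are present. The analysis shows that models can extract modality-specific information yet often fail to integrate these cues into a safety judgment.
Insight: The main contribution is MCBench, the first multimodal (vision, audio, text) safety benchmark aimed at Omni LLMs, with a minimally-different paired design that measures model sensitivity precisely. The analysis shows that current Omni LLMs lack robust cross-modal reasoning in safety-critical scenarios, which underscores the need for better architectures and training strategies.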
Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.
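[3] Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning cs.CL | cs.AI
Aimen Boukhari
TL;DR: This paper proposes a hybrid pre-training objective for self-supervised language representation learning that combines a Joint Embedding Predictive Architecture (JEPA)-style prediction loss with the standard masked language modelling (MLM) objective. By training both jointly on a single shared encoder, the model aims to learn representations with deeper semantic structure than a pure MLM objective yields.
Details
Motivation: The MLM pre-training objective relies heavily on surface-form token identity and lacks deeper semantic structure. Inspired by the success of JEPA in vision and audio, the paper explores bringing a predictive objective into language representation learning.
Result: With identical architectures and compute budgets, extensive representation analysis on five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) shows that the hybrid encoder produces more uniform embeddings (uniformity below -0.16 versus -0.05 for MLM), has richer spectral geometry, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance, even though linear-probe downstream accuracy is similar.
Insight: The novelty is combining a JEPA-style latent-space prediction loss with MLM and balancing the two objectives dynamically through a learnable scalar parameter. This reshapes the geometry of the latent space toward semantics in ways that standard accuracy metrics alone cannot capture.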
Abstract: Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
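For readers unfamiliar with embedding uniformity, here is a small sketch of one common definition: the log of the mean Gaussian potential over all pairs of L2-normalized embeddings (the alignment-and-uniformity formulation of Wang and Isola). Whether the paper uses exactly this definition, or the temperature `t=2`, is an assumption; the abstract only reports the numbers. Under this definition, lower (more negative) values mean the embeddings are spread more evenly over the unit sphere.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def uniformity(embeddings, t=2.0):
    """log mean_{i<j} exp(-t * ||z_i - z_j||^2) over unit-normalized embeddings."""
    unit = [l2_normalize(v) for v in embeddings]
    potentials = []
    for a, b in combinations(unit, 2):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))


# Toy data: fully collapsed embeddings versus embeddings spread around the circle.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

u_collapsed = uniformity(collapsed)  # every pair coincides, so the value is 0
u_spread = uniformity(spread)        # negative: points are far apart
```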
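[4] Multi-Granularity Reasoning for Natural Language Inference cs.CL | cs.AI
Chunling Xi, Di Liang
TL;DR: This paper proposes a Multi-Granularity Reasoning Network (MGRN) for natural language inference. The model explicitly exploits hierarchical semantic features in an interactive reasoning space, mimicking the human cognitive progression from shallow lexical matching to deeper semantic abstraction and logical reasoning, in order to better capture complex semantic interactions.
Details
Motivation: Existing transformer-based pre-trained models rely mainly on final-layer token representations, which are often insufficient for the complex, hierarchical semantic interactions that effective reasoning requires, because fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics tend to be entangled or diluted in a single representation space.
Result: Extensive experiments on multiple public benchmarks show that MGRN consistently outperforms strong baselines, validating the effectiveness and robustness of the approach.
Insight: The novelty is a structured multi-granularity reasoning framework that progressively integrates semantic information at different granularities to uncover the complex semantic relationships behind natural language expressions. This mimics the cognitive process of human language understanding and helps overcome the limits of a single representation space.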
Abstract: Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.
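[5] LoRi: Low-Rank Distillation for Implicit Reasoning cs.CL | cs.AI
Ryan Solgi, Jiayi Tian, Zheng Zhang
TL;DR: This paper proposes LoRi, a low-rank distillation framework for improving implicit chain-of-thought reasoning in large language models. Based on the observation that reasoning trajectories have low-rank structure, it transfers reasoning knowledge by aligning the statistics of teacher and student trajectories in a shared low-rank tensor subspace. Experiments show gains across several model families and mathematical reasoning benchmarks, approaching explicit chain-of-thought accuracy on multi-step tasks in particular.
Details
Motivation: Existing implicit chain-of-thought methods often lag behind explicit chain-of-thought prompting. The authors find empirically that hidden-state reasoning trajectories exhibit low-rank structure, which motivates a distillation framework that captures the global structure of reasoning while supporting a compact latent process.
Result: On mathematical reasoning benchmarks across multiple model families, including LLaMA and Qwen, at different scales, the method consistently improves performance. The gains are largest on challenging multi-step tasks, where it approaches explicit chain-of-thought accuracy and outperforms prior implicit chain-of-thought distillation methods.
Insight: The core idea is to exploit the low-rank nature of reasoning trajectories with a distillation framework that aligns first- and second-order statistics. This captures and transfers the global structure of reasoning more effectively and offers a new angle on knowledge distillation and on improving model reasoning.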
Abstract: Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.
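[6] Self-supervised User Profile Generation for Personalization cs.CL
Clark Mingxuan Ju, Yuwei Qiu, Tong Zhao, Neil Shah
TL;DR: This paper proposes BUMP, a self-supervised user profile generation framework for personalizing large language models. Using a bidirectional in-batch ranking objective, it trains an LLM to generate natural-language profiles from user interaction histories without any downstream task labels. On the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods that rely on labeled rewards.
Details
Motivation: Existing methods learn profile generators from labeled downstream-task rewards, which are expensive and sparse. The goal is self-supervised personalization that needs no task labels.
Result: On the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods that rely on labeled rewards, while requiring no task labels during training.
Insight: The novelty is a bidirectional user modeling framework. Its bidirectional ranking objective (profiles used as queries to rank interactions, and interactions used as queries to rank profiles), with other users in the batch serving as free negatives, produces a dense reward signal in a fully self-supervised way, without any labeled data.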
Abstract: Personalizing large language models (LLMs) has become a central challenge as LLMs are deployed across recommendation, search, dialogue, and content generation – settings where the same query should yield different answers given different users. A promising route is to summarize each user’s interaction history into a natural-language memory or profile and prepend it to the prompt to facilitate personalization. Existing methods learn such profile generators with explicit rewards derived from labeled downstream tasks, which are expensive and sparse as they require annotated supervision for every target task. In light of this challenge, we introduce Bidirectional User Modeling via Profiles (BUMP), a self-supervised framework that trains a profile generator without any downstream labels. Specifically, given a user’s interaction history, we use GRPO to train an LLM to emit a free-form textual profile under a bidirectional in-batch ranking objective: a small LLM judge measures (i) how well the generated profile, used as a query, ranks the user’s own held-out interactions above interactions from other users in the batch, and (ii) how well a held-out interaction, used as a query, ranks the user’s own profile above profiles of other users. Both directions are scored with multi-positive NDCG and combined into a dense reward per rollout; other users in the batch supply free negatives, so every training example yields supervision from raw interaction logs alone. Evaluated on the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods relying on labeled rewards, while requiring no task label at training.
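The bidirectional in-batch reward described in the abstract can be sketched roughly as follows. NDCG with binary relevance is standard; everything else here is an assumption made for illustration: the judge scores are passed in as plain numbers, the helper names are hypothetical, and the two directions are combined by a simple average, which the abstract does not specify.

```python
import math


def ndcg(ranked_is_positive):
    """NDCG with binary relevance; several positives may appear in one ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, pos in enumerate(ranked_is_positive) if pos)
    n_pos = sum(ranked_is_positive)
    if n_pos == 0:
        return 0.0
    ideal = sum(1.0 / math.log2(i + 2) for i in range(n_pos))
    return dcg / ideal


def rank_labels(scores, owners, user):
    """Sort candidates by judge score (descending); mark those owned by `user`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [owners[i] == user for i in order]


def bidirectional_reward(p2i_scores, item_owners, i2p_scores, profile_owners, user):
    """Average of profile->interaction NDCG and interaction->profile NDCG."""
    forward = ndcg(rank_labels(p2i_scores, item_owners, user))
    backward = ndcg(rank_labels(i2p_scores, profile_owners, user))
    return 0.5 * (forward + backward)


# Toy batch: user u1 has two held-out interactions; u2 and u3 supply negatives.
reward = bidirectional_reward(
    p2i_scores=[0.9, 0.8, 0.1, 0.2],
    item_owners=["u1", "u1", "u2", "u3"],
    i2p_scores=[0.7, 0.2, 0.1],
    profile_owners=["u1", "u2", "u3"],
    user="u1",
)
```

Because other users in the batch act as negatives, no task label is needed to compute this reward; only the raw interaction logs and a judge are.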
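[7] ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces cs.CL | cs.AI
Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur
TL;DR: This paper introduces ReasoningFlow, a framework that converts the complex, non-linear reasoning traces produced by large reasoning models (LRMs), with behaviors such as backtracking and self-correction, into fine-grained directed acyclic graphs (DAGs) that capture their discourse structure. The authors validate the annotation schema through manual annotation and then scale it to large-scale automatic annotation covering three tasks (math, science, argumentation) and five models. Analysis of the graphs reveals structural similarity across LRM traces, fine-grained reasoning behavior patterns, the relationship between erroneous steps and final answers, and a mismatch between mechanistic causal dependencies and language-level discourse structure.
Details
Motivation: Reasoning traces produced by large reasoning models have non-linear structure (for example backtracking and self-correction), which complicates evaluating and monitoring the reasoning process. A method is needed to understand and analyze these traces in a structured way.
Result: The authors develop and validate an annotation schema, reaching high inter-annotator agreement on 31 manually annotated traces (2.1k steps), and automatically annotate 1,260 traces (247.7k steps) spanning three tasks and five models. The analysis finds that LRMs produce structurally similar traces and surfaces fine-grained reasoning behaviors that can support better monitoring.
Insight: The novelty is a framework that turns unstructured reasoning traces into a structured graph representation (the ReasoningFlow DAG), providing a new tool for analyzing the internal reasoning of LRMs. Through graph-structure analysis, the method exposes reasoning patterns shared across models, patterns of error propagation, and a disconnect between linguistic expression and computational logic, all of which are relevant to interpretability and reliability evaluation.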
Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
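[8] When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories cs.CL | cs.AI | cs.HC | cs.LG
Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das
TL;DR: This paper proposes a two-stage approach to early failure alerting in dialogs and LLM-agent trajectories. An attention-based failure predictor learns sparse turn-level failure evidence from trajectory-level labels, and the α-STOP policy then lets the user choose the accuracy-earliness trade-off at inference time.
Details
Motivation: Supervision for early failure alerting is sparse and delayed. Conventional methods simply assign the trajectory's terminal label (success or failure) to every prefix, which assumes that every turn contains failure evidence, whereas in real language interactions failure evidence is usually sparse and appears late.
Result: On five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, the experiments show that: 1) high-relevance failure evidence occupies only 4.7-11.3% of turns on average and first appears after 59.0-83.6% of the trajectory; 2) the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision; 3) the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while cutting training cost per operating point by 1-3 orders of magnitude.
Insight: The contributions are: 1) explicitly modeling and exploiting the sparsity of failure evidence, using attention to learn turn-level evidence from trajectory-level labels; 2) α-STOP, a single preference-conditioned stopping policy that selects the operating point at inference time, avoiding the large cost of training a separate trigger for each preference and making early alerting controllable.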
Abstract: Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with $α$-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision, and that the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.
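Since the headline metric above is Pareto-frontier hypervolume, a short sketch of how a two-objective hypervolume is computed may help. It assumes both objectives (here called accuracy and earliness) are maximized and measured from a reference point of (0, 0). The operating points are made-up numbers, not results from the paper, and the paper's exact reference point and normalization are not stated in the abstract.

```python
def pareto_front(points):
    """Return the points not dominated by any other (both coordinates maximized)."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the Pareto front, measured from the reference point."""
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        # Along the front, y grows as x shrinks, so each step adds a new slab.
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


# Hypothetical (accuracy, earliness) operating points for one alerting system.
operating_points = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.9), (0.5, 0.5)]
front = pareto_front(operating_points)
hv = hypervolume_2d(operating_points)
```

A larger hypervolume means the system offers better accuracy-earliness trade-offs across the whole range of preferences, not just at a single operating point.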
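[9] CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning cs.CL
Rahul Markasserithodi, Aditya Joshi, Yuekang Li, Ishmanbir Singh, Chris Yoo
TL;DR: This paper proposes CHASE, a framework that improves LLM safety through reinforcement-learning-driven red-blue teaming. A black-box attacker and a safety-aligned defender co-evolve: the attacker learns to generate rewritten prompts that bypass safety filters, and the defender is hardened on the adversarial examples collected.
Details
Motivation: Existing safety alignment remains brittle against prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation. Existing defenses either depend on non-scalable human curation or use white-box optimization tied to specific model internals, so they cope poorly with the adaptive black-box adversaries met in deployment.
Result: On the BeaverTails and JailbreakBench benchmarks, against five held-out attack methods (PAIR, TAP, AutoDAN, PAP, Translation), CHASE reduces the mean StrongREJECT score by 43.2% while keeping a 0% false-refusal rate on benign prompts.
Insight: The core contribution is a closed-loop red-blue co-evolution framework that uses template-free reinforcement learning exploration to discover transferable latent attack primitives. This points toward LLM safety hardening that generalizes beyond the narrow distributions covered by current adversarial training.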
Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses rely either on non-scalable human curation or on white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2% with 0% false-refusal on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
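To see why a multiplicative reward "jointly enforces" two properties, compare it with an additive one. In the sketch below, both scores are assumed to lie in [0, 1] and to come from some external judge; the function names and the additive baseline are illustrative assumptions, not the paper's implementation.

```python
def multiplicative_reward(bypass_score, fidelity_score):
    """Product of two scores in [0, 1]: high only when both are high."""
    for name, value in (("bypass_score", bypass_score),
                        ("fidelity_score", fidelity_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return bypass_score * fidelity_score


def additive_reward(bypass_score, fidelity_score):
    """Equal-weight average, shown only for contrast."""
    return 0.5 * (bypass_score + fidelity_score)


# A rewrite that scores fully on one axis but drops the original intent entirely.
drifted_product = multiplicative_reward(1.0, 0.0)
drifted_average = additive_reward(1.0, 0.0)

# A rewrite that does reasonably on both axes.
balanced_product = multiplicative_reward(0.8, 0.5)
```

Under the product, a rewrite that abandons the original intent earns nothing regardless of its other score, whereas an average would still pay half the maximum reward and could be gamed.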
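[10] AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints cs.CL
Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu
TL;DR: This paper introduces AdaPlanBench, a dynamic interactive benchmark for evaluating whether LLM agents can plan and re-plan adaptively under progressively revealed world and user constraints. The benchmark is built on 307 household tasks, and a scalable constraint construction pipeline augments each task with dual constraints. Experiments show that adaptive planning under dual constraints remains challenging for current leading LLMs.
Details
Motivation: In real-world planning, language models often face both world and user constraints that are not fully specified upfront but are revealed progressively through interaction. Existing benchmarks underexplore adaptive planning under such progressively revealed dual constraints.
Result: Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. Performance degrades as constraints accumulate, user constraints are especially difficult, and failures often stem from weaker physical grounding and reduced plan effectiveness.
Insight: The novelty is the first interactive benchmark focused on LLM adaptive planning under progressively revealed dual constraints (AdaPlanBench). Its core design is a dynamic interaction protocol: a hidden constraint is revealed only when the agent proposes a plan that violates it, which forces iterative re-planning under accumulating feedback and is closer to real-world conditions. This provides a new testbed for evaluating and improving constraint reasoning and dynamic adaptation in LLMs.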
Abstract: Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
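The reveal-on-violation protocol in the abstract can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions, not the benchmark's code: a constraint is modeled as a single forbidden plan step, the class and function names are hypothetical, and the "agent" is a hard-coded function rather than an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    description: str
    forbidden_step: str  # toy model: the plan violates this if it contains the step

    def violated_by(self, plan):
        return self.forbidden_step in plan


@dataclass
class Environment:
    hidden: list
    revealed: list = field(default_factory=list)

    def submit(self, plan):
        """Reveal hidden constraints the plan violates; report whether it passes."""
        newly = [c for c in self.hidden if c.violated_by(plan)]
        for c in newly:
            self.hidden.remove(c)
            self.revealed.append(c)
        ok = not any(c.violated_by(plan) for c in self.revealed)
        return ok, newly


def run_episode(env, agent, max_turns=5):
    """Multi-turn loop: the agent re-plans using the feedback gathered so far."""
    feedback = []
    for turn in range(1, max_turns + 1):
        plan = agent(feedback)
        ok, newly = env.submit(plan)
        if ok:
            return plan, turn
        feedback.extend(c.description for c in newly)
    return None, max_turns


def toy_agent(feedback):
    """Hard-coded stand-in for an LLM planner that reacts to revealed constraints."""
    plan = ["fill kettle", "boil water on gas stove", "use ceramic mug", "pour tea"]
    if "gas stove is broken" in feedback:  # world constraint
        plan[1] = "boil water in electric kettle"
    if "user dislikes ceramic mugs" in feedback:  # user constraint
        plan[2] = "use glass cup"
    return plan


env = Environment(hidden=[
    Constraint("gas stove is broken", "boil water on gas stove"),
    Constraint("user dislikes ceramic mugs", "use ceramic mug"),
])
final_plan, turns_used = run_episode(env, toy_agent)
```

The agent never sees a constraint until it trips over it, so success depends on tracking all feedback accumulated across turns.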
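[11] QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation cs.CL
Dike Sun, Zheng Zou, Jingtong Zang, Qi Sun, Huaipeng Zhao, Tao Luo
TL;DR: This paper proposes QueryAgent-R1, a memory-augmented agentic framework for e-commerce query recommendation. Through chain-of-retrieval optimization, it couples query generation with product retrieval to address the fact that existing methods optimize only query relevance and ignore whether downstream products match user preferences, thereby improving end-to-end alignment.
Details
Motivation: Existing query recommendation methods in e-commerce search mainly optimize query-level relevance and neglect whether the retrieved products match users' downstream preferences. This leads to high query click-through rates but low product conversion rates. The paper aims to close this gap by optimizing query generation and product retrieval jointly.
Result: On two offline evaluation datasets built from proprietary industrial data and public datasets, QueryAgent-R1 consistently outperforms strong baselines. In online A/B tests on a large production platform, QueryAgent-R1 improves query click-through rate by 2.9% and guided conversion rate by 3.1%.
Insight: The core contribution is an agentic framework based on chain-of-retrieval optimization that grounds query generation in real inventory retrieval, so the agent can validate and refine queries against the retrieved products. The consistency reward designed for the agentic reinforcement learning process and the memory abstraction module for efficient user profiling are also important contributions.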
Abstract: Query recommendation in e-commerce search aims to proactively suggest queries that match users’ potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users’ downstream preferences. This mismatch often leads to high query click-through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large-scale production platform, QueryAgent-R1 improves Query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.
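[12] Rethinking LoRA Memory Through the Lens of KV Cache Compression cs.CL
Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme
TL;DR: This paper studies how LoRA adapters, acting as parameter-side memory, interact with the KV cache, acting as context-side memory, in document-level question answering. By progressively evicting the document's KV states, it finds that a document LoRA adds little while the cache is largely intact but helps substantially under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. Its value as decoding-time parametric memory emerges when context-side evidence is scarce.
Details
Motivation: The goal is to understand how parametric retrieval augmentation (such as LoRA adapters) interacts with KV cache memory, and to pin down when parameter-side memory contributes beyond the retained context in document QA.
Result: On document-level QA benchmarks, the document LoRA becomes increasingly useful as the KV cache is compressed more aggressively, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document and the adapter is applied only during answer generation, and adapters trained with QA-style supervision are stronger than those trained with next-token prediction on the raw context.
Insight: The novelty is reinterpreting document LoRA as decoding-time parametric memory rather than as a document encoder, and showing that its value as a complementary memory channel appears only when context-side evidence is scarce. QA-style supervision also yields stronger adapters, offering a new perspective on memory optimization.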
Abstract: Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document, and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token-prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.
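[13] Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems cs.CL
Yingzhuo Liu
TL;DR: This paper presents a unified framework for organizing the fast-growing research on latent communication in LLM-based multi-agent systems. The framework classifies existing methods along three orthogonal axes (what is communicated, how sender and receiver are aligned, and how information is fused), identifies the main design patterns, and points out open challenges.
Details
Motivation: Current LLM-based multi-agent systems communicate mainly in natural language, token by token, which has three structural drawbacks: high inference cost, information loss from discretization, and the ambiguity and redundancy of language. Latent communication protocols that exchange continuous representations directly are therefore an emerging direction, but the literature lacks a unified analytical framework.
Result: Under the proposed three-axis framework, the paper systematically categorizes 18 representative methods proposed between 2024 and 2026 and identifies five major design patterns.
Insight: The novelty is a unified framework that analyzes latent communication along three axes: the communicated content (WHAT), the alignment used (WHICH), and the fusion method (HOW). It provides a structured vocabulary for comparing and designing methods and surfaces open challenges such as cross-architecture alignment and the security of latent channels.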
Abstract: Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks – high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol – latent communication – in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges – including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.
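[14] Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding cs.CL | cs.AI
Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo
TL;DR: This paper proposes the Narrative Knowledge Weaver (NKW) framework for stronger reasoning in long-form narrative question answering. By aligning textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines, the framework supports reasoning over evolving elements of the story world such as character states, social relations, causal triggers, and temporal position.
Details
Motivation: Existing retrieval and graph-augmented generation methods improve access to evidence, but their units (chunks, entities, relations, and so on) do not directly encode how evidence functions in a story. They therefore handle poorly the complex reasoning, dependent on an evolving story world, that long-form narrative QA requires.
Result: On the STAGE, FairytaleQA, and QuALITY benchmarks, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations and case studies show complementary benefits for reasoning about characters, scenes, time, causality, and narrative progression.
Insight: The novelty is a narrative-centric retrieval-augmented reasoning framework that combines text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit constraints. It models directly how evidence functions in a story, improving long-form narrative understanding.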
Abstract: Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units (chunks, entities, relations, summaries, or tool actions) do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.
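[15] PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models cs.CL
Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng
TL;DR: This paper introduces PlanBench-V, the first comprehensive benchmark for evaluating vision-language models on interpreting spatial planning maps. It builds a database of spatial planning maps annotated by professional planners, with 223 maps and 1629 question-answer pairs, and designs a theory-informed evaluation framework covering four progressive capabilities: perception, reasoning, association, and implementation. Experiments show that the latest models have improved markedly but remain fundamentally limited on implementation tasks that require professional judgment and policy sensitivity.
Details
Motivation: Interpreting spatial planning maps requires fine-grained visual perception, spatial reasoning, and policy knowledge, which is challenging for both humans and AI. Existing vision-language benchmarks target general visual understanding and overlook the cognitive processes specific to planning, so a domain-specific benchmark is needed.
Result: In extensive experiments, the best 2026 agentic reasoning model, Qwen3.6-Plus, outperforms the best 2025 model, GPT-4o, by 27%. All models nevertheless still struggle on implementation-oriented tasks that require evaluative judgment, policy sensitivity, and constraint-aware decision-making.
Insight: The novelty is the first spatial planning map benchmark annotated by domain experts, together with a theory-driven four-stage evaluation framework that structures the cognitive process. The results expose fundamental limits of current vision-language models in professional settings and highlight the need for domain-adaptive multimodal reasoning frameworks.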
Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.
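[16] MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA cs.CL | cs.AI
Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu
TL;DR: This paper proposes MARDoc, a memory-aware refinement agent framework for multimodal long-document question answering. It decouples the task into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner that distills interaction traces into structured evidence and reasoning memories, and a Reflector that checks evidence sufficiency and gives targeted feedback. By relying on a dynamically updated structured memory instead of the full accumulated interaction history, the design reduces context noise while preserving key facts and their logical dependencies.
Details
Motivation: Existing iterative retrieval-reasoning agent systems usually maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, which makes multi-hop reasoning noisy.
Result: Experiments on the MMLongBench-Doc and DocBench benchmarks show that MARDoc achieves strong results and outperforms same-backbone baselines, demonstrating the effectiveness of structured memory for agentic document QA.
Insight: The core idea is to manage the complex interaction and evidence flow of long-document QA by decoupling the task and introducing structured memory. Splitting a single agent into three specialized roles backed by a dynamically updated structured memory separates evidence storage from reasoning, reduces context pollution, and makes multi-hop reasoning more robust.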
Abstract: Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.
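[17] CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement cs.CL | cs.AI | cs.CY | cs.LG
Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge
TL;DR: This paper proposes CollabBench, a benchmark for evaluating and training the collaborative ability of large language models (LLMs) in cooperative games. It uses diverse player behavior simulation and an agentic training paradigm, evaluates systematically in extended multi-player environments, and shows experimentally that the trained models outperform base models in both efficiency and affective adaptation.
Details
Motivation: Existing LLM-based agents excel at individual tasks, but effective collaboration with realistic human partners remains challenging. Existing conversation-level collaboration studies lack grounded interaction and behavioral execution, which motivates cooperative game environments that enable contextualized and immersive collaboration.
Result: Experiments with efficiency and affective metrics show that the trained models outperform base models, achieving 19.5% higher efficiency and 24.4% better affective performance. Evaluation is carried out in the extended CWAH-MultiPlayer and Cook-MultiPlayer environments.
Insight: The contributions include (1) a diverse player profile simulation pipeline that models varied player behaviors, and (2) a unified collaborative agentic training paradigm that integrates reasoning, communication, and action via agentic rollouts and uses a hybrid reward to optimize task efficiency and affective adaptation. The study also reveals collaborative limitations of existing models and offers insights for future collaborative training.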
Abstract: While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.
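[18] Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting cs.CL
Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi
TL;DR: This paper proposes two recall-based prompting strategies, Self-Recall and Question-Recall, to improve how large language models behave under a knowledge cutoff constraint, addressing the weakness of conventional direct-answer methods on questions that implicitly depend on post-cutoff knowledge.
Details
Motivation: Existing prompt-based knowledge cutoff methods rely mainly on direct answer generation, which performs poorly when a question implicitly involves knowledge from after the cutoff date, so a more effective constraint mechanism is needed.
Result: On three existing benchmarks, the proposed methods outperform direct-answer prompting and step-by-step reasoning baselines, with especially large gains on counterfactual questions. On the newly constructed Multi-cutoff Historical Event Benchmark (MHEB), combining the two recall strategies consistently achieves the best performance.
Insight: The novelty lies in strengthening knowledge-boundary control by having the model actively recall its cutoff constraint or question-relevant information that is valid under the cutoff, which offers a new approach to improving reliability on time-sensitive tasks.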
Abstract: Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is only causally related to the question. To address this limitation, we propose two recall-based prompting strategies: Self-Recall (SR), which asks the model to restate its cutoff constraint, and Question-Recall (QR), which requires the model to recall question-relevant information valid under the cutoff. Across three existing benchmarks, our methods outperform both direct-answer prompting and conventional step-by-step reasoning baselines, with particularly strong improvements on counterfactual questions. To investigate robustness across different cutoff settings, we further construct the Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under multiple cutoff years. Results show that knowledge cutoff performance varies with cutoff distance, while combining SR and QR consistently yields the best performance.
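The two recall steps described above can be illustrated with a small prompt builder. This is only a sketch: the function name `build_recall_prompt` and all prompt wording are hypothetical and are not taken from the paper, which does not give its templates in this abstract.

```python
def build_recall_prompt(question, cutoff_year, self_recall=True, question_recall=True):
    """Assemble a knowledge-cutoff prompt with optional recall steps.

    All wording here is illustrative (an assumption), not the paper's template.
    """
    lines = [f"Act as if nothing after {cutoff_year} is known to you."]
    steps = []
    if self_recall:
        # Self-Recall (SR): the model restates its cutoff constraint.
        steps.append("Restate your knowledge cutoff constraint in your own words.")
    if question_recall:
        # Question-Recall (QR): the model recalls facts valid under the cutoff.
        steps.append(
            f"Recall facts relevant to the question that were already valid in {cutoff_year}."
        )
    steps.append("Answer the question using only that information.")
    lines += [f"Step {i}: {s}" for i, s in enumerate(steps, start=1)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_recall_prompt("Who holds the 100 m world record?", 1990))
```

Disabling both flags reduces the prompt to a plain direct-answer instruction, which makes it easy to compare the four combinations (none, SR, QR, SR+QR) on the same question.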
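[19] ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL cs.CL
Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li
TL;DR: This paper proposes ProSPy, a profiling-driven SQL-Python agentic framework for Text-to-SQL over enterprise-scale databases. The framework extracts fine-grained evidence through automatic data profiling, progressively prunes large schemas, fetches intermediate views through a dialect-agnostic SQL interface, and finally uses Python for flexible downstream analysis, addressing the large schemas, incomplete metadata, complex SQL dialects, and hard analytical questions found in enterprise databases.
Details
Motivation: Applying large language models to Text-to-SQL over enterprise-scale databases faces practical challenges: large and heterogeneous schemas, incomplete metadata, dialect-specific SQL, and complex analytical questions that are hard to solve with a single SQL query.
Result: On the Spider 2.0-Lite and Spider 2.0-Snow benchmarks, ProSPy consistently outperforms strong baselines with both open-source and proprietary models. With Claude-4.5-Opus and no majority voting, it reaches execution accuracies of 60.15% and 60.51%, respectively. Analysis shows that the framework is robust to SQL dialect variation and achieves a favorable trade-off between schema recall and precision.
Insight: The claimed novelty is combining the efficiency of SQL over large databases with the flexibility of Python analysis in a staged, profiling-driven agentic framework that reduces reliance on unreliable metadata and improves robustness across SQL dialects. Viewed objectively, two design ideas are worth borrowing: the structured reasoning flow that decomposes a complex task into four stages (profiling, pruning, querying, analysis), and the dialect-agnostic SQL interface that decouples query logic from any specific dialect.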
Abstract: Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL–Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.
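[20] Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads cs.CL | cs.AI
Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou
TL;DR: This paper presents an interpretability study of multimodal large language models (MLLMs) and identifies a structural property called "functional sparsity". Using a metric called Retrieval Attention Mass (RAM), it identifies a small set of highly specialized attention heads, called Context-aware Retrieval (CoRe) heads. These heads specialize in extracting query-relevant features from complex visual contexts, while other heads attend to broader context. Causal intervention and acceleration experiments demonstrate the necessity and practical utility of CoRe heads.
Details
Motivation: Although MLLMs perform well on complex vision-language tasks, the mechanism by which they extract query-relevant visual features from complex, noisy contexts remains unclear. The paper aims to reveal the internal workings of MLLMs, in particular the structural properties of cross-modal retrieval.
Result: Across visual domains and model scales, CoRe heads act as dedicated information extractors. Ablation experiments show that removing only the top 5% of CoRe heads significantly degrades multimodal reasoning performance, whereas removing lower-ranked heads has minimal effect. Acceleration experiments confirm that exploiting this localized sparsity significantly speeds up inference while maintaining robust task performance.
Insight: The core contribution is the discovery of the structural principle of "functional sparsity" in MLLMs and the definition of specialized CoRe heads. This offers a new perspective on mechanistic interpretability and lays a theoretical foundation for future architecture design and model optimization, such as efficient inference. Viewed objectively, the fine-grained partitioning and quantification of attention-head function is a valuable contribution to interpretability research.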
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.
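To make the head-ranking idea concrete, here is a minimal sketch of a per-head attention-mass score. The abstract does not define Retrieval Attention Mass precisely, so this assumes it is the fraction of one query token's attention that each head places on query-relevant context tokens. The functions `retrieval_attention_mass` and `top_heads` are hypothetical names.

```python
import numpy as np


def retrieval_attention_mass(attn, relevant_mask):
    """Per-head attention mass on relevant tokens (assumed RAM definition).

    attn: array of shape (num_heads, num_context_tokens); each row holds one
          head's attention weights for a single query token and sums to 1.
    relevant_mask: boolean array of shape (num_context_tokens,) marking the
          query-relevant (e.g. visual) tokens.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(relevant_mask, dtype=bool)
    return attn[:, mask].sum(axis=1)


def top_heads(scores, fraction=0.05):
    """Indices of the highest-scoring heads, keeping at least one head."""
    k = max(1, int(np.ceil(fraction * len(scores))))
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    attn = [[0.7, 0.2, 0.1],   # head 0 focuses on the relevant tokens
            [0.1, 0.1, 0.8]]   # head 1 spreads attention elsewhere
    scores = retrieval_attention_mass(attn, [True, True, False])
    print(scores, top_heads(scores, fraction=0.5))
```

An ablation study in this spirit would zero out the heads returned by `top_heads` and compare task performance against ablating the same number of low-ranked heads.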
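[21] TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization cs.CL
Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang
TL;DR: TARPO is a pure reinforcement learning framework for strengthening the reasoning ability of large language models. A lightweight action head router adaptively switches between discrete token generation and continuous latent reasoning at each reasoning step, combining the strengths of both paradigms. The method outperforms existing explicit and latent reasoning baselines on multiple benchmarks and shows stable training dynamics.
Details
Motivation: Deterministic representations in continuous latent reasoning limit policy exploration in reinforcement learning. The paper aims to develop a framework that adaptively combines discrete and continuous reasoning to improve the reasoning expressiveness and exploration efficiency of large language models.
Result: Extensive experiments on Qwen2.5 (1.5B to 7B) and Llama-3.1-8B backbones show that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across multiple benchmarks.
Insight: The core innovation is a token-level adaptive routing mechanism: a learnable router dynamically selects the reasoning mode at each step, which preserves the stochasticity of discrete sampling to encourage exploration while exploiting the expressiveness of continuous representations. This end-to-end joint optimization framework offers a new way to combine different reasoning paradigms.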
Abstract: Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.
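The two ingredients named above, a stochastic binary routing decision and a group-relative advantage, can be sketched as follows. The advantage uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation). The logistic router and its parameters `w` and `b` are assumptions for illustration, not the architecture from the paper.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def sample_mode(hidden, w, b, rng):
    """Sample a routing decision from a binary mode space.

    A single logistic unit stands in for the router (an assumption).
    Returns the sampled mode and the probability of the latent mode.
    """
    p_latent = 1.0 / (1.0 + np.exp(-(np.dot(hidden, w) + b)))
    mode = "latent" if rng.random() < p_latent else "token"
    return mode, p_latent


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print(sample_mode(np.ones(4), np.zeros(4), 0.0, rng))
```

Because the mode is sampled rather than chosen by argmax, the routing decision itself has a log-probability that a policy-gradient update can weight by the shared advantage.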
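[22] Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models cs.CL
Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo
TL;DR: This paper proposes RandomBench, a benchmark for evaluating whether multimodal large language models (MLLMs) maintain distributionally neutral random behavior in logic-neutral scenarios, that is, when multiple options have equal utility. The study finds that MLLMs widely exhibit "Stochastic Collapse": even under explicit random instructions, models fail to maintain uniform randomness and show strong distributional bias.
Details
Motivation: Current MLLM evaluations focus heavily on utility-driven objectives and overlook model behavior in logic-neutral scenarios. Stochasticity is essential where multiple actions are equally valid, such as recommending itineraries or schedules, and deterministic policies may lead to repetitive behavior and insufficient coverage of valid options.
Result: Experiments show that Stochastic Collapse is pervasive in MLLMs. For example, the top-1 probability of Claude Sonnet 4.6 rises from the ideal 25% to 97%, and its randomness index (RI) drops to 0.068. Extensive ablation studies show that this bias persists across languages and representation formats.
Insight: The novelty is RandomBench, the first benchmark dedicated to evaluating how well MLLMs preserve randomness in logic-neutral decisions, together with three quantitative metrics: RI, BCI, and BII. Viewed objectively, this exposes a systematic deficiency of current MLLMs in distributional fairness and offers a new perspective on implicit model bias.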
Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics (RI, BCI, and BII) to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% against an ideal baseline of one quarter and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
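One simple way to quantify how far repeated choices drift from uniform is normalized Shannon entropy, sketched below. This is a generic measure chosen for illustration; the abstract does not define RI, BCI, or BII, so the function `normalized_entropy` should not be read as the paper's RI.

```python
import math
from collections import Counter


def normalized_entropy(choices, num_options):
    """Shannon entropy of observed choices, scaled to [0, 1].

    1.0 means the choices are spread uniformly over `num_options`;
    0.0 means the same option was picked every time.
    """
    counts = Counter(choices)
    n = len(choices)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_options)


if __name__ == "__main__":
    uniform = ["A", "B", "C", "D"] * 25
    collapsed = ["A"] * 97 + ["B", "C", "D"]
    print(normalized_entropy(uniform, 4))
    print(normalized_entropy(collapsed, 4))
```

Passing `num_options` explicitly matters: options that were never chosen do not appear in the counts, but they still belong in the denominator.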
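[23] Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations cs.CL | cs.AI
Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix
TL;DR: This paper studies how large language models acting as Artificial Moral Advisors (AMA) can use uncertainty strategies to help their interlocutors "stay with the uncertainty". The authors propose three uncertainty modes (Perspective-Multiplying, Tension-Preserving, Process-Reflecting), compare them against three control conditions (Baseline, Persuasive, Sycophantic), and evaluate the strategies through simulated conversations and questionnaires.
Details
Motivation: As LLMs are deployed as artificial moral advisors in a variety of settings, it is necessary to explore which conversational patterns they should display, and in particular how they can help interlocutors handle the uncertainty in moral dilemmas.
Result: The study finds that (1) no single model dominates as a simulated user agent: open models align with human ambiguity through between-persona divergence, closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity, while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce but in the quality of engagement they sustain.
Insight: The novelty lies in systematically proposing three uncertainty-scaffolding strategies and, through an LLM-to-LLM simulated conversation framework, revealing the distinct role each strategy plays in moral advisory dialogue. It highlights engagement quality, rather than stance change, as an evaluation criterion.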
Abstract: LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors “stay with the uncertainty”. We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.
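[24] Representing Research Attention as Contextually Structured Flows cs.CL | cs.LG
Jessica Rodrigues, Angelo Salatino, Gard Jenset, Scott Hale
TL;DR: This paper proposes "attention flows", a contextually structured representation that encodes how research attention is organized and how it evolves over time, addressing the fact that conventional aggregated-count representations do not preserve how attention develops across contexts.
Details
Motivation: Research attention is commonly used as an indicator of visibility, influence, and societal uptake, but conventional representations such as aggregated counts cannot capture how attention develops across contexts, creating a mismatch between how attention is interpreted and how it is represented.
Result: On a benchmark built on analogy-style reasoning, flow representations support structural comparison more effectively than signal and sequence representations, particularly where attention is shaped by temporal progression or context distributions. Learned flow representations also improve robustness under partial observation and structural perturbation.
Insight: The novelty is modelling research attention as a contextually structured phenomenon and proposing the attention-flow representation, which provides a more informative basis for research evaluation. Viewed objectively, the method enables finer-grained and more robust comparative analysis by encoding the structure of how attention evolves across time and context.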
Abstract: Research attention is widely used as an indicator of visibility, influence, and societal uptake, yet it is typically represented as aggregated counts that do not preserve how attention develops across contexts over time. This creates a mismatch between how attention is interpreted and how it is represented. We propose attention flows as contextually structured representations that encode the organisation of attention and its evolution over time. We evaluate whether these representations capture transferable structure by constructing a benchmark based on analogy-style reasoning across research outputs. Comparing signal, sequence, and flow-based representations, we find that flow representations more effectively support structural comparison, particularly in settings where attention is shaped by temporal progression or context distributions. We further show that learned flow representations improve robustness under partial observation and structural perturbation. Overall, these results support modelling attention as a contextually structured phenomenon and provide a basis for more informative approaches to research evaluation.
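[25] Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version) cs.CL | cs.AI
Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała
TL;DR: This paper proposes a retrieval-augmented generation (RAG) system built on a simple graph structure, aiming to reduce hallucinations of large language models in complex question answering. The system combines vector search and graph queries through a lightweight graph structure and a dedicated toolset, and is evaluated on a structured dataset based on Wikipedia.
Details
Motivation: LLMs and LLM-based systems such as RAG still suffer from hallucinations and factual errors in complex question answering. The goal is to address this while avoiding expensive model fine-tuning and enabling reasoning over proprietary information.
Result: Evaluation on MoNaCo, a challenging Wikipedia benchmark for complex question answering, shows that adding graph-based tools significantly improves the precision and recall of factual correctness, halves the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios, with only a modest increase in token usage.
Insight: The novelty is combining a lightweight graph structure with a RAG system and using dedicated graph query tools to strengthen fact retrieval and verification. Viewed objectively, combining structured graph queries with conventional vector search offers an effective way to improve the factuality and interpretability of RAG systems at a controlled computational cost.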
Abstract: Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking both to avoid the well-known risk of the LLM “hallucinating” information and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training, without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this comes with only a modest increase in token usage.
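[26] ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL cs.CL
Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang
TL;DR: This paper proposes ACE-SQL, a reinforcement learning framework that jointly optimizes schema retrieval and SQL generation for Text-to-SQL. The method builds an online column-set pool from the generator's rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts, achieving bidirectional adaptation between the retriever and the generator.
Details
Motivation: Existing methods either rely on full-schema generation, which leaves schema linking implicit within a huge search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. The paper targets the optimization of schema linking, a critical step in SQL generation.
Result: On BIRD Dev, ACE-SQL achieves 65.3% greedy execution accuracy while using only about 0.93k output tokens per query.
Insight: The core innovation is a reinforcement learning framework based on empirical credit assignment that jointly optimizes retrieval and generation under execution feedback. The retriever and generator adapt to each other on-policy, avoiding the potential suboptimality of static supervision targets.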
Abstract: Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever’s evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at https://github.com/xbchen1/ACE-SQL.
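The target-selection step described above (pick the column set most often seen in execution-correct rollouts) can be sketched in a few lines. The rollout format `(columns, is_correct)` and the function name `select_retrieval_target` are assumptions for illustration; tie-breaking and pool maintenance across training steps are not specified in the abstract and are left out.

```python
from collections import Counter


def select_retrieval_target(rollouts):
    """Return the column set most frequently tied to correct executions.

    rollouts: iterable of (columns, is_correct) pairs, where `columns` is the
              collection of schema columns used by one generated SQL query and
              `is_correct` says whether its execution result matched.
    Returns a frozenset of columns, or None if no rollout was correct.
    """
    counts = Counter(frozenset(cols) for cols, ok in rollouts if ok)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    rollouts = [
        (["orders.id", "orders.total"], True),
        (["orders.total", "orders.id"], True),   # same set, different order
        (["orders.id", "users.name"], True),
        (["users.name"], False),
    ]
    print(select_retrieval_target(rollouts))
```

Using `frozenset` makes the target independent of column order, so two queries that touch the same columns count toward the same pool entry.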
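[27] Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach cs.CL | cs.AI
Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He
TL;DR: This paper proposes a multi-aspect iterative refinement framework that generates high-quality literary translation references and preference data, addressing the scarcity of high-quality annotated data in literary translation. Using the generated data for supervised fine-tuning and reinforcement learning, the resulting LitMT models achieve performance comparable to Claude Sonnet 4.5 on a literary translation benchmark.
Details
Motivation: To address the scarcity of high-quality annotated data in literary translation and to balance fluency of expression with literary effect.
Result: On the MetaphorTrans English-to-Chinese literary translation benchmark, LitMT-8B and LitMT-14B score 67.25 and 69.07 CEA100, respectively, comparable to Claude Sonnet 4.5 (68.43). Using the generated references for supervised fine-tuning improves on the original ground truth by 8.65 CEA100 points, and GRPO reinforcement learning adds a further 1.51 points.
Insight: The novelty is iteratively generating high-quality data with specialized LLM translators that each target a different quality aspect. The paper also finds that the stability of two-stage training combined with GRPO's online exploration outperforms DPO, effectively improving literary translation quality.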
Abstract: Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO’s online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).
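[28] To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection cs.CL | cs.AI | cs.CV | cs.IR | cs.LG | cs.MM | eess.AS
Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan
TL;DR: This paper proposes a query-adaptive audio-visual person retrieval framework that uses active modality detection to decide whether to apply multimodal fusion. In real-world broadcast video archives, a target person may appear only by voice or only by face, so fixed fusion introduces noise. The method detects active modalities through cross-modal score consistency and significantly improves retrieval precision on the BBC Rewind corpus.
Details
Motivation: In real broadcast video archives, a target person may be only heard or only seen, and fixed multimodal fusion lowers retrieval precision because the absent modality injects noise.
Result: On the BBC Rewind corpus (more than 12,000 broadcast videos), the adaptive system reaches 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap to an oracle system that uses ground-truth modality labels (96.6%).
Insight: The novelty is a query-adaptive modality selection framework that detects active modalities through cross-modal score consistency and avoids futile fusion when a modality is absent. This suggests a new approach to incomplete real-world multimodal data: deciding dynamically whether to fuse is more effective than always fusing.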
Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
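The "64% of the gap" figure can be checked with a one-line calculation. The abstract does not say which system the gap is measured from; the sketch below assumes fixed fusion (90.0%) is the starting point, since that choice reproduces the reported number, whereas starting from face-only (93.4%) would give 25%. The helper `gap_recovered` is a hypothetical name.

```python
def gap_recovered(system, baseline, oracle):
    """Fraction of the baseline-to-oracle gap closed by `system`."""
    return (system - baseline) / (oracle - baseline)


if __name__ == "__main__":
    # P@1 values from the abstract, in percent.
    adaptive, fixed_fusion, face_only, oracle = 94.2, 90.0, 93.4, 96.6
    print(round(100 * gap_recovered(adaptive, fixed_fusion, oracle)))  # vs fixed fusion
    print(round(100 * gap_recovered(adaptive, face_only, oracle)))     # vs face-only
```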
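[29] EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation cs.CL | cs.AI
Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang
TL;DR: This paper proposes EGTR-Review, a framework that uses multi-agent teacher distillation to efficiently generate evidence-grounded and traceable scientific peer review comments. It first builds a multi-agent teacher that decomposes the paper structure, extracts key elements, retrieves external scholarly evidence, performs verification reasoning, and synthesizes review comments; it then distills the teacher's reasoning trajectories and final comments into a lightweight student model.
Details
Motivation: Existing LLM-based peer review generation methods produce generic comments with insufficient evidence support and weak source traceability, and complex multi-agent systems incur high inference costs.
Result: Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms prompt-based, fine-tuned, and structured/agentic baselines on automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability and substantially reducing token consumption and inference time.
Insight: The core innovation is using multi-agent teacher distillation to compress a complex, evidence-driven multi-step reasoning process into a single efficient model, balancing performance and efficiency. Technical highlights include structure-aware paper decomposition, evidence-state labeling, task-prefix-driven multi-task learning, and an evidence-weighted objective that reduces the influence of weak supervision.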
Abstract: Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.
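[30] IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval cs.CL
Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang
TL;DR: This paper proposes IA-RAG, an interval-algebra-driven hierarchical temporal RAG framework for dynamic knowledge retrieval. The framework models knowledge as time intervals, performs retrieval under formal temporal constraints, and introduces a Sub-graph Time Tightening mechanism to handle incomplete or uncertain temporal boundaries, supporting complex temporal reasoning tasks.
Details
Motivation: Existing RAG and Graph RAG frameworks mostly treat knowledge as static or use only coarse-grained timestamps, and cannot capture rich temporal structures such as duration, overlap, and containment. A method is needed that models fine-grained temporal relations and supports dynamic knowledge retrieval.
Result: Experiments on several temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, show that IA-RAG performs strongly in temporal retrieval and reasoning, particularly on complex compositional temporal reasoning tasks.
Insight: The novelty is representing knowledge as Interval Event Units and building a hierarchical Thematic Forest, governed by Allen's Interval Algebra, to model temporal dependencies. The Sub-graph Time Tightening mechanism handles fuzzy temporal boundaries, enabling retrieval and reasoning over implicit temporal semantics.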
Abstract: Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen’s Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.
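Allen's Interval Algebra, which the framework above builds on, defines thirteen mutually exclusive relations between two intervals. The sketch below classifies a pair of intervals into one of them. It assumes proper intervals given as `(start, end)` with `start < end`, and it illustrates the algebra only, not IA-RAG's retrieval logic.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval `a` with respect to interval `b`.

    Both intervals are (start, end) pairs with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    if b_start < a_start < b_end < a_end:
        return "overlapped-by"
    if b_end == a_start:
        return "met-by"
    return "after"  # only remaining case: b_end < a_start


if __name__ == "__main__":
    # Hypothetical facts: a term of office and an event inside it.
    term = (2009, 2017)
    event = (2012, 2013)
    print(allen_relation(event, term))  # during
    print(allen_relation(term, event))  # contains
```

A temporal retriever could use such a classifier as a filter, for example keeping only facts whose interval is "during" or "overlaps" the interval mentioned in the question.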
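[31] Automatic Labelling of Speech Translation Errors cs.CL
Dominik Macháček, Maike Züfle, Ondrej Klejch
TL;DR: This paper proposes Speech Translation Error Labelling (STEL) to address the lack of established methods for evaluating confidence and quality estimation in speech translation systems. The authors build an annotation protocol and an authentic end-to-end evaluation dataset, and analyse how text-only and multimodal systems perform on the STEL task.
Details
Motivation: Speech translation errors reduce system trustworthiness and can have serious consequences, yet there is currently no established methodology for evaluating confidence and quality estimation of speech translations.
Result: Experiments show that the text-only system XCOMET and the multimodal LLM Qwen2.5-Omni reach roughly half of human precision on the STEL task. The study finds that direct speech processing is necessary for STEL, and that current text-only and speech-processing systems are complementary in labelling translation-only versus speech-processing errors.
Insight: The novelty lies in establishing the first evaluation framework for speech translation error labelling, revealing the performance differences and complementarity between multimodal and text-only systems, and providing a new benchmark for speech translation quality assessment.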
Abstract: Errors in speech translations reduce the trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet there is currently no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol and a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and the multimodal LLM Qwen2.5-Omni perform the STEL task with roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.
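[32] SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization cs.CL
Qi Zhang, Zhaopeng Feng, Xiaonan Shi, Xiaomeng Hu, Chu Liu
TL;DR: SkillComposer is a framework for evolving agent skills that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via systematic rejection sampling, it enables language models to self-evolve skills at inference time and supports offline, online, and hybrid deployment modes, aiming to balance task-specific skills against generalization.
Details
Motivation: Current skill construction methods treat the problem as one-shot extraction and overlook a fundamental tension: a skill overfitted to a specific task fails to transfer, while an overly abstract skill gives insufficient guidance. The paper aims to fix this by introducing explicit mechanisms for skill specification and generalization.
Result: Comprehensive experiments on τ²-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Its 4B-parameter model improves a 27B executor by up to +4.5 points on agent tasks and +3.4 points on code tasks, and generalizes to domains and task types unseen during training.
Insight: The core innovation is decomposing skill construction into three learnable operations and proposing a training and deployment framework that supports skill evolution. Analysis shows that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.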
Abstract: Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental tension: a skill tailored to a specific task fails to transfer, while an abstracted skill often provides insufficient guidance. We attribute this fragility to the absence of explicit mechanisms for skill specification and generalization. To address this gap, we introduce SkillComposer, a framework that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via a systematic rejection sampling recipe, SkillComposer enables language models to self-evolve skills at inference time and supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. Comprehensive experiments on τ²-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Our SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks, while generalizing across domains and task types unseen during training. Analysis reveals that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.
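[33] IR3DE: A Linear Router for Large Language Models cs.CL | cs.LG
Eros Fanì, Oğuzhan Ersoy
TL;DR: This paper proposes IR3DE, a ridge-regression-based linear router that selects the most suitable domain-expert model for each large language model (LLM) prompt. Evaluated in causal language modeling and reasoning settings, it performs on par with or better than the baselines, and it supports adding and removing domain experts dynamically without retraining the router.
Details
Motivation: Existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expert routing. IR3DE aims to provide cheap, fast routing decisions among domain experts.
Result: In the two causal language modeling (CLM) settings, IR3DE performs comparably to the baselines. In the reasoning setting, it reaches a normalized performance of 98.4%, surpassing the other baselines.
Insight: The novelty is using linear ridge regression for efficient routing while supporting dynamic changes to the set of domain experts without retraining the router, which lowers deployment and maintenance costs.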
Abstract: Foundational Large Language Models (LLMs) demonstrate proficiency on a wide range of general tasks, and achieve remarkable results on various specialized tasks via domain-expert LLMs. With the ever-growing list of available LLMs, inference routers are being proposed to select the most appropriate LLM for each prompt. However, existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expertise routing. In this paper, we propose IR3DE, a Ridge Regression-based Router for Domain Experts that provides cheap and fast routing decisions for each prompt. We evaluate IR3DE in two Causal Language Modeling (CLM) settings where the tasks are next-token prediction for all domains, and one reasoning setting where each domain has its own distinct reasoning task. Despite being a linear router, IR3DE achieves performance comparable to the other baselines in both CLM settings and surpasses them in the reasoning setting, with a normalized performance of 98.4%. Moreover, IR3DE enables the addition or removal of domain experts without requiring the router to be retrained from scratch, allowing a dynamic set of LLMs to be served with minimal disruption to the router itself. Our code is available at: github.com/gensyn-ai/IR3DE.
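To make the "add or remove experts without retraining from scratch" property concrete, here is a minimal sketch of a ridge-regression router. It is not the IR3DE implementation: the `RidgeRouter` class, the per-expert score targets, and the synthetic embeddings are all assumptions made for illustration. The sketch relies only on a standard fact about ridge regression: the regularized Gram matrix depends on the prompt embeddings alone, so each expert's weight vector can be solved for independently.

```python
import numpy as np


class RidgeRouter:
    """Toy linear router: one ridge-regression weight vector per expert."""

    def __init__(self, embeddings, lam=1e-3):
        # embeddings: (n_prompts, d) prompt embeddings, assumed precomputed.
        self.X = embeddings
        d = embeddings.shape[1]
        # The regularized Gram matrix depends only on the prompts,
        # so every expert shares it.
        self.gram = embeddings.T @ embeddings + lam * np.eye(d)
        self.weights = {}

    def add_expert(self, name, scores):
        # scores: (n_prompts,) hypothetical quality of this expert per prompt.
        # Closed-form ridge solution for this expert's column only.
        self.weights[name] = np.linalg.solve(self.gram, self.X.T @ scores)

    def remove_expert(self, name):
        # Dropping an expert just deletes its column.
        del self.weights[name]

    def route(self, x):
        # Send the prompt to the expert with the highest predicted score.
        return max(self.weights, key=lambda n: float(x @ self.weights[n]))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
router = RidgeRouter(X)
router.add_expert("math", X[:, 0])  # synthetic: "math" does well when feature 0 is high
router.add_expert("code", X[:, 1])
math_before = router.weights["math"].copy()
router.add_expert("law", X[:, 2])   # added later; existing columns are untouched
```

In this toy setup, adding the "law" expert costs one extra linear solve and leaves the "math" and "code" weight vectors unchanged.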
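[34] Harnessing Structural Context for Entity Alignment Foundation Models cs.CL | cs.AI
Xingyu Chen, Yuanning Cui, Zequn Sun, Wei Hu
TL;DR: This paper proposes ContextEA, an enhanced encoder-decoder framework that improves how entity alignment foundation models use structural context. It introduces a cross-KG interaction encoder that unifies the two knowledge graphs and a structural calibration decoder that calibrates alignment scores with multi-level structural evidence, yielding stronger transfer to unseen knowledge graph pairs.
Details
Motivation: Existing entity alignment foundation models have weak cross-graph interaction during encoding, and their final candidate ranking relies too heavily on coarse similarity, so structural context is underused.
Result: Experiments on 29 datasets from OpenEA, SRPRS, and DBP show that ContextEA consistently outperforms strong transferable baselines. The pretrained model surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen knowledge graphs.
Insight: The novelty lies in explicitly exploiting structural context, through cross-graph interaction encoding and structural calibration decoding, to strengthen entity alignment foundation models. It is a lightweight and effective direction that improves transferability and alignment accuracy.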
Abstract: Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.
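[35] The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models cs.CL
Jinyang Zhang, Hongxin Ding, Yue Fang, Weibin Liao, Muyang Ye
TL;DR: This paper proposes the $\ell_2$ norm of hidden states as an endogenous signal of reasoning intensity in large language models (LLMs) and establishes a theoretical link between it and the activation of reasoning features in sparse autoencoders (SAEs). Building on this, it introduces three training-free, $\ell_2$-norm-guided test-time scaling techniques that improve reasoning performance.
Details
Motivation: Current research lacks a model-intrinsic, theoretically grounded signal that captures the layer-wise reasoning dynamics of LLMs. This paper aims to fill that gap by identifying an endogenous indicator of reasoning intensity.
Result: Experiments show that the $\ell_2$-norm-based techniques (Adaptive Layer-wise Reasoning Recursion, Endogenous Reasoning State Steering, and $\ell_2$-guided Response Selection) significantly improve reasoning performance across model architectures and benchmarks.
Insight: The core contribution is discovering, and supporting with theory, that the $\ell_2$ norm of hidden states is a reliable signal of an LLM's internal reasoning intensity, and using it to design simple, effective test-time interventions. This offers a new, theoretically grounded lens for understanding and controlling latent reasoning dynamics in LLMs.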
Abstract: Recent work has sought to understand reasoning in Large Language Models (LLMs), yet a principled, model-intrinsic signal that captures their layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the $\ell_2$ norm of hidden states serves as an endogenous signal of the model’s reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs’ internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model’s latent geometry and theoretically prove that the $\ell_2$ norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the $\ell_2$ norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by $\ell_2$ norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) $\ell_2$-guided Response Selection, which require no additional training or data and are compatible with advanced inference engines. Experiments across model architectures and benchmarks show that $\ell_2$-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at https://github.com/zjy1298/The-Tell-Tale-Norm.
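As a rough illustration of the signal the abstract describes, the sketch below computes per-layer, per-token $\ell_2$ norms from a stack of hidden states and uses them to pick among candidate responses. The abstract does not specify how $\ell_2$-guided Response Selection scores candidates, so the rule used here (mean norm at one layer, highest wins) is purely an assumption, as are the function names and the synthetic arrays.

```python
import numpy as np


def layerwise_l2_norms(hidden_states):
    # hidden_states: (n_layers, n_tokens, d_model).
    # Returns an (n_layers, n_tokens) array of per-token l2 norms.
    return np.linalg.norm(hidden_states, axis=-1)


def select_response_by_norm(candidates, layer=-1):
    # candidates: list of hidden-state stacks, one per sampled response.
    # Assumed scoring rule: mean token norm at the chosen layer.
    scores = [float(layerwise_l2_norms(h)[layer].mean()) for h in candidates]
    return int(np.argmax(scores)), scores


# Synthetic hidden states: every vector in h_a has norm 2, in h_b norm 4.
h_a = np.ones((2, 3, 4))
h_b = 2.0 * np.ones((2, 3, 4))
best, scores = select_response_by_norm([h_a, h_b])
```

With real models the hidden states would come from the inference engine (for example, per-layer outputs of a forward pass); the norm computation itself is unchanged.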
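[36] FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays cs.CL | cs.HC
Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August
TL;DR: This paper introduces FOXGLOVE, a dataset of feedback comments on argumentative essays written by experts and by several frontier large language models (LLMs), built to compare the two systematically on goal-orientation, anchoring, and prioritization. Experts and LLMs distribute feedback similarly across goals and essay positions but diverge on the specific sentences they anchor to. LLM feedback is typically longer and more complex and is rated higher on most quality dimensions, though part of this advantage is attributable to comment length.
Details
Motivation: LLMs are increasingly used to generate writing feedback, yet there is no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring, and prioritization.
Result: On FOXGLOVE, experts rate LLM feedback higher on most quality dimensions, but much of this advantage is attributable to longer comments. Experts and LLMs distribute feedback similarly across goals and essay positions yet disagree on which specific sentences to anchor feedback to.
Insight: The contribution is a dataset for systematically comparing human expert and LLM writing feedback. It characterizes the length and complexity of LLM feedback and how it differs from human feedback in anchoring to specific sentences, providing an empirical basis for understanding the potential and limits of LLM feedback.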
Abstract: While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.
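[37] EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading cs.CL
Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews
TL;DR: This paper proposes Evidence-Diagnosed Intervention Training (EDIT), a framework for making large language model graders more faithful to the rubric. It has two phases: EDIT-SFT locates problematic reasoning steps using internal model signals (posterior belief over the final mark and input-grounding scores) and revises them locally; EDIT-RL calibrates the grader with belief-guided reward shaping that penalizes harmful belief drift.
Details
Motivation: Existing credit-assignment and intervention methods target self-contained tasks such as mathematical reasoning. In grading, which must be grounded in the mark scheme and evidence from the student answer, they struggle to locate reasoning errors or track how the model's belief about the final mark changes, so a training method for more rule-faithful LLM graders is needed.
Result: On two real-world, multi-subject grading benchmarks, EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits. Ablation studies confirm the effectiveness of internal-state diagnostics.
Insight: The novelty lies in using internal model signals such as posterior belief for diagnosis and intervention, together with a belief-guided reward shaping mechanism. This offers a new approach to improving LLM faithfulness and interpretability in complex, evidence-dependent decision tasks.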
Abstract: Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model’s belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.
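[38] Emergent Language as an Approach to Conscious AI cs.CL | cs.AI | cs.MA | cs.NE
Zengqing Wu, Chuan Xiao
TL;DR: This paper proposes a generative methodology for studying AI consciousness through emergent language in multi-agent reinforcement learning. Agents start from scratch (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure, so that observed structure can be attributed to task demands rather than inherited human language priors. As a proof of concept, the authors instantiate the methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit.
Details
Motivation: Existing approaches to assessing AI consciousness, whether discriminative (evaluating systems against theory-derived checklists) or architectural (engineering consciousness-inspired modules directly), cannot determine whether observed structures are artifacts of human language priors. This paper proposes a generative methodology that uses emergent language to study consciousness-relevant structure causally, free of interference from those priors.
Result: As a proof of concept, the methodology is instantiated in a minimal environment. Agents develop self-referential communication, and an echo-mismatch detection circuit emerges that is not predicted by task structure or architecture alone but arises from a specific environmental affordance.
Insight: The novelty lies in proposing a generative, bottom-up emergent-language approach as a tool for studying AI consciousness, with communication developed from scratch to establish causal attribution. It offers a way to study consciousness-relevant structure such as self-reference without contamination from human language priors, and the emergent detection circuit illustrates the key role of environment complexity in shaping communication.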
Abstract: The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from a minimal state (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
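[39] CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments cs.CL
Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang
TL;DR: This paper presents CollabSim, a configurable simulation framework grounded in CSCW theory for systematically evaluating the collaborative competence of LLM agents in multi-agent systems. By controlling interaction conditions and probing agents' internal states, it analyzes how well agents establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment.
Details
Motivation: Failures of multi-agent systems (MAS) often stem from agents lacking collaborative competence rather than individual task-solving ability. Existing evaluations focus mainly on task outcomes or single-agent proficiency and lack a systematic analysis of collaborative competence, a gap that a CSCW-grounded methodology can fill.
Result: Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design, providing tools for quantitative and qualitative analysis of collaborative competence.
Insight: The novelty lies in bringing CSCW theory into MAS evaluation and analyzing collaborative competence systematically through controlled simulation and internal-state probing. The methodology gives multi-agent collaboration research theoretical grounding and a reproducible experimental framework, and helps clarify the collaborative mechanisms at work in agent interaction.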
Abstract: Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents’ ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents’ collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents’ internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.
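[40] Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions cs.CL | cs.MM | cs.SI
Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo
TL;DR: This paper proposes an auditing framework based on counterfactual context revision to assess the context sensitivity of large language models when they simulate the stances of social media users. Comparing text-only and multimodal (meme-based) context revision strategies, it finds that both induce effective and robust stance transitions across different polarization-preference mechanisms.
Details
Motivation: LLMs are widely used to simulate how social media users respond to online discussions, but it remains unclear whether these simulations reflect precise user-specific beliefs or are highly sensitive to semantically independent changes in context.
Result: Using average directional stance shift and stance transition rate as metrics, the study finds that both text-only and multimodal revision strategies produce effective and robust stance transitions across different polarization-preference mechanisms.
Insight: The novelty lies in using counterfactual context revision as an auditing framework and, for the first time, bringing multimodal (meme-based) context into the evaluation. The work exposes both the promise and the risk of using LLMs to simulate online opinion dynamics and provides a systematic way to assess model context sensitivity.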
Abstract: Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user’s stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user’s stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.
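[41] Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation cs.CL
Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich
TL;DR: This paper proposes a reinforcement learning (RL) approach to translating unseen languages given rich linguistic context. Using a surface-level translation metric (chrF) as the reward, it enables large language models (LLMs) to extract and apply relevant linguistic knowledge from context, producing better translations of completely unseen languages than in-context learning or supervised fine-tuning.
Details
Motivation: Existing methods for translating unseen or low-resource languages, such as continued training or encoding a grammar book in context, typically overfit specific languages and transfer poorly zero-shot. To translate extremely low-resource languages at scale, LLMs need the meta-skill of using in-context linguistic knowledge rather than memorizing specific languages.
Result: Despite the lightweight reward signal, RL-trained models effectively extract and apply relevant linguistic information from the provided context and translate completely unseen languages better than in-context learning or supervised fine-tuning.
Insight: The novelty lies in applying outcome-based RL to language learning from context, extending it beyond conventional reasoning tasks such as math and coding. This offers a new recipe for LLMs to learn languages from context.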
Abstract: Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.
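To show what a surface-level reward of this kind looks like, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch, not the metric implementation used in the paper: it averages n-gram precision and recall over orders 1 to `max_n` and combines them with an F-beta score, but it omits details of standard chrF tooling (such as whitespace handling and corpus-level aggregation). The function names and default parameters are assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    # Multiset of character n-grams; empty if the text is shorter than n.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf_reward(hypothesis, reference, max_n=6, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

In an RL loop, `chrf_reward(model_translation, reference_translation)` would be the scalar reward for each sampled translation; it needs only the reference string, which is what makes the signal lightweight.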
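[42] Latent Reasoning with Normalizing Flows cs.CL | cs.LG
Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang
TL;DR: This paper proposes NF-CoT, a latent reasoning framework based on normalizing flows for improving reasoning in large language models. It performs intermediate computation in compact continuous states, avoiding explicit chain-of-thought's reliance on discrete, serial text generation, while preserving autoregressive advantages such as left-to-right generation, probabilistic sampling, KV-cache decoding, and tractable likelihood estimation.
Details
Motivation: Explicit chain-of-thought forces intermediate computation through a discrete, serial text stream even when the underlying update is semantic, uncertain, or only partially formed, which limits the bandwidth and efficiency of reasoning.
Result: On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
Insight: The novelty lies in integrating a normalizing flow into the LLM backbone to define a probability model over compact continuous thoughts distilled from explicit CoT. Within a single causal stream, an NF head generates continuous-thought positions and the standard LM head generates text positions, which supports exact likelihood estimation and policy-gradient optimization in the latent reasoning space.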
Abstract: Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
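[43] Human Adults and LLMs as Scientists: Who Benefits from Active Exploration? cs.CL
Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen
TL;DR: Using a modified "blicket detector" task, this paper studies how active exploration affects adults' causal reasoning and compares humans with large language models on the same task. Active exploration substantially improves adults' reasoning about conjunctive causal rules, though conjunctive rules still require more tests than disjunctive ones. Some state-of-the-art LLMs approach human-level hypothesis inference accuracy but explore less efficiently and show a similar conjunctive-disjunctive performance gap.
Details
Motivation: Causal learning research has long found that adults struggle to identify conjunctive causal rules (where an effect requires multiple causes to be present simultaneously) while doing better with disjunctive rules, but these findings mostly rest on passive observation paradigms. This paper asks whether the "conjunctive handicap" persists when adults are given agency through active exploration.
Result: Under active exploration, adults' reasoning about conjunctive causal rules improves substantially, but inferring conjunctive rules still requires more tests than disjunctive rules. On the same task, some state-of-the-art LLMs approach human-level hypothesis inference accuracy but explore less efficiently and show the same conjunctive-disjunctive gap.
Insight: The contribution is bringing an active exploration paradigm to causal reasoning research, showing that humans can partly overcome the "conjunctive handicap" when they intervene freely. The human-LLM comparison offers a new perspective on human cognitive strengths (such as efficient exploration strategies) and model limitations, with implications for building AI systems whose causal reasoning is more human-like.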
Abstract: A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults’ conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.
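[44] You Only Index Once: Cross-Layer Sparse Attention with Shared Routing cs.CL | cs.AI | cs.LG
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei
TL;DR: This paper proposes cross-layer sparse attention (CLSA) to address the decoding-efficiency bottleneck in long-context inference for large language models (LLMs). Built on KV-sharing architectures such as YOCO, its core idea is to share both the KV cache and the routing index across cross-decoder layers: a single indexer computes token-level top-k selection once and the resulting index is reused across layers, preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead.
Details
Motivation: Long-context inference in modern LLMs, especially in reasoning-heavy settings that generate long chains of thought, is increasingly bottlenecked by decoding efficiency. Existing sparse attention methods face an efficiency-quality trade-off: structured block sparse methods accelerate more but lose noticeable quality, while token sparse methods are more accurate but deliver limited end-to-end speedup because top-k routing over the full cache remains expensive.
Result: Experiments on short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context.
Insight: The main contribution is an architecture that shares the routing index across layers, amortizing the expensive token-level routing computation over multiple layers and thereby improving end-to-end inference efficiency while keeping the quality advantage of token sparse attention. It offers a more complete architectural solution for long-context LLMs that advances model quality and inference efficiency together.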
Abstract: Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
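The "index once, reuse across layers" idea can be sketched for a single decoding step as follows. This is an illustration under simplifying assumptions, not the CLSA implementation: it uses one attention head, a dot-product indexer driven by a hypothetical `index_query`, and a single key/value cache shared by all layers (mirroring the KV-sharing setup the abstract builds on). All function names are made up for the example.

```python
import numpy as np


def topk_index(index_query, keys, k):
    # Score every cached token once and keep the k highest-scoring positions.
    scores = keys @ index_query
    return np.argpartition(scores, -k)[-k:]


def sparse_attention(query, keys, values, idx):
    # Softmax attention restricted to the selected token positions.
    logits = keys[idx] @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]


def decode_step(layer_queries, keys, values, index_query, k):
    idx = topk_index(index_query, keys, k)  # routing is computed once...
    outputs = [sparse_attention(q, keys, values, idx) for q in layer_queries]
    return outputs, idx                     # ...and reused by every layer


rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 8))     # shared KV cache: 16 tokens, key dim 8
values = rng.normal(size=(16, 4))
layer_queries = [rng.normal(size=8) for _ in range(3)]  # one query per layer
index_query = rng.normal(size=8)
outputs, idx = decode_step(layer_queries, keys, values, index_query, k=4)
```

In a per-layer token-sparse scheme, `topk_index` would run inside the loop, so the routing cost would scale with the number of layers; here it is paid once per step.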
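[45] Self-Augmenting Retrieval for Diffusion Language Models cs.CL | cs.AI | cs.LG
Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger
TL;DR: This paper proposes SARDI, a self-augmenting retrieval framework that improves discrete diffusion language models on retrieval-augmented generation. It uses the low-confidence lookahead tokens produced during denoising to guide retrieval dynamically, obtaining stronger evidence before the output is finalized. SARDI is training-free, applies to any reasoning-capable discrete diffusion language model, and improves both throughput and performance on multiple multi-hop QA benchmarks.
Details
Motivation: Discrete diffusion language models discard the low-confidence tokens produced during denoising, leaving them unused for retrieval-augmented generation. This paper uses a dynamic retrieval mechanism to obtain stronger evidence earlier and improve generation quality.
Result: Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.
Insight: The novelty lies in using, for the first time, the low-confidence tokens from a diffusion model's denoising process as a lookahead signal to guide retrieval dynamically. The framework is training-free and retriever-agnostic, offering a new approach to retrieval-augmented generation with diffusion models.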
Abstract: Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.
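[46] Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection cs.CL | cs.AI | cs.LG
Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang
TL;DR: This paper introduces OpAI-Bench, a benchmark for studying progressive human-AI co-editing of text. Starting from human-written documents, it constructs nine sequentially revised versions of each sample under predefined AI coverage levels and five representative edit operations, covering four domains and preserving complete authorship provenance at multiple granularities. Experiments show that AI-text detectability depends not only on the proportion of AI-edited content but also on edit operation, domain, and cumulative revision history, and that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints.
Details
Motivation: As AI writing assistants become integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated but result from progressive human-AI co-editing. Existing AI-text detection benchmarks focus mainly on final outputs and offer limited understanding of how AI authorship signals emerge, accumulate, or disappear over the revision process.
Result: Evaluation on OpAI-Bench with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors shows that AI-text detectability is affected by edit operation, domain, and revision history, and reveals non-monotonic detection patterns missed by existing benchmarks.
Insight: The novelty lies in an operation-guided, multi-granularity (document, sentence, token, span) benchmark of progressive human-to-AI text transformation that simulates realistic editing scenarios and analyzes when and how AI-assisted writing becomes detectable. By controlling AI coverage and edit operations, it provides a systematic testbed for detecting mixed-authorship text.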
Abstract: As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.
cs.CV
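[47] VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding cs.CV
Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan
TL;DR: This paper presents VideoKR, the first large-scale training corpus designed specifically to strengthen knowledge- and reasoning-intensive video understanding, with 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. The authors develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning while ensuring the difficulty, diversity, and reliability of the examples and their chain-of-thought rationales. They also curate VideoKR-Eval, a benchmark that requires genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Under a standard SFT→GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning.
Details
Motivation: Current video understanding models fall short on tasks that demand deep knowledge and complex reasoning, and there is no large-scale training data dedicated to knowledge- and reasoning-intensive video understanding. This paper aims to advance the area by building a high-quality, challenging video reasoning dataset.
Result: Under a standard SFT→GRPO post-training pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning and are evaluated on the newly curated, expert-annotated benchmark VideoKR-Eval, which demonstrates their effectiveness. The models also remain competitive on general video reasoning.
Insight: The core contribution is VideoKR, the first large-scale, high-quality training dataset focused on knowledge- and reasoning-intensive video understanding, together with its evaluation benchmark VideoKR-Eval. The human-in-the-loop, skill-oriented generation pipeline ensures difficulty, diversity, and reliability, giving future video reasoning research a high-quality data foundation and a methodological reference, and underscoring how much carefully designed data matters for model capability.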
Abstract: We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT$\rightarrow$GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.
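[48] Personal AI Agent for Camera Roll VQA cs.CV | cs.AI
Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li
TL;DR: This paper studies visual question answering (VQA) over a personal camera roll, with the goal of building a conversational AI assistant that can access a user's camera roll and retrieve relevant photos to answer a range of questions. The authors collect and annotate the camroll dataset (50 users, 31,476 images, 2,500 QA pairs) and design camroll-agent, which is equipped with hierarchical memory and efficient navigation tools. Experiments show that the agent outperforms several baselines in long-context understanding.
Details
Motivation: The challenge is for an AI assistant to navigate and retrieve information efficiently from a personal camera roll, a vast, long-horizon, highly personalized stream of visual content, in order to answer questions ranging from simple facts to open-ended recommendations.
Result: On the newly collected camroll dataset, camroll-agent outperforms numerous baselines in long-context understanding, demonstrating its effectiveness.
Insight: The novelty lies in addressing long-context reasoning over personalized visual memory rather than standard textual memory, with a hierarchical memory architecture and a dedicated tool set. It highlights that handling visual details, consistency, and user-specific context calls for different approaches.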
Abstract: We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user’s personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines, including methods for long-context understanding and AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents’ long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.
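[49] Recovering Physically Plausible Human-Object Interactions from Monocular Videos cs.CV
Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos
TL;DR: This paper proposes RePHO, a method for reconstructing physically plausible human-object interactions (HOI) from monocular videos. Starting from a kinematic estimate, it trains a reinforcement learning policy to reproduce the interaction in a physics simulator and uses an adaptive sampling strategy to cope with kinematic noise, yielding physically consistent HOI sequences.
Details
Motivation: Existing kinematic-based methods produce visually plausible motion but often introduce physically implausible artifacts such as interpenetration and floating objects, so physical constraints are needed to make reconstructions physically plausible.
Result: On two standard HOI benchmarks, the method achieves clear improvements in physical plausibility metrics over state-of-the-art methods.
Insight: The novelty lies in combining kinematic estimation with reinforcement learning in a physics simulator, and in an adaptive sampling strategy with a dual self-updating mechanism that handles noisy data and progressively improves reconstruction quality.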
Abstract: In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: https://dingbang777.github.io/RePHO/
[50] Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation cs.CV | cs.AI | cs.MMPDF
Tobia Poppi, Silvia Cappelletti, Sara Sarto, Florian Schiffers, Garin Kessler
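TL;DR: This paper proposes a cross-model safety steering framework that represents safety control as a portable latent direction, learned once and reused across heterogeneous generators. A safety direction is learned in a source LLM, transported to the target generator through a lightweight alignment fitted on benign data only, and applied at inference time. Validated on text-to-image and text-to-video generation, the method reaches safety performance comparable to directions learned natively with unsafe data, without requiring any unsafe data on the target side.
Details
Motivation: Existing safety control methods are typically model-specific and require retraining or tailored interventions for each new architecture, so they lack portability. The paper asks whether safety can be represented as a shared latent direction that enables cross-model steering, with the goal of building lightweight, reusable safety mechanisms.
Result: Across diverse source-target model pairs in text-to-image and text-to-video generation, the transferred safety directions achieve attack success rate (ASR) reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model with unsafe data, without harming generation quality.
Insight: The novelty is the first framework for cross-model safety steering, which makes safety control portable through latent direction transfer and requires no target-side unsafe data. Viewed objectively, the analysis indicates that safety-relevant behavior is not purely model-local but can be controlled through representation geometry shared across models, which opens a path toward modular safety mechanisms.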
Abstract: Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.
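Sketch: the three steps in the abstract (estimate a direction in the source model, transport it, apply it at inference) could look roughly like the following. This is an illustration under stated assumptions, not the authors' implementation: the difference-of-means estimator, the given linear map W standing in for the alignment fitted on benign data, and the projection-removal steering rule are all hypothetical choices, as are the function names.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def safety_direction(unsafe_acts, safe_acts):
    """ASSUMPTION: the direction is the normalized difference of mean
    activations over unsafe and safe prompts in the source model."""
    mu_u, mu_s = mean_vec(unsafe_acts), mean_vec(safe_acts)
    return unit([u - s for u, s in zip(mu_u, mu_s)])

def transport(direction, W):
    """ASSUMPTION: the alignment is a linear map W (target_dim x source_dim),
    taken as given here; the paper fits its alignment on benign data only."""
    return unit([sum(w * d for w, d in zip(row, direction)) for row in W])

def steer(hidden, direction, alpha=1.0):
    """ASSUMPTION: steering removes alpha times the component of the
    hidden state along the (unit-norm) direction."""
    proj = sum(h * d for h, d in zip(hidden, direction))
    return [h - alpha * proj * d for h, d in zip(hidden, direction)]

# Toy usage with 2-dimensional activations.
d_src = safety_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
d_tgt = transport(d_src, [[0.0, 1.0], [1.0, 0.0]])  # W swaps the two axes
steered = steer([5.0, 7.0], d_tgt)
```

In this toy case the source direction lies on the first axis, the map moves it to the second axis of the target space, and steering zeroes the target hidden state's second coordinate while leaving the first untouched.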
[51] Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin cs.CVPDF
Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel
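TL;DR: This paper presents Biomazon, a multimodal benchmark dataset for forest structure and biomass modeling in the Amazon Basin. It pairs GEDI RH profile and AGBD targets with multi-sensor predictors and provides standardized spatial splits and evaluation protocols. Baseline experiments assess how model scale, modality contributions, and embeddings affect the joint prediction task, and compare performance against existing gridded products.
Details
Motivation: There is no ML-ready multimodal benchmark for jointly predicting the full GEDI RH profile and AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. Existing approaches typically predict canopy height proxies or AGBD as separate scalar targets instead of learning an ordered vertical structure profile.
Result: Using a shared encoder-decoder with task-specific heads as the baseline framework, the paper runs a comprehensive ablation study over backbone/model scale, modality contributions, and the use of auxiliary embeddings in standalone and fusion settings, and reports both single-target and joint-target results. It also puts baseline performance in context through regionally aligned comparisons against existing gridded products (such as GEDI L4D RH10-RH98 and AGBD) at matching temporal scale.
Insight: The main contribution is a multimodal benchmark dataset focused on jointly predicting the complete, ordered forest vertical structure profile (the GEDI RH profile) and aboveground biomass density (AGBD). The dataset emphasizes physical consistency (the ordering of RH percentiles) and integrates multiple remote sensing sources, providing a standardized evaluation platform for structurally consistent modeling of tropical forests and for carbon accounting research.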
Abstract: Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
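Sketch: "physically consistent ordering across RH percentiles" means a predicted profile should never decrease from one percentile to the next. One common way to guarantee this, shown below purely as an illustration, is to predict the lowest percentile directly and every later percentile as a positive increment. The softplus parameterization and the function names are assumptions, not the benchmark's prescribed method or metric.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def monotone_profile(raw):
    """Map unconstrained model outputs to a non-decreasing RH profile.

    raw[0] is used as the lowest percentile's height as-is (it may be
    negative); each later raw value becomes a positive increment.
    """
    heights = [raw[0]]
    for r in raw[1:]:
        heights.append(heights[-1] + softplus(r))
    return heights

def ordering_violations(profile):
    """Count adjacent percentile pairs where the higher percentile is lower."""
    return sum(1 for lo, hi in zip(profile, profile[1:]) if hi < lo)

# Toy usage: four raw outputs standing in for four RH percentiles.
profile = monotone_profile([-1.0, 0.0, 2.0, -3.0])
```

A violation counter like `ordering_violations` can also be applied to the output of an unconstrained model to check how often its predicted profiles break the ordering.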
[52] Deep Learning-assisted AMD Staging based on OCT and OCT Angiography cs.CVPDF
Yukun Guo, Tristan T. Hormel, An-Lun Wu, Liqin Gao, Min Gao
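TL;DR: This study develops and evaluates deep learning models that automatically grade the severity of age-related macular degeneration (AMD) from optical coherence tomography (OCT) and OCT angiography (OCTA) data. It compares models built on three input modalities (biomarker maps, 2D projections, and 3D volumes) and finds that all perform well, with the biomarker-based model giving the best overall performance and a particular advantage in early AMD detection.
Details
Motivation: Manual grading of AMD severity is time-consuming and subjective; the study uses OCT and OCTA data to develop automated, accurate deep learning models to assist diagnosis.
Result: All models perform strongly on AMD staging, with substantial agreement with the reference standard (QWK >= 0.83). The biomarker-based model achieves the highest overall performance (QWK = 0.85 +/- 0.03) and the best F1-score for early AMD detection (0.59 +/- 0.14).
Insight: The novelty is a systematic comparison of three data representations (biomarker maps, 2D projections, 3D volumes) for AMD staging, showing that biomarker maps built from segmented pathological features (such as fluid and drusen) are an efficient and clinically interpretable input that improves detection of early lesions.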
Abstract: To develop and evaluate deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. Two hundred seventy-one participants aged >= 50 years with varying AMD severities. Central macular 6 x 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (QWK >= 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 +/- 0.03, mean +/- standard deviation) and the best detection of early AMD (F1-score = 0.59 +/- 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 +/- 0.04 vs. 0.83 +/- 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 +/- 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
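Sketch: the agreement figures above are quadratic weighted kappa (QWK) values. For readers unfamiliar with the metric, the standard definition can be computed as below; the function name and the integer stage encoding (0 = No AMD through 3 = Advanced AMD) are illustrative assumptions, and the study's own evaluation code is not shown in the abstract.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Standard QWK: 1 - sum(w * observed) / sum(w * expected), where
    w[i][j] = (i - j)^2 / (n_classes - 1)^2 penalizes distant disagreements
    more heavily. Undefined (division by zero) if all labels and all
    predictions fall in one single class."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

# Toy usage with four ordinal stages encoded 0..3.
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

Because the weights grow with the squared distance between stages, confusing Early with Intermediate AMD costs less than confusing No AMD with Advanced AMD, which suits an ordinal severity scale.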
[53] Would you still call this Dax? Novel Visual References in VLMs and Humans cs.CV | cs.CLPDF
Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer
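TL;DR: This paper introduces the Novel Visual References Dataset (NVRD), containing 19,176 images spanning 90 concepts at different levels of visual novelty, to study how vision-language models (VLMs) and humans learn new visual concepts. It finds that models struggle to learn new concepts in-context when they conflict with pre-training priors, and that models overgeneralize, extending learned labels to stimuli that humans reject.
Details
Motivation: How VLMs map novel visual concepts to language after exposure remains underexplored, especially when those concepts contradict prior knowledge from pre-training.
Result: Three open-source and two closed-source models are evaluated on NVRD and compared against 2,400 human judgments. Models struggle to acquire new concepts that conflict with prior knowledge, and although models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize.
Insight: The novelty is NVRD, a dataset of entirely novel, open-ended stimuli that mirrors how humans encounter genuinely new concepts, providing a corpus and benchmark for research on visual concept learning in humans and machines. Viewed objectively, the study exposes limitations of VLMs in concept learning, particularly their tendency to overgeneralize, which is informative for improving model robustness.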
Abstract: Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.
[54] Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification cs.CVPDF
Feixiang Zhou, Jianyang Xie, Zhuangzhi Gao, Qinkai Yu, Fu Wang
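TL;DR: This paper proposes DFPL, a new framework for the missing-modality problem in image-tabular multimodal learning. Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse prototypes, a Prototype-guided Fine-grained Alignment (PFA) module enforces distribution matching and semantic alignment within a unified prototype space, and a Class-aware Multi-scale Aggregation (CMA) module adaptively aggregates features to make classification more robust when modalities are missing.
Details
Motivation: Image and tabular data are highly heterogeneous, differing substantially in semantic granularity and data distribution, which makes missing modalities hard to handle. Existing methods capture only coarse cross-modal consistency and overlook fine-grained semantic and distributional misalignment, which limits the use of complementary cues.
Result: Extensive experiments on three diverse image-tabular benchmarks show that the method outperforms previous approaches under various missing-modality settings.
Insight: The contributions are: fine-grained prototype learning and disentanglement through shared-specific prototype modeling, which suppresses redundant intra-modality correlations; a prototype-guided fine-grained alignment module that handles distribution matching and semantic alignment jointly; and a class-aware multi-scale aggregation module that adaptively integrates shared semantics and modality-specific characteristics at the global and prototype levels for more robust predictions.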
Abstract: The missing-modality problem poses a significant challenge in image-tabular multimodal learning across a wide range of multimedia applications, including product understanding, recommendation systems, and medical diagnosis. This challenge is particularly pronounced when the two modalities are highly heterogeneous, as images and tabular attributes differ substantially in their semantic granularity and data distributions. Existing methods learn modality-invariant representations through disentanglement and alignment over global token-averaged features, capturing only coarse cross-modal consistency and overlooking fine-grained semantic and distributional misalignment, which hampers the exploitation of complementary cues under missing modalities. To address this, we propose DFPL, a novel framework for fine-grained prototype learning. Specifically, Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse shared and modality-specific prototypes, and further performs prototype-level disentanglement to suppress redundant intra-modality correlations. Additionally, we propose a Prototype-guided Fine-grained Alignment (PFA) module that jointly enforces prototype-level distribution matching and prototype-to-class semantic alignment within a unified prototype space, thereby preserving both fine-grained distributional and semantic consistency across modalities. We further introduce a Class-aware Multi-scale Aggregation (CMA) module to adaptively aggregate shared semantics and modality-specific characteristics from global and prototype levels for robust predictions. Extensive experiments on three diverse image-tabular benchmarks demonstrate the superiority of our method compared to the previous approaches under various missing-modality settings. Code will be made publicly available.
[55] Horse Eye Blink Detection and Classification for Equine Affective State Assessment cs.CVPDF
João Alves, Signe Møller-Skuldbøl, Pia Haubro Andersen, Rikke Gade
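TL;DR: This paper develops and evaluates three methods for automatically detecting and classifying blinks in horse videos, in support of equine affective state assessment. The methods are a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, all tested on a publicly available dataset.
Details
Motivation: Automated detection of equine facial action units (AUs) is a promising but under-explored route to assessing pain and affective state in horses. Blinks are micro-expressions whose subtlety makes them easy to miss with the naked eye, so reliable automated detection from video is needed.
Result: On the public dataset, the methods reach a macro-F1 score of 0.898 for blink classification and 0.926 for binary blink detection, indicating that the approach is effective.
Insight: The novelty is applying computer vision techniques (YOLOv12, optical flow, and VideoMAE) to the fine-grained task of equine micro-expression detection, offering an automated approach to animal welfare monitoring while highlighting the challenges inherent to the domain.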
Abstract: Automated detection of equine facial action units (AUs) is a promising yet under-explored avenue for pain and affective state assessment in horses. Half and full-blink movements are recognised indicators of pain and stress, but as micro-expressions, their subtle, fine-grained nature makes them easily missed by the naked eye and only discernible through frame-by-frame video inspection, making reliable automated detection from video a particularly demanding task. We develop and evaluate three methods for automated blink classification from horse videos: a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, tested on a publicly available dataset. We achieve a macro-F1 score of 0.898 when doing blink classification and 0.926 on binary blink detection. Our results highlight both the potential and the inherent challenges of fine-grained AU detection for equine welfare monitoring.
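Sketch: of the three methods, optical flow magnitude thresholding is the simplest to illustrate. The version below assumes a per-frame motion magnitude for the eye region has already been computed by some dense optical flow method, and flags runs of consecutive frames above a threshold as candidate blink events. The threshold, the minimum run length, and the function name are hypothetical; the paper's actual rule and its half/full blink classification are not described in the abstract.

```python
def detect_events(magnitudes, threshold, min_len=2):
    """Return (start_frame, end_frame) pairs for runs of at least min_len
    consecutive frames whose flow magnitude exceeds the threshold."""
    events, start = [], None
    for i, m in enumerate(magnitudes):
        if m > threshold and start is None:
            start = i                      # a run begins
        elif m <= threshold and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                events.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the clip.
    if start is not None and len(magnitudes) - start >= min_len:
        events.append((start, len(magnitudes) - 1))
    return events

# Toy usage: ten frames of mean eye-region flow magnitude (made-up values).
flow = [0.1, 0.2, 1.5, 1.8, 1.2, 0.1, 0.9, 0.1, 1.1, 1.3]
events = detect_events(flow, threshold=1.0)
```

The minimum run length is there to suppress single-frame spikes, which in practice could come from head motion or compression noise rather than eyelid movement.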
[56] Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning cs.CVPDF
Deepika SN Vemuri, Sayanta Adhikari, Ankit Saha, Krishn Vishwas Kher, Vineeth N Balasubramanian
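TL;DR: This paper proposes using concept lattices from Formal Concept Analysis as semantic scaffolds that guide neural networks to learn hierarchical concept representations, improving interpretability and alignment with human semantic understanding. Concepts are assigned to different network layers according to their level of generality, producing staged, semantically grounded representations.
Details
Motivation: Existing concept-based models typically treat concepts as a flat, unstructured set learned at a single network layer, ignoring the general-to-specific hierarchical organization that characterizes human semantic understanding.
Result: Experiments on real-world datasets show that the method produces more interpretable embeddings, supports more effective interventions, and learns concept representations that are both meaningful and hierarchically structured.
Insight: The core idea is to use formal concept lattices as principled semantic scaffolds for neural network learning, aligning a concept's level of generality with network depth so that the model learns hierarchical semantic representations. This offers a new direction for building AI models that are more interpretable and closer to human cognition.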
Abstract: Learning semantics is essential for deep learning models to be interpretable and better aligned with human reasoning. Concept-based models approach this by representing classes through meaningful semantic abstractions, but typically treat all concepts as a flat, unstructured set learned at a single neural network layer. This overlooks a fundamental property of human semantic understanding: concepts being organized hierarchically, from general to specific. While deep networks do learn a hierarchy of visual features, this structure is rarely aligned with explicit semantic hierarchies. Drawing on Formal Concept Analysis, we demonstrate that formal concept lattices provide principled semantic scaffolds to guide neural network learning. These lattices naturally identify where in the network concepts should be learned based on their level of generality. This allows the model to develop staged, semantically grounded representations throughout its depth. Empirical results on real-world datasets show that our models produce more interpretable embeddings, support more effective interventions, and learn concept representations that are both meaningful and hierarchically structured.
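Sketch: a formal concept is a pair (extent, intent) in which the extent is exactly the set of objects sharing every attribute in the intent, and the intent is exactly the set of attributes shared by every object in the extent. The brute-force enumeration below works for small contexts and orders concepts from general to specific. The toy context, the function names, and the use of intent size as a proxy for "level of generality" are illustrative assumptions, not the paper's procedure for mapping concepts to layers.

```python
from itertools import combinations

def extent(context, attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, a in context.items() if attrs <= a)

def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs."""
    shared = set(all_attrs)
    for o in objs:
        shared &= context[o]
    return frozenset(shared)

def formal_concepts(context):
    """Enumerate all formal concepts by closing every attribute subset.
    Exponential in the number of attributes; fine for toy contexts only."""
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for r in range(len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), r):
            objs = extent(context, frozenset(combo))
            concepts.add((objs, intent(context, objs, all_attrs)))
    return concepts

def by_generality(concepts):
    """ASSUMPTION: fewer attributes in the intent means a more general concept."""
    return sorted(concepts, key=lambda c: len(c[1]))

# Toy context: object -> set of attributes (made up for illustration).
ctx = {
    "sparrow": {"animal", "bird", "flies"},
    "penguin": {"animal", "bird"},
    "dog": {"animal"},
}
levels = by_generality(formal_concepts(ctx))
```

Here the lattice is a chain of three concepts: "animal" (all three objects), then "animal, bird" (sparrow and penguin), then "animal, bird, flies" (sparrow only). Under the paper's idea, the more general concepts would be supervised at earlier stages of the network than the more specific ones.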
[57] LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval cs.CV | cs.DBPDF
Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari
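TL;DR: This paper proposes a phase-aware index optimization method driven by a large language model (LLM) agent to jointly optimize the highly coupled parameters of modern multi-stage retrieval systems. Evaluated on the HICO-DET human-object interaction retrieval benchmark with Intel VDMS, it substantially outperforms traditional hyperparameter optimization methods under SIEVE, a quality-constrained throughput metric, and its effectiveness and transferability are validated on datasets with different degrees of coupling and on a different vector database platform.
Details
Motivation: Traditional hyperparameter optimization methods (such as TPE and Bayesian optimization) rely on a parameter independence assumption and cannot effectively optimize the highly coupled joint configuration spaces of modern multi-stage retrieval systems.
Result: On HICO-DET, the proposed LLM agent outperforms Optuna TPE by 33.3% and VDTuner by 34.2% under SIEVE, and delivers a 15.3x throughput gain over UniIR. On GLDv2 and SIFT1M, the gap to traditional methods narrows as parameter coupling decreases. In cross-platform validation on Milvus, the agent also ranks first.
Insight: The novelty is an LLM agent that conditions its search on the full optimization history, combined with a phased strategy (exploration, exploitation, fine-tuning) for navigating coupled parameter spaces. This offers a new, transferable approach to automatic parameter tuning for complex, coupled systems.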
Abstract: Retrieval systems underpin modern AI applications – spanning visual search, recommendation engines, and multi-modal question answering. Modern multi-stage retrieval systems require the joint optimization of highly coupled parameters, yet traditional hyperparameter optimization (HPO) methods – including Tree-structured Parzen Estimators (TPE) and Gaussian Process Bayesian Optimization – rely on an independence assumption that fundamentally prevents them from navigating these coupled configuration spaces. We address this limitation with a phase-aware large language model (LLM) agent that conditions each proposal on its full optimization history, navigating the coupled parameter space across phase-partitioned exploration, exploitation, and fine-tuning stages. Evaluated on the HICO-DET human-object interaction retrieval benchmark using Intel VDMS (Visual Data Management System), our agent outperforms Optuna TPE by +33.3% and VDTuner by +34.2% under SIEVE (Safeguarded Index Evaluation of Vector-search Efficiency, a quality-constrained throughput metric), delivering a 15.3x throughput gain over UniIR. Validation across three benchmarks confirms that the agent’s advantage grows with the degree of parameter coupling: +33.3% on HICO-DET (high coupling), methods converge within 1% on GLDv2 (moderate coupling) and within 3.6% on SIFT1M (near-independent control). Cross-system validation on Milvus confirms the optimizer ranks first on all three datasets without modification, demonstrating transferability across vector database management system (VDBMS) platforms.
[58] Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers cs.CV | cs.ROPDF
Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle
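TL;DR: This paper proposes a framework for unpaired RGB-thermal multi-modal novel view synthesis. A visual geometric transformer estimates camera poses independently for each modality, the pose sets are aligned with the Procrustes algorithm, and a multi-modal 3D Gaussian Splatting method built on the aligned poses learns directly from unpaired images.
Details
Motivation: Existing multi-modal novel view synthesis methods that combine RGB and thermal imagery typically rely on precisely calibrated image pairs or stereo setups, which limits scalability and practical deployment. The paper addresses joint 3D scene reconstruction from unpaired multi-modal data.
Result: Experiments on diverse scenes show that the method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity.
Insight: The core novelty is using the VGGT architecture to estimate poses independently and aligning them through cross-modal feature matching, enabling joint registration without paired calibration. The paper also introduces a benchmarking framework to rigorously evaluate per-modality image synthesis and the multi-modal coherence of reconstructed scenes.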
Abstract: Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
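Sketch: Procrustes alignment finds the similarity transform (rotation, uniform scale, translation) that best maps one point set onto another in the least-squares sense. The paper aligns 3D camera pose sets using correspondences from a cross-modal feature matcher; the version below is a deliberately simplified 2D analogue with known correspondences, written with complex numbers so that no linear algebra library is needed. The function names and toy points are assumptions for illustration only.

```python
def procrustes_2d(src, dst):
    """Least-squares similarity transform dst ~ a * src + b in the plane.

    Points are treated as complex numbers; a encodes rotation and scale,
    b encodes translation. Reflections are not modeled.
    """
    n = len(src)
    zs = [complex(x, y) for x, y in src]
    zd = [complex(x, y) for x, y in dst]
    cs, cd = sum(zs) / n, sum(zd) / n          # centroids
    num = sum((d - cd) * (s - cs).conjugate() for s, d in zip(zs, zd))
    den = sum(abs(s - cs) ** 2 for s in zs)
    a = num / den
    b = cd - a * cs
    return a, b

def apply_transform(a, b, points):
    """Apply z -> a * z + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        z = a * complex(x, y) + b
        out.append((z.real, z.imag))
    return out

# Toy usage: dst is src rotated by 90 degrees, scaled by 2, shifted by (1, 1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(1.0, 1.0), (1.0, 3.0), (-1.0, 1.0)]
a, b = procrustes_2d(src, dst)
aligned = apply_transform(a, b, src)
```

In three dimensions the rotation is usually recovered from a singular value decomposition of the cross-covariance matrix instead of a single complex ratio, but the centering, scaling, and translation steps follow the same pattern.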
[59] Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning cs.CVPDF
Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli
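TL;DR: This paper proposes a visual representation learning framework for PointGoal navigation in which a privileged sensor (LiDAR) guides contrastive learning, so that the visual encoder captures navigation-relevant geometric structure rather than scene-specific appearance. Representation learning is decoupled from policy optimization: the pretrained encoder serves as the perceptual backbone for reinforcement learning, and at deployment the agent uses monocular RGB observations without LiDAR.
Details
Motivation: Visual representations for PointGoal navigation are sensitive to changes in scene appearance and semantics. The paper aims to learn geometry-aware visual features that are more robust for navigation and generalize better across scenes.
Result: In high-fidelity simulation, the method significantly improves policy-level scene transfer across diverse indoor and outdoor environments, outperforming large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts.
Insight: The novelty is guiding the contrastive objective with privileged LiDAR sensing through a geometry-aware similarity metric and adaptive temperature scaling, and introducing a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts, so that the learned features are task-relevant and robust.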
Abstract: We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:
[60] BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding cs.CVPDF
Muhammad Usama, Didier Stricker, Mohammad Sadil Khan, Muhammad Zeshan Afzal
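TL;DR: This paper presents BRepCLIP, the first framework to align the boundary representation (BRep) geometry of CAD models with language and image embeddings through contrastive pretraining. It models a CAD object as a sequence of face and edge tokens, with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors. The method substantially outperforms existing point-cloud-based approaches on several CAD datasets and proves useful for zero-shot classification and for evaluating generative models.
Details
Motivation: Representation learning for CAD models remains largely unsolved. While 3D representation learning has advanced on point clouds and meshes, the native CAD format, the boundary representation (BRep), which encodes exact parametric surfaces, curves, and their topology, has rarely been used as a substrate for representation learning. The paper aims to fill this gap.
Result: BRepCLIP substantially outperforms existing methods on several benchmarks: Top-1 retrieval accuracy improves over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and zero-shot Top-1 classification accuracy on FabWave improves by 15%.
Insight: The core novelty is the first application of contrastive multimodal pretraining to BRep representations: the CAD structure is decomposed into a discretized sequence of faces and edges with type and semantic descriptors, aggregated by a transformer encoder, and aligned with CLIP-style language and image embeddings. This underlines the importance of structure-aware pretraining on native CAD structure for multimodal understanding.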
Abstract: Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD, the boundary representation (BRep), which encodes exact parametric surfaces, curves, and their topology, has received little attention as a representation learning substrate. We introduce BRepCLIP, the first framework to align BRep geometry with language and image embeddings through contrastive pretraining. We model each CAD object as a sequence of face and edge tokens with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors that capture surface types (e.g., cylindrical, torus, NURBS) and curve primitives (e.g., line, arc, B-spline). A transformer encoder aggregates these tokens into a global BRep embedding, aligned with CLIP’s text and image encoders via a joint contrastive objective. BRepCLIP generates more discriminative and semantically grounded embeddings than existing point-based alternatives, improving Top-1 retrieval over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and improving zero-shot classification on FabWave by 15% in Top-1 score. We further demonstrate its utility as a CAD-aware similarity metric for evaluating text and image-conditioned CAD generation, establishing the importance of structure-aware pretraining for multimodal CAD understanding. The project page is available at https://muhammadusama100.github.io/BrepClip2026/
[61] Noise-Aware Visual Representation Learning for Medical Visual Question Answering cs.CV | cs.AIPDF
I Putu Adi Pratama, Bahadorreza Ofoghi, Atul Sajjanhar, Shang Gao
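TL;DR: This paper proposes a noise-aware visual representation learning framework for medical visual question answering (Med-VQA). A denoising autoencoder is inserted before visual embeddings are mapped into the LLM input space to learn robust, noise-insensitive visual representations, and LoRA is used for parameter-efficient fine-tuning.
Details
Motivation: Existing Med-VQA methods typically connect an off-the-shelf vision encoder to an LLM through a lightweight mapping network but overlook noise and small irrelevant variations in visual representations, which hurts robustness.
Result: On the SLAKE and PathVQA benchmarks, the method improves robustness to noisy input embeddings while maintaining competitive performance on clean data.
Insight: The novelty is introducing a pretrained denoising autoencoder to make visual representations more robust, combined with LoRA for efficient adaptation. This offers a way to improve the reliability of Med-VQA systems in noisy real-world clinical settings.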
Abstract: Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.
[62] Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LGPDF
Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib
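TL;DR: This paper introduces Almieyar-Oryx-BloomBench, a bilingual (English-Arabic) multimodal benchmark grounded in Bloom's Taxonomy that systematically evaluates the reasoning of vision-language models (VLMs) across six cognitive levels (Remember, Understand, Apply, Analyze, Evaluate, Create). Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, the benchmark is designed for scalability, cultural inclusivity, and linguistic fidelity. A comprehensive study of SOTA VLMs shows strong semantic understanding but substantial difficulty with factual recall and creative synthesis, along with a critical performance gap between English and Arabic.
Details
Motivation: The VLM field lacks benchmarks that rigorously diagnose true reasoning abilities and measure progress toward human-like intelligence. Most existing evaluations focus on piecemeal or isolated tasks, which obscures critical cognitive weaknesses and offers little guidance for targeted improvement.
Result: The evaluation of SOTA VLMs reveals a sharp cognitive asymmetry: models reach strong performance ceilings in semantic understanding but struggle substantially with factual recall and creative synthesis. The study also highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning.
Insight: The main contribution is the first bilingual multimodal benchmark that is grounded in Bloom's Taxonomy and cognitively systematic, giving a structured framework for evaluating the cognitive abilities of VLMs. Viewed objectively, the work brings a classic cognitive taxonomy from educational psychology into multimodal model evaluation and stresses cross-lingual and cultural inclusivity, laying a foundation for more cognitively aligned and inclusive VLMs.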
Abstract: Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom’s Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.
[63] UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning cs.CVPDF
Gexin Huang, Yanting Yang, Myeongkyun Kang, Beidi Zhao, Jun Zhou
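TL;DR: This paper presents UltraVR, a diagnostic benchmark for evaluating evidence-grounded reasoning by vision-language models on ultra-resolution images. It covers four high-value scenarios (surveillance, remote sensing, whole-slide pathology, and industrial anomaly detection) and uses structured chain-of-thought annotations for process-level diagnosis. It shows that current frontier models remain unreliable at ultra-resolution reasoning, failing mainly at evidence grounding and local perception.
Details
Motivation: Vision-language models perform well on standard visual question answering benchmarks, but their reasoning ability on ultra-resolution images, where key evidence is tiny, subtle, and spatially dispersed, is unclear. Existing evaluations report only final-answer accuracy and cannot reveal whether a model actually acquires and integrates the necessary visual evidence.
Result: Evaluating frontier vision-language models on UltraVR shows that current models are far from reliable at ultra-resolution reasoning. The structured annotations further localize the failures: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied.
Insight: The novelty is a diagnostic ultra-resolution image-VQA benchmark with structured chain-of-thought annotations that supports fine-grained decomposition and evaluation of a model's reasoning process. Its core value is moving evaluation from whether the answer is correct to where the reasoning process breaks, giving a new tool for diagnosing model capabilities. Viewed objectively, the benchmark's multi-scale, long-range, and fine-grained visual reasoning challenges are representative of their domains and complementary to one another.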
Abstract: Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.
[64] ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions cs.CV | cs.MMPDF
Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang
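TL;DR: This paper proposes a new task, Triple-Shot Compositions, which automatically crops a single human-centric image into three shots with narrative logic (establishing, medium, and close-up) and generates a description for each. To cope with limited expert annotations, the authors develop ShotCrop, trained with a three-stage pipeline: Chain-of-Thought supervised fine-tuning, semi-supervised fine-tuning, and a tailored Group Relative Policy Optimization. The paper also builds TSC-Bench, a benchmark of 1,200 expert-annotated test cases.
Details
Motivation: Prior work on aesthetic composition typically produces a single aesthetic crop, overlooking the value of composing multiple shots from one scene to build a visual narrative. In downstream creative workflows such as commercial posters, multi-shot composition (emphasizing context, subject, and emotion or product details, respectively) is critical for presenting key story beats.
Result: On the proposed TSC-Bench, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
Insight: The novelty is defining the Triple-Shot Compositions task and proposing a three-stage training framework that combines Chain-of-Thought fine-tuning, high-confidence pseudo-label generation based on MLLM scoring, aesthetic assessment, and CLIP similarity, and policy optimization with a tailored composite reward, which addresses the scarcity of high-quality annotated data.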
Abstract: Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set – establishing, medium, and close-up – from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
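Sketch: the defining step of Group Relative Policy Optimization is that each sampled output is scored relative to the other outputs sampled for the same prompt, by normalizing rewards within the group. The example below pairs that normalization with a box-overlap (IoU) reward for a predicted crop. The IoU reward is a hypothetical stand-in: the abstract does not specify the components of the composite GRPO-S reward, and the function names and toy boxes are assumptions.

```python
import statistics

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).
    eps keeps the division finite when all rewards in the group are equal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy usage: three candidate close-up crops scored against a reference crop.
reference = (0.0, 0.0, 2.0, 2.0)
candidates = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), (2.0, 2.0, 3.0, 3.0)]
rewards = [iou(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Candidates with above-average reward in their group receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.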
[65] BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection cs.CV | cs.MMPDF
Wenlin Liu, Xikun Hu, Ping Zhong
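TL;DR: This paper proposes BMCR (Backbone Module Composition via Reinforcement Learning) for remote sensing object detection. A reinforcement learning policy dynamically assembles input-adaptive inference paths from reusable modules decomposed from pretrained CNN and ViT backbones, so that inputs of different complexity can draw on both the local-detail strengths of CNNs and the global-context modeling of ViTs.
Details
Motivation: Existing remote sensing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and cannot adaptively exploit the complementary strengths of CNNs and ViTs across inputs of varying complexity. BMCR targets this limitation with input-adaptive backbone composition.
Result: On the DOTA-v1.0, DOTA-v1.5, and DIOR-R benchmarks, BMCR achieves 79.31%, 73.41%, and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.
Insight: The contributions are: (1) a toolbox of reusable modules decomposed from CNN and ViT backbones, each wrapped with metadata for compatibility-aware assembly; (2) a lightweight Optimal Transport (OT) based transition interface that bridges grid-based CNN features and token-based ViT representations; (3) a formulation of backbone composition as a sequential decision problem solved by a reinforcement learning policy network; and (4) an Adaptive Module Cooperative Optimization (AMCO) strategy that stabilizes the joint optimization of the reusable modules and the routing policy.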
Abstract: In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31%, 73.41% and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.
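The abstract does not spell out how the OT-based transition interface is built. As a rough illustration of what distribution-aware alignment between grid features and tokens could look like, the minimal sketch below runs entropic optimal transport (Sinkhorn iterations) between flattened CNN grid features and ViT tokens. The function name `sinkhorn_plan`, the uniform marginals, the normalized squared-Euclidean cost, and the value of `eps` are all assumptions made for illustration, not details from the paper.

```python
import numpy as np

def sinkhorn_plan(grid_feats, token_feats, eps=0.5, n_iters=200):
    """Entropic OT plan between grid cells (rows) and tokens (columns).

    Hypothetical helper: uniform marginals and a normalized squared-Euclidean
    cost are illustrative assumptions, not the paper's design.
    """
    n, m = grid_feats.shape[0], token_feats.shape[0]
    a = np.full(n, 1.0 / n)  # mass placed on each grid cell
    b = np.full(m, 1.0 / m)  # mass placed on each token
    # Pairwise squared distances, scaled to [0, 1] for numerical stability.
    diff = grid_feats[:, None, :] - token_feats[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / cost.max()
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)    # rescale rows toward marginal a
        v = b / (kernel.T @ u)  # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 8))    # 4 flattened grid cells, 8 channels
tokens = rng.normal(size=(3, 8))  # 3 tokens, 8 channels
plan = sinkhorn_plan(grid, tokens)
# Barycentric projection: each token becomes a plan-weighted mix of grid cells.
aligned_tokens = (plan / plan.sum(axis=0, keepdims=True)).T @ grid
```

The barycentric projection at the end is one common way to turn a transport plan into aligned features; whether BMCR does this, and how it enforces spatial consistency, is not stated in the abstract.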
[66] V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation cs.CVPDF
Tao Liu, Leela Krishna, Gouti Pavan Kumar, Sreeja K, Vishav Garg
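TL;DR: This paper presents V2V-Bench, a comprehensive benchmark for evaluating video-to-video (V2V) generation models. It covers 11 dimensions grouped into five categories and addresses the inability of existing T2V and I2V metrics to measure both adherence to editing instructions and preservation of frame-level correspondence with the source video.
Details
Motivation: Existing text-to-video (T2V) and image-to-video (I2V) metrics do not adequately measure the two core requirements of V2V generation: following editing instructions and preserving frame-level correspondence with the source video. A dedicated benchmark is therefore needed.
Result: On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments. The evaluation covers Grok Imagine, Gemini Veo3, and Open Sora 2 and finds complementary strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality.
Insight: The novelty lies in a multi-dimensional, fine-grained benchmark for V2V generation that systematically decomposes evaluation into five categories (temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment), giving a more complete diagnostic view of model capabilities.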
Abstract: Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging editing tasks and evaluates two commercial models, Grok Imagine and Gemini Veo3, and one open-source model, Open Sora 2. Results show complementary model strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality. On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments.
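For readers unfamiliar with the agreement statistic quoted above, the following minimal sketch computes a Spearman rank correlation between an automatic metric and human ratings. The scores are hypothetical values chosen for illustration, not data from V2V-Bench, and the helper names are our own.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical metric scores and human ratings for five edited videos.
metric_scores = [0.91, 0.42, 0.77, 0.60, 0.15]
human_ratings = [4.5, 2.0, 4.8, 3.1, 1.2]
rho = spearman(metric_scores, human_ratings)
```

Because only ranks matter, the metric and the human ratings may live on different scales; a value near 1 means the metric orders videos almost the same way the raters do.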
[67] LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video cs.CV | cs.AI | cs.CLPDF
Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen
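TL;DR: This paper introduces the LongSpace framework and the LongSpace-Bench benchmark to evaluate and improve long-horizon spatial memory in multimodal large language models on long videos. LongSpace processes a long video as sequential chunks, injects 3D structural cues into early decoder layers, and builds a layer-aware memory for question-guided retrieval, strengthening the model's ability to remember and reason about spatial layouts, viewpoint changes, and related information.
Details
Motivation: MLLMs have improved at image and video understanding, but long-horizon tasks such as autonomous driving and robotic navigation require more than understanding the current view: models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. This capability has lacked both systematic evaluation and targeted improvement.
Result: Experiments on multiple spatial reasoning benchmarks show that LongSpace improves spatial understanding of long videos, supporting explicit spatial memory as a key capability for long-horizon video MLLMs.
Insight: The novelty lies in LongSpace-Bench, a benchmark dedicated to long-horizon spatial memory, and in the LongSpace framework, which combines video chunking, integration of 3D structural cues, and layer-aware memory retrieval to strengthen memory and reasoning in complex spatial tasks.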
Abstract: Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.
[68] Real-Time Threat Detection from Surveillance Cameras using Machine Learning cs.CVPDF
Gajendra Mandal, J. P. Patra, Priyansh Mahant
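TL;DR: This paper presents a machine-learning framework for real-time threat detection from surveillance cameras. It focuses on detecting guns, knives, and blunt objects common in Indian surveillance scenarios (such as iron rods, wooden sticks, and plastic rods). A YOLOv8 object detection model is trained on a custom blunt-object dataset combined with a public gun and knife dataset, with the goal of accurate and efficient real-time detection suitable for deployment on campuses, in public spaces, and in similar settings.
Details
Motivation: Traditional surveillance depends on manual monitoring, which is inefficient and prone to fatigue and delayed response. The work aims to provide an automated, intelligent public-safety solution for densely populated urban environments.
Result: Experimental evaluation shows that longer training significantly improves recall and average precision for the blunt-object class without signs of overfitting. Overall, the framework strikes an effective balance between accuracy and efficiency.
Insight: The main contribution is a custom dataset of region-specific (Indian) blunt objects, created and merged with public data to cover a threat category missing from existing public datasets, together with a validation of YOLOv8's effectiveness and real-time performance on the combined dataset.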
Abstract: Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.
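The dataset figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the image counts reported above and shows that the custom blunt-object images make up a small minority of the consolidated dataset; the variable names are our own.

```python
# Image counts as reported in the abstract.
blunt_images = 336         # custom blunt-object images
gun_knife_images = 7_623   # public gun and knife images

total_images = blunt_images + gun_knife_images
blunt_share = blunt_images / total_images

print(f"total = {total_images}, blunt share = {blunt_share:.1%}")
```

The total matches the 7,959 images stated in the abstract, and roughly 4% of them come from the blunt-object collection, so per-class metrics for that class are the ones to watch when reading the results.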
[69] VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning cs.CVPDF
Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding
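TL;DR: This paper proposes VTI-CoT, a Visual-Textual Interleaved Chain-of-Thought framework for video reasoning. It interleaves textual reasoning steps with the corresponding visual frames, builds high-quality multimodal CoT data with an automated annotation pipeline, and uses OCR-based compression to address the training inefficiency caused by very long sequences in long-video reasoning.
Details
Motivation: Existing CoT-based video reasoning methods rely mainly on text-only information for logical deduction and overlook critical visual information during reasoning, unlike human cognition, which revisits visual segments while reasoning.
Result: Experiments show that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
Insight: The novelty lies in a chain of thought that interleaves visual and textual steps to mimic human cognition, an automated annotation pipeline that produces the otherwise scarce multimodal CoT data, and OCR-based compression that makes training on long sequences more efficient.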
Abstract: Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
[70] ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation cs.CV | cs.AI | cs.LGPDF
Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong
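TL;DR: This paper proposes ViCuR (Visual Cues as Recoverable Privilege), a framework for multimodal on-policy distillation. It targets the train-test mismatch in conventional answer-privileged distillation by replacing the teacher's answer-side privilege with query-related visual evidence drawn from the input image (recoverable visual cues), and it adds a lightweight cue recovery module that aggregates task-relevant visual information, encouraging visually grounded reasoning in the student.
Details
Motivation: In multimodal reasoning, on-policy distillation often uses a privileged teacher that sees reference answers or rationales. This creates a train-test mismatch: the teacher's supervision may depend on signals the student cannot access, encouraging shortcut imitation rather than genuine visual reasoning. The paper aims to resolve this problem.
Result: Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR improves overall average performance over answer-based on-policy self-distillation by +1.19 and +1.24, respectively. With a stronger teacher, ViCuR also surpasses the on-policy distillation baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale.
Insight: The core idea is to redesign teacher privilege from unrecoverable answer-side information into visual cues that come from the input and that the student can, in principle, recover, which mitigates the train-test mismatch at its source. Methodologically, a lightweight cue recovery module uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence, without changing the inference interface or adding auxiliary cue-generation losses. The results indicate that, in multimodal on-policy distillation, the design of teacher privilege matters as much as teacher strength.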
Abstract: On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher’s supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.
[71] Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents cs.CVPDF
XiuYu Zhang, Junfeng Fang, Zhenkai Liang
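TL;DR: This paper examines the effectiveness of latent visual reasoning (LVR) methods in vision-language models (VLMs) and finds that the commonly used cosine-similarity alignment metric is negatively correlated with accuracy. Using the PRISM diagnostics, it shows that the supervised latent tokens are largely bypassed at inference and that the auxiliary loss works mainly by reshaping the language model's parameters rather than by improving the latent variable itself.
Details
Motivation: LVR methods use the alignment between latent tokens and their visual targets (cosine similarity or mean squared error) as both training loss and quality metric. The paper tests the underlying assumption that better alignment yields better answers.
Result: Across a designed matrix of five LVR variants, cosine alignment is negatively correlated with accuracy (r=-0.94). The PRISM diagnostics show that corrupting the latent tokens shifts accuracy by at most four points, and that the answer is decodable downstream of the latent but not at the latent itself.
Insight: The paper challenges the common assumption that alignment metrics are valid quality measures in LVR. Its PRISM diagnostics suggest, consistent with an Information Bottleneck reading of the loss, that the auxiliary objective reshapes the model through shared parameters rather than through the latent variable it nominally optimizes, offering a new view of how this supervision acts in VLMs.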
Abstract: Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
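To make the two quantities in this finding concrete, the sketch below computes a cosine alignment between a latent and its target, then a Pearson correlation between per-variant alignment and accuracy. The vectors and the five per-variant numbers are hypothetical and are not taken from the paper; they only illustrate how a negative r arises when better-aligned variants score lower.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Alignment of one latent token with its visual target (toy vectors).
latent = np.array([1.0, 0.0, 1.0])
target = np.array([1.0, 1.0, 0.0])
alignment = cosine_similarity(latent, target)

# Hypothetical per-variant numbers, NOT taken from the paper.
mean_cosine = [0.20, 0.35, 0.50, 0.65, 0.80]
accuracy = [0.71, 0.69, 0.66, 0.62, 0.60]
r = pearson_r(mean_cosine, accuracy)
```

With only five variants, a correlation like this describes a trend across designs rather than a causal effect, which is why the paper pairs it with probing and corruption tests.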
[72] Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models cs.CV | cs.AI | cs.LG | cs.ROPDF
Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu
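TL;DR: This paper proposes a simple one-step action generation method for vision-language-action (VLA) models. The authors argue that, unlike image generation, VLA action generation has an asymmetric condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. The complex one-step methods developed for image synthesis are therefore not necessarily required. The core recipe biases the training time distribution toward high-noise states while keeping standard velocity prediction, with no teacher model, distillation stage, or auxiliary objective.
Details
Motivation: Diffusion-based VLA models usually inherit the iterative-denoising view from image generation. Because VLA action generation has an asymmetric condition-target structure, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis.
Result: In a controlled MNIST grid-to-sequence task and in extensive robot-policy experiments (LIBERO, LIBERO-Plus, and LIBERO-Pro), one-step policies trained with high-noise-biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO they can exceed ten-step policies trained with a uniform time distribution. With a 1.4B VLM and a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. A real-robot bimanual YAM RSS evaluation provides a small-sample check of the same sampler trend.
Insight: The paper shows that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery designed for image generation. A simple adjustment to the training time distribution (a bias toward high noise) yields efficient one-step generation and avoids distillation or auxiliary objectives, which makes inference cheaper when deploying VLA models.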
Abstract: Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
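The recipe described above can be sketched in a few lines. The code assumes a straight-line path `x_t = (1 - t) * action + t * noise` with `t = 1` as pure noise, and a Beta-distributed training time; both the path convention and the Beta schedule are assumptions made for illustration, since the abstract only says the time distribution is biased toward high-noise states. The oracle velocity stands in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_times(n, alpha=3.0, beta=1.0):
    """Draw training times in [0, 1]; alpha > beta skews mass toward t = 1,
    which is pure noise under the convention assumed here."""
    return rng.beta(alpha, beta, size=n)

def interpolate(action, noise, t):
    """Noisy sample x_t on the straight path from action (t=0) to noise (t=1)."""
    return (1.0 - t) * action + t * noise

def velocity_target(action, noise):
    """Regression target for standard velocity prediction on that path."""
    return noise - action

def one_step_decode(velocity_fn, noise):
    """Single Euler step from t = 1 back to t = 0."""
    return noise - velocity_fn(noise, 1.0)

action = np.array([0.2, -0.5, 0.1])  # toy low-dimensional action chunk
noise = rng.normal(size=action.shape)
t = sample_times(10_000)

# Oracle velocity for this single (action, noise) pair, standing in for a network.
oracle = lambda x_t, time: velocity_target(action, noise)
decoded = one_step_decode(oracle, noise)
```

With an exact velocity at `t = 1`, one Euler step recovers the action; the intuition behind the high-noise bias is that a network trained mostly near `t = 1` gets more practice on exactly the input that one-step decoding feeds it.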
[73] DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models cs.CV | cs.AI | cs.LGPDF
Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi
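TL;DR: This paper proposes DRIFT, a framework for adapting pretrained vision-language models to tasks that require precise continuous outputs. It combines a base predictor with a generative refinement module based on flow matching; the residual formulation turns learning a global output distribution into modeling a local residual distribution, which simplifies training.
Details
Motivation: Existing vision-language models mostly rely on autoregressive decoding of discrete tokens and are poorly suited to tasks that need precise continuous outputs, such as localizing the temporal boundaries of events or generating robotic control actions.
Result: On perception and planning tasks, including visual grounding and robotic control, DRIFT consistently outperforms regression-based and generative baselines across multiple architectures spanning MLLMs, VLAs, and WAMs.
Insight: The novelty is the residual flow adapter, which recasts the generative modeling problem as learning a localized residual distribution around a strong prior and thereby eases optimization. The framework is also general and can be attached to a variety of vision-language model architectures.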
Abstract: Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
[74] ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection cs.CVPDF
Ruchika Sharma, Rudresh Dwivedi
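TL;DR: This paper proposes ExpSpeech-Net, a lightweight deepfake detection model that fuses two modalities, facial expressions and speech patterns, and uses SqueezeNet and an RNN as its backbone for efficient detection. The method combines image and signal feature extraction with a feature-selection strategy to address the complexity and resource cost of existing approaches.
Details
Motivation: Deepfake videos increasingly threaten the credibility of online content, while existing detectors often depend on complex, resource-intensive models that limit practical use. The study aims to build a lightweight and efficient detection framework.
Result: The model reports 94.5% accuracy, 99.3% precision, and a 96.8% F-measure on deepfake detection, outperforming conventional methods.
Insight: The novelty lies in combining multimodal fusion (expression and speech), a lightweight architecture (SqueezeNet plus an RNN), and preprocessing with the SASMA feature-selection algorithm, pointing toward real-time, practical deepfake detection.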
Abstract: Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. This study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which uses SqueezeNet and an RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, 99.3% precision, and a 96.8% F-measure, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.
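The abstract reports precision and F-measure but not recall. Assuming the F-measure is the balanced F1 score (an assumption, since the abstract does not say which variant is used), recall can be backed out by inverting F1 = 2PR / (P + R), as this short sketch shows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_recall(precision, f1):
    """Invert F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)

precision, f1 = 0.993, 0.968   # values reported in the abstract
recall = implied_recall(precision, f1)
print(f"implied recall = {recall:.3f}")
```

Under that assumption the implied recall is about 0.944, noticeably below the precision, which suggests the detector misses more fakes than it raises false alarms.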
[75] Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction cs.CVPDF
Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan
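TL;DR: This paper proposes Future-L1, an interleaved latent visual reasoning framework for video event prediction. It lets a multimodal large language model alternate between language tokens and continuous latent visual spans during autoregressive decoding, preserving fine-grained visual cues. New state-of-the-art results on FutureBench and TwiFF-Bench support the value of keeping intermediate visual semantics in latent space.
Details
Motivation: Existing video MLLMs usually verbalize intermediate future reasoning as text, which loses fine-grained motion, geometry, and interaction cues and leads to plausible but visually ungrounded hallucinations. The paper aims to help models better infer unobserved future states from partial video evidence.
Result: On FutureBench, Future-L1 improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best model, Video-CoE, by 10.4 points. On TwiFF-Bench, it raises the average score from 2.44 to 3.04. Both are new state-of-the-art results.
Insight: The main novelty is the interleaved latent visual reasoning framework, which lets the model alternate between language and latent visual representations and avoids the visual information loss of fully verbalized reasoning. The Future-L1-50K dataset and LA-DAPO, a latent-aware reinforcement learning objective with outcome-contrastive and temporal-diversity rewards, are used to train and optimize the model, offering a new paradigm for future-oriented video reasoning.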
Abstract: Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
[76] Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment cs.CVPDF
Qifei Jia, Xintong Yao, Minghao Li, Yajie Chai, Qiming Lu
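TL;DR: This paper proposes RED-Aes, a new framework for image aesthetic assessment (IAA) that uses controllable image editing models to simulate human comparison-based aesthetic reasoning instead of directly regressing absolute Mean Opinion Scores (MOS). It builds the RED-20k dataset of edited image pairs with Chain-of-Thought reasoning and adopts a three-stage training strategy based on relative ranking consistency, achieving state-of-the-art performance on multiple public benchmarks with strong generalization.
Details
Motivation: Traditional IAA methods regress absolute aesthetic scores and overlook the fact that human aesthetic perception is inherently dynamic and relies on subconscious comparison against implicit visual references. As a result, models lack causal reasoning about aesthetic differences and struggle to learn generalizable aesthetic principles.
Result: RED-Aes achieves state-of-the-art performance on multiple public benchmarks and shows superior generalization.
Insight: The core idea is to recast IAA from learning absolute scores to learning the visual factors that drive aesthetic change: controllable editing generates image pairs that simulate human comparison, and a relative ranking consistency reward optimizes the model purely through relative supervision. The dataset of edited pairs with Chain-of-Thought annotations is also a new resource for reasoning-based aesthetic assessment.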
Abstract: Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.
[77] Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models cs.CV | cs.AIPDF
Haibo Wang, Lifu Huang
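TL;DR: This paper proposes GeoVR, a framework that addresses the lack of 3D awareness in multimodal large language models. Using only 2D video sequences, it learns geometric representations by distilling geometric knowledge from pretrained 3D foundation models, unlocking spatial intelligence in the model.
Details
Motivation: Existing MLLMs excel at 2D semantic understanding, but their internal representations lack intrinsic 3D awareness and fail to maintain geometric and spatial consistency across video frames. Because large-scale 3D data is scarce, a method that learns geometric representations from 2D video alone is needed.
Result: Extensive experiments on spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance.
Insight: The novelty is a multi-objective learning strategy driven by four complementary geometric targets (inter-frame camera pose estimation, dense depth map regression, metric scale factor prediction, and multi-scale 3D feature distillation). Guided by these explicit physical and geometric constraints, the MLLM's internal representations are reshaped to develop strong 3D awareness, which the authors present as a new paradigm for giving foundation models spatial intelligence.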
Abstract: Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model’s internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
[78] Resonant Minds: Closed-Loop Social Avatars with Theory of Mind cs.CVPDF
Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang
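TL;DR: This paper proposes Resonant Minds, a closed-loop dual-agent framework that unifies cognitive reasoning and multimodal generation to create digital humans with genuine social intelligence. The framework integrates perception, social reasoning, and expression modules: it analyzes a partner's multimodal behavior from video, infers hidden mental states, and generates emotion-controllable videos that capture the two-way interaction between speaker and listener.
Details
Motivation: Current approaches treat cognitive reasoning (such as dialogue) and multimodal generation (such as talking-head models) as separate tasks, so the resulting digital humans either lack embodied expression or ignore social cognition. The paper aims to bridge this gap with a closed-loop system that sustains social interaction and has Theory of Mind capabilities.
Result: On the constructed hierarchical Persona-Scenario dataset, which is built on psychologically grounded personas and private social goals, the method shows competitive or superior performance on both dialogue quality and video generation metrics. On key dialogue quality dimensions it even surpasses the full-information Script mode.
Insight: The core novelty is integrating Theory of Mind driven social reasoning with multimodal generation in a closed-loop interaction framework, with explicit inference of hidden mental states under uncertainty, which produces more thoughtful dialogue and bidirectionally dynamic interaction videos. The evaluation dataset with information asymmetry is also a notable contribution.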
Abstract: Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners’ multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.
[79] CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications cs.CVPDF
Haipeng Li, Zhen Liu, Zhanglei Yang, Hai Jiang, Tianhao Zhou
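TL;DR: CamFlow+ is a hybrid-basis framework for 2D camera motion estimation that represents camera motion directly in dense-flow space. It combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. On the GHOF-Cam benchmark, CamFlow+ improves both sparse and dense camera motion estimation; in digital video stabilization it improves global and local stability and obtains the best preference rate in a blind user study.
Details
Motivation: Existing homography-based methods work well for planar scenes or pure rotation but struggle with camera translation, depth variation, and local parallax. Local homography and mesh-based models are more flexible but still rely on piecewise planar assumptions. The paper proposes a more flexible framework for accurate 2D camera motion estimation.
Result: Experiments on GHOF-Cam (a benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion) show that CamFlow+ improves sparse and dense camera motion estimation. In digital video stabilization it also improves global and local stability, achieving the best top-1 preference rate in a blind user study.
Insight: The novelty is a hybrid-basis framework that combines physical, stochastic, and depth-translational bases in dense-flow space, relaxing the planar scene assumption. A depth-aware smoothness term additionally regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries.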
Abstract: Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: https://lhaippp.github.io/CamFlow+.
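To illustrate what representing camera motion "in dense-flow space" means, the sketch below converts a 3x3 homography into the per-pixel flow field it induces and forms a weighted sum of several such fields. This covers only the generic homography-to-flow step; the function names and the linear combination are illustrative assumptions, and the paper's stochastic and depth-translational bases are not reproduced here.

```python
import numpy as np

def homography_flow(H, height, width):
    """Dense flow field (height, width, 2) induced by a 3x3 homography H."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    ones = np.ones_like(xs)
    pts = np.stack([xs, ys, ones], axis=-1)          # homogeneous pixel coords
    warped = pts @ H.T                               # apply H to every pixel
    warped_xy = warped[..., :2] / warped[..., 2:3]   # perspective divide
    return warped_xy - pts[..., :2]                  # displacement per pixel

def combine_bases(bases, weights):
    """Weighted sum of flow bases, each of shape (height, width, 2)."""
    return np.tensordot(np.asarray(weights), np.stack(bases), axes=1)

# A pure translation by (+2, -1) pixels gives a constant flow field.
shift = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0],
                  [0.0, 0.0, 1.0]])
flow_shift = homography_flow(shift, height=4, width=5)
flow_identity = homography_flow(np.eye(3), height=4, width=5)
mixed = combine_bases([flow_shift, flow_identity], [0.5, 0.5])
```

A single homography can only describe the flow of one plane (or a pure rotation), which is the limitation that motivates adding further bases for parallax.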
[80] MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering cs.CV | cs.CLPDF
Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan
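TL;DR: This paper proposes MemoryCard, a framework for long-video question answering that organizes long videos into topic-aware multimodal Memory Cards, addressing the weak event-level semantics of existing methods that rely on isolated frames.
Details
Motivation: In long-video question answering, answer-relevant evidence is sparse, transient, and scattered. Existing frame-centric methods use isolated frames as evidence units, which limits the ability of vision-language models to capture coherent event-level semantics.
Result: Under comparable visual-token budgets, MemoryCard consistently improves long-video QA performance, with up to a 21.8% relative improvement in accuracy.
Insight: The novelty lies in segmenting the video into semantically coherent units and generating event-level summaries, then integrating multimodal cues into Memory Cards, which improves the compression and retrieval of evidence scattered across a long video.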
Abstract: Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs’ ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.
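The headline number above is a relative improvement, which is easy to confuse with an absolute gain in accuracy points. The sketch below shows the difference using hypothetical accuracies chosen so that the relative gain comes out at 21.8%; these values are not reported in the paper.

```python
def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline value."""
    return (improved - baseline) / baseline

# Hypothetical accuracies chosen for illustration, not reported in the paper.
baseline_acc, memory_card_acc = 0.500, 0.609
abs_gain = memory_card_acc - baseline_acc
rel_gain = relative_improvement(baseline_acc, memory_card_acc)
print(f"absolute gain = {abs_gain:.3f}, relative gain = {rel_gain:.1%}")
```

The same 21.8% relative gain corresponds to a smaller absolute gain when the baseline is weaker, so the baseline accuracy is needed to interpret the figure.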
[81] Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs cs.CVPDF
Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu
TL;DR: 本文提出了一种名为场景引导关系建模(Scene-guided Relational Modeling)的检测框架,用于解决开放词汇目标检测(OVOD)中识别训练数据中未见过的物体类别的问题。该框架利用场景图捕捉候选区域与其上下文物体之间的结构化语义和空间关系,并通过关系注意力模块增强关键关系线索,同时引入基于场景的文本对齐分支来指导关系对齐,从而提升对未知类别的检测性能。
Details
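Motivation: Existing knowledge-distillation-based OVOD methods perform well but often overlook the structured, image-specific relationships between objects, such as interactions and spatial arrangements, which can significantly restrict their effectiveness on novel categories.
Result: Comprehensive experiments on COCO and LVIS show that the model outperforms other OVOD methods, in particular improving average precision (AP) on novel categories.
Insight: The novelty lies in explicitly using scene graphs to model structured relationships between objects, together with a Relation Attention Module and a scene-based textual alignment branch that integrate visual relations with semantic information. This offers a new way to exploit contextual relational knowledge in open-vocabulary detection.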
Abstract: Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.
[82] T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation cs.CV | cs.ROPDF
Jingkun Feng, Reza Sabzevari
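TL;DR: This paper proposes T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method intended to provide actionable perception for robotic applications. It takes the 3D point cloud and posed RGB-D images of an indoor scene, builds an open-vocabulary scene graph, and uses a vision-language model to locate the functional components of the instances relevant to a task description. Experiments on SceneFun3D show segmentation performance comparable to the state of the art (SOTA) with faster runtime and lower memory usage.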
Details
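Motivation: Open-vocabulary 3D functionality segmentation is essential for robots to localize functional object components, but existing methods either focus on object-level recognition or exhaustively segment the entire scene, which is resource-intensive and slow. Balancing granularity, accuracy, and speed remains difficult.
Result: On the SceneFun3D dataset, T-FunS3D is comparable to SOTA methods in open-vocabulary 3D functionality segmentation while achieving faster runtime and lower memory usage.
Insight: The novelty is a task-driven hierarchical approach that builds an open-vocabulary scene graph and applies a vision-language model for targeted reasoning, instead of exhaustively segmenting the whole scene, yielding a better trade-off among performance, speed, and resource efficiency.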
Abstract: Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
[83] Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models cs.CVPDF
Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo
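TL;DR: The paper introduces FEPBench, a benchmark for evaluating how well text-to-image (T2I) models generate natural-science illustrations. It is built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types, uses multimodal large language models (MLLMs) and human experts to provide fine-grained atom set annotations, and evaluates T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision.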
Details
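Motivation: Existing benchmarks usually assess outputs holistically and overlook fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. A new benchmark is therefore needed to evaluate T2I models specifically and in detail on scientific illustration generation.
Result: Even SOTA closed-source models such as GPT Image 2 and Nano Banana Pro still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision.
Insight: The novelty is FEPBench itself, a benchmark dedicated to scientific illustration generation, with fine-grained atom set annotations and a three-dimension evaluation framework (faithfulness, enrichment, precision). Performance is further decomposed across visual, textual, relation, and layout elements, giving concrete guidance for improving and deploying T2I models.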
Abstract: Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released.
[84] Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder cs.CV | cs.LGPDF
Yoshiyuki Ootani
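TL;DR: This paper presents an efficient inference framework for streaming video stylization. It pairs a distilled lightweight U-Net with a large multimodal language model (MLLM) text encoder and uses three engineering mechanisms (an asymmetric batched pipeline, graph fusion, and a conditioning-refresh schedule) to reach video-rate stylization (about 27-74 FPS) on consumer GPUs.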
Details
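Motivation: Once the diffusion U-Net is aggressively distilled (for example to 4 steps or 1 step), the compute bottleneck in real-time text-to-image pipelines shifts from the denoiser to the large MLLM text encoder. This is most acute in vision-aware edit diffusion, which must process streaming video input efficiently.
Result: On a single RTX 3090 Ti at 512x512, the pipeline reaches 27.4 FPS at batch size B=8 and 29.6 FPS at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively. The same operating point measures 54.9 FPS on RTX 4090 and 74.1 FPS on RTX 5090. The stylization model generalises well to unseen video sequences such as those from DAVIS-2017.
Insight: The contributions include: 1) an asymmetric CUDA pipeline that relieves the encoder bottleneck through batched text-encoder amortisation and static-prompt caching; 2) compiling the U-Net and adapter stack into a single fused graph for faster inference; 3) a periodic conditioning-refresh schedule that lowers the per-frame conditioning cost. These engineering optimizations make video-rate stylization feasible in resource-constrained settings.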
Abstract: Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.
[85] Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting cs.CVPDF
Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos
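TL;DR: This paper describes the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). The system is a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, it combines visual, textual, demographic, biometric, and LLM-derived semantic indicators to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, it studies the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata.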
Details
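Motivation: To automatically identify and characterize sexism at a fine-grained level in multimodal content, particularly memes and short-form videos, in response to the growing challenge of detecting sexist content on social media.
Result: Development results show that LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. On the official test set, the unfiltered feature representation generalizes better to unseen video data, indicating that the development-phase conclusion favoring compact feature selection does not fully transfer.
Insight: The novelty is targeted semantic feature engineering for static memes (fusing indicators from multiple sources), along with the finding that noisy short-form video settings need more robust temporal modeling. Viewed objectively, the work highlights the trade-off between feature engineering and model generalization in multimodal sexism detection.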
Abstract: We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.
[86] ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition cs.CVPDF
Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral
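TL;DR: This paper proposes ReSAGE-PAR, which targets the limited diversity and scarcity of data in pedestrian attribute recognition (PAR). It synthesizes images with diffusion models guided by attribute-based prompts and bridges the domain gap through a generate-score-autolabel pipeline, enabling scalable, high-fidelity dataset expansion.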
Details
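Motivation: PAR suffers from data scarcity and limited diversity, and controlled generation with diffusion models faces two key challenges: the domain gap between pre-training data and low-resolution surveillance images, and the need for reliable attribute verification to prevent generative hallucinations.
Result: ReSAGE-PAR yields gains of up to 8.7% on standard backbones and pushes SOTA frameworks to new performance levels, supporting its value as an architecture-agnostic, scalable solution for PAR enhancement.
Insight: The contributions include: 1) a tailored LoRA-based image-to-image approach that adapts pre-trained diffusion models to native PAR resolutions; 2) a comprehensive prompting strategy with label-consistent and label-inconsistent complements, used to extract vision-language alignment scores between generated images and their conditioning prompts; 3) a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels, forming a robust generate-score-autolabel pipeline.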
Abstract: To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements, achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at http://www-vpu.eps.uam.es/publications/ReSAGE-PAR.
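The last stage of the pipeline, turning a continuous alignment score into a binary pseudo-label with a Bayesian classifier, can be illustrated with a minimal sketch. The abstract does not specify the likelihood model, so the Gaussian class-conditionals, their (mean, std) values, the prior, and the function names below are all assumptions chosen for illustration, not the authors' formulation.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_present(score, pos=(0.30, 0.05), neg=(0.20, 0.05), prior=0.5):
    """P(attribute present | alignment score) via Bayes' rule.

    pos / neg are hypothetical (mean, std) pairs for the score distribution
    when the attribute is present / absent; in practice they would be
    estimated from scored images with known labels.
    """
    like_pos = gaussian_pdf(score, *pos) * prior
    like_neg = gaussian_pdf(score, *neg) * (1.0 - prior)
    return like_pos / (like_pos + like_neg)


def pseudo_label(score, threshold=0.5, **kwargs):
    """Binary pseudo-label: 1 if the posterior reaches the threshold."""
    return int(posterior_present(score, **kwargs) >= threshold)
```

With equal standard deviations and a uniform prior, the decision boundary sits midway between the two class means. Raising `threshold` trades recall for cleaner pseudo-labels.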
[87] Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation cs.CVPDF
Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma
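TL;DR: This paper proposes a vision-language model method based on global-local Monte Carlo Tree Search (MCTS) for generating 3D indoor scenes from text. It models scene generation as a tree search problem, optimizes object placement with a hierarchical scene representation and PRM-guided MCTS, and predicts textures with diffusion models. On the authors' large-scale, diverse dataset 3DTindo-bench, it generates more realistic scenes than existing methods.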
Details
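Motivation: Existing text-to-3D scene generation methods based on large vision-language models (LVLMs) mostly use chain-of-thought sequential decision mechanisms that cannot revise earlier mistakes, causing error propagation. The paper instead treats the task as a planning problem constrained by spatial and layout commonsense.
Result: On 3DTindo-bench, the authors' large-scale, diverse dataset (65 scene types and 3,250 instructions), experiments show that the method generates more realistic 3D scenes than SOTA approaches.
Insight: The contributions include: modeling scene generation as a tree search problem with global and local trees instead of sequential decision-making; a hierarchical scene representation (room, region, floor object, and supported object levels) and a PRM-guided MCTS method that prunes branches and balances exploration and exploitation; diffusion-based texture prediction to keep the overall scene appearance consistent; and a large-scale evaluation dataset that addresses the limited scale and diversity of existing benchmarks.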
Abstract: Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.
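The two ingredients named in the abstract, pruning candidate branches with a scoring model and balancing exploration against exploitation during search, can be sketched generically. The code below uses the standard UCT selection rule and a top-k pruning step. The abstract does not expand "PRM" or describe its interface, so treating it as a list of scores over candidate placements is an assumption, and all function names are hypothetical.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Return the index of the child with the highest UCT score.

    children: list of dicts with keys 'value_sum' and 'visits'.
    """
    scores = [
        uct_score(ch["value_sum"], ch["visits"], parent_visits, c)
        for ch in children
    ]
    return max(range(len(children)), key=scores.__getitem__)


def prune_by_score(candidates, scores, keep=2):
    """Keep the top-`keep` candidates by score (stand-in for PRM pruning)."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:keep]]
```

A larger `c` favors less-visited placements, while `c=0` reduces selection to picking the best mean value so far.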
[88] Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction cs.CVPDF
Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu
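TL;DR: This paper proposes a self-supervised 3D reconstruction framework based on neural scene representations to handle data truncation in cone-beam CT (CBCT). A coordinate network maps spatial coordinates directly to radiodensity, bypassing traditional filtering and backprojection, which eliminates truncation-induced ring artifacts and enables robust 3D data extrapolation. A physics-based iterative refinement module then counters the loss of high-frequency texture caused by the coordinate network's inherent spectral bias, unifying artifact suppression with high-fidelity detail preservation.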
Details
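Motivation: Existing deep learning methods for truncated CBCT reconstruction have serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations.
Result: Extensive experiments on simulated and real-world datasets show that the method combines the artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.
Insight: The novelty is combining neural scene representations with a physics-based iterative refinement module: the artifact-free, extrapolated volume from the coordinate network serves as the initialization, and high-frequency structural information is progressively re-extracted from the original projections and injected back. This addresses truncation artifacts and texture preservation together within a self-supervised framework.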
Abstract: Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.
[89] LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing cs.CVPDF
Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian
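TL;DR: LoomVideo is an efficient 5B-parameter unified architecture for video generation and editing that handles interleaved multimodal inputs. It replaces the standard text encoder with a multimodal large language model (MLLM) and aligns MLLM features with the Diffusion Transformer (DiT) through a Deepstack injection mechanism. Its core contribution is a zero-overhead Scale-and-Add conditioning approach for video editing, which avoids the compute overhead of sequence concatenation and markedly speeds up inference.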
Details
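Motivation: Existing unified video generation and editing frameworks typically rely on massive models (13B parameters or more) and introduce source-video conditions for editing by concatenating sequence tokens. This doubles the sequence length and quadruples the cost of self-attention, which is prohibitive. LoomVideo aims to remove these bottlenecks with an efficient yet capable unified model.
Result: In extensive experiments, the compact 5B model achieves SOTA or highly competitive performance across comprehensive benchmarks, with particular strength in e-commerce and fashion generation scenarios. Thanks to the zero-overhead conditioning mechanism, LoomVideo is at least 5.41x faster at inference than models of similar capability.
Insight: The claimed contributions include: 1) using an MLLM to process multimodal inputs in a unified way; 2) the zero-overhead Scale-and-Add conditioning approach, which avoids sequence concatenation and greatly reduces compute; 3) a Negative Temporal RoPE strategy for handling multiple reference images. Viewed objectively, pairing efficient computational design with strong editing capability offers a path toward practical, efficient video foundation models.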
Abstract: Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
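The cost argument in the abstract (concatenation doubles the token count and so quadruples the pairwise attention cost, while scaling and adding leaves the token count unchanged) can be checked with a small shape-level sketch. The `(tokens, channels)` latent layout, the `scale` value, and the function names are assumptions for illustration. The abstract does not say how the scale is chosen or where in the network the addition happens.

```python
import numpy as np


def attention_pair_count(num_tokens):
    """Number of query-key pairs in full self-attention over num_tokens."""
    return num_tokens * num_tokens


def concat_condition(noised_target, source_latent):
    """Baseline conditioning: stack source tokens after target tokens.

    (N, D) and (N, D) -> (2N, D), so the sequence length doubles.
    """
    return np.concatenate([noised_target, source_latent], axis=0)


def scale_and_add_condition(noised_target, source_latent, scale=0.5):
    """Sketch of scale-and-add: the sequence length stays at N.

    `scale` is a hypothetical fixed constant used only for this example.
    """
    return noised_target + scale * source_latent


rng = np.random.default_rng(0)
target = rng.normal(size=(6, 4))   # 6 tokens, 4 channels
source = rng.normal(size=(6, 4))

concat_tokens = concat_condition(target, source).shape[0]
added_tokens = scale_and_add_condition(target, source).shape[0]
cost_ratio = attention_pair_count(concat_tokens) / attention_pair_count(added_tokens)
```

Here `cost_ratio` captures only the quadratic attention term. End-to-end speedups also depend on the non-attention layers and on the model sizes being compared.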
[90] LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations cs.CVPDF
Mritula Chandrasekaran, Sanket Kachole, Jarik Francik, Dimitrios Makris
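TL;DR: This paper proposes a multimodal LLM-guided framework that synthesizes pathology-aware 3D gait data from structured textual descriptions, addressing the scarcity of pathological gait datasets. It combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation to produce fixed-length skeleton sequences for pathological gait classification.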
Details
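Motivation: Pathological gait datasets are scarce because of privacy, recruitment, cost, and movement variability. The goal is to synthesize high-quality, pathology-aware gait data from textual descriptions to support downstream tasks.
Result: Experiments suggest that the synthetic sequences, when combined with real data, improve downstream classification for recurrent classifiers such as GRUs. Under a leave-one-subject-out protocol, a GRU classifier trained with real and synthetic samples achieves the best accuracy of 92.77%.
Insight: The main novelty is a pathology-aware tokeniser designed to preserve pathology-specific motion characteristics during discrete representation learning, combined with LLM-based semantic augmentation to generate pathological gaits from structured language. This offers an approach to data synthesis in data-scarce domains.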
Abstract: Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.
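The leave-one-subject-out protocol mentioned in the abstract holds out every recording of one subject per fold, so no subject appears in both training and test data. A minimal sketch of the fold construction follows. The function name and the list-of-subject-IDs input are assumptions, and the abstract does not say how synthetic samples are assigned to folds.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per fold.

    subject_ids[i] is the subject that produced sample i. Each fold tests
    on all samples of one subject and trains on everyone else.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

One plausible way to combine this with synthetic data, again an assumption, is to append synthetic sequences to the training indices only, so that each test fold contains real recordings exclusively.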
[91] ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE cs.CVPDF
Mishan Aliev, Eva Neudachina, Ilya Bykov, Aleksandr Oganov, Kirill Struminsky
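TL;DR: This paper proposes ReCache, a reinforcement-learning method that learns budget-aware caching schedules for diffusion models. The user specifies a compute budget directly, and ReCache learns a policy for choosing which steps to recompute so as to maximize generation quality, improving image quality over baseline schedules while substantially reducing compute.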
Details
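Motivation: Diffusion model inference is expensive, and existing caching schedules (fixed ones such as uniform, or adaptive ones based on per-step error heuristics) do not let the user control compute cost directly; the cost is a side effect of hand-tuned thresholds. The paper aims to make compute cost an input the user specifies and to learn the best recomputation schedule for it.
Result: ReCache outperforms scheduling baselines across benchmarks. On FLUX, under a 5.04x FLOPs reduction, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache. On Wan 2.1, at a roughly 2.6x speedup, it reduces LPIPS by 65% (from 0.480 to 0.169) and raises the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache.
Insight: The core idea is to formulate caching scheduling as a budget-constrained optimization problem and solve it with label-free reinforcement learning (policy gradients), avoiding backpropagation through full diffusion inference. The learned policy responds directly to the user-specified budget and is compatible with any caching mechanism, including feature reuse and feature forecasting, which makes it general and controllable.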
Abstract: Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a $\times5.04$ FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a $\sim \times2.6$ speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at https://github.com/thecrazymage/ReCache.
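To make the idea of learning a budgeted schedule with policy gradients concrete, here is a toy REINFORCE sketch. It is not the authors' parameterization: the abstract does not describe the policy, so this example assumes one learnable logit per denoising step and samples exactly `k` distinct steps without replacement (a Plackett-Luce draw). The reward function is a hypothetical stand-in for a generation-quality score.

```python
import numpy as np


def sample_schedule(logits, k, rng):
    """Sample k distinct steps to recompute and return them together with
    the gradient of the sample's log-probability w.r.t. the logits."""
    remaining = np.ones(len(logits), dtype=bool)
    grad = np.zeros_like(logits, dtype=float)
    chosen = []
    for _ in range(k):
        masked = np.where(remaining, logits, -np.inf)
        probs = np.exp(masked - masked[remaining].max())
        probs /= probs.sum()
        step = int(rng.choice(len(logits), p=probs))
        # For a softmax draw: d log p(step) / d logits = onehot(step) - probs
        grad -= probs
        grad[step] += 1.0
        remaining[step] = False
        chosen.append(step)
    return sorted(chosen), grad


def reinforce_step(logits, k, reward_fn, rng, lr=0.5, baseline=0.0):
    """One score-function update: raise the log-probability of schedules
    whose reward exceeds the baseline, lower it otherwise."""
    schedule, grad = sample_schedule(logits, k, rng)
    reward = reward_fn(schedule)
    return logits + lr * (reward - baseline) * grad, reward


def toy_reward(schedule, important=frozenset({0, 5, 9})):
    """Hypothetical reward: fraction of 'important' steps recomputed."""
    return len(set(schedule) & important) / len(important)
```

Because the budget `k` is an argument to the sampler and is never tuned through a threshold, every sampled schedule costs exactly `k` full recomputations. In this sketch that is what turns compute into a direct input.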
[92] Knowledge Distillation for Visual Autoregressive Models cs.CVPDF
Elia Peruzzo, Aritra Bhowmik, Guillaume Sautiere, Yuki M Asano, Amirhossein Habibian
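TL;DR: This paper presents the first systematic study of knowledge distillation strategies for autoregressive image generation models. It finds that distillation methods developed for language models do not transfer directly to images, and proposes VarKD, a framework that improves distillation by selectively applying teacher supervision and reducing token ambiguity.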
Details
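Motivation: Autoregressive image generation models are highly expressive but computationally expensive, so effective model compression is needed, yet the behavior of knowledge distillation in visual autoregressive generation remains underexplored.
Result: Experiments on ImageNet across multiple autoregressive backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.
Insight: The novelty is showing that visual token ambiguity and long decoding horizons make teacher supervision unreliable, and proposing a framework that distills on student samples while applying teacher supervision selectively. The idea may carry over to model compression for other visual sequence generation tasks.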
Abstract: Autoregressive (AR) image generation models are highly expressive but computationally intensive, motivating effective model compression. Knowledge distillation (KD) is a natural approach for model compression and has been widely studied in language modeling, yet its behavior in visual AR generation remains underexplored. In this work, we present the first systematic study of distillation strategies for AR image models. Our analysis shows that while standard distillation can yield meaningful gains, recent methods developed for language do not directly transfer to images: long decoding horizons and visual token ambiguity make teacher supervision unreliable especially under student-conditioned contexts. To address this, we propose VarKD, a distillation framework for visual autoregressive models that distills on student samples while selectively applying teacher supervision and reducing token-level ambiguity. Experiments on ImageNet across multiple AR backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.
[93] HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning cs.CVPDF
Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman
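TL;DR: This paper proposes HyperVis, which builds continuous latent visual relational graphs to address the weakness of vision-language models on compositional reasoning. It generates a dense visual relation tensor from class-agnostic region proposals, projects it onto a Lorentz hyperboloid, and enforces hierarchy through spatial physics constraints (IoA-driven entailment cones and exterior-angle repulsion). HyperVis can act as a training-time regularizer that improves generative VQA, or as an inference-time relational encoder that boosts discriminative compositional scoring.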
Details
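Motivation: Vision-language models struggle with compositional reasoning that requires understanding inter-object relationships, and directly injecting discrete scene graph triplets makes the continuous visual modality collide with text labels, which lowers performance instead of raising it.
Result: On GQA, HyperVis raises accuracy to 61.03%, compared with 57.21% for LoRA fine-tuning without relational losses; on SugarCrepe, it reaches 79.94%, 6.25 percentage points above the baseline. The learned curvature stabilises at κ=4.0, well above prior hyperbolic vision-language models, where the curvature tends to collapse toward zero.
Insight: The novelty lies in bypassing the semantic bottleneck of discrete scene graph generators entirely and building a continuous visual relational representation; in using the exponential volume growth of hyperbolic space to model continuous relations among visual features; and in enforcing relational hierarchy through spatial physics constraints. The compositionality gain is attributed mainly to hyperbolic geometry rather than Euclidean space.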
Abstract: Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58pp over Euclidean), with entailment loss $\sim 6\times$ higher in Euclidean training. Codes are available at TBA.
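Projecting Euclidean features onto a Lorentz hyperboloid is commonly done with the exponential map at the origin. The sketch below uses one standard convention, in which points satisfy $\langle x, x \rangle_L = -1/\kappa$ with $\kappa > 0$. Whether HyperVis uses this exact map or convention is not stated in the abstract, so the function names and the choice of map are assumptions. Only the value $\kappa = 4.0$ comes from the abstract.

```python
import numpy as np


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + <x_space, y_space>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])


def exp_map_origin(v, kappa):
    """Map a Euclidean tangent vector v at the origin onto the hyperboloid
    {x : <x, x>_L = -1/kappa, x0 > 0}."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    sqrt_k = np.sqrt(kappa)
    if norm < 1e-12:
        return np.concatenate([[1.0 / sqrt_k], np.zeros_like(v)])
    time = np.cosh(sqrt_k * norm) / sqrt_k
    space = np.sinh(sqrt_k * norm) * v / (sqrt_k * norm)
    return np.concatenate([[time], space])


def lorentz_distance(x, y, kappa):
    """Geodesic distance between two points on the hyperboloid."""
    arg = np.clip(-kappa * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(kappa)


kappa = 4.0                                   # curvature value from the abstract
point = exp_map_origin(np.array([0.3, -0.4]), kappa)
origin = exp_map_origin(np.zeros(2), kappa)
```

Under this convention the geodesic distance from the origin to `exp_map_origin(v, kappa)` equals the Euclidean norm of `v`, while volume grows exponentially with that distance. This growth is the property the abstract appeals to.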
[94] MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models cs.CVPDF
Tariq M. Khan, Syed Saud Naqvi, Thantrira Porntaveetus, Hamid Alinejad-Rokny, Shahzaib Iqbal
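TL;DR: This paper proposes the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework that guides the design and adaptation of medical image segmentation models by systematically analyzing dataset characteristics (such as foreground occupancy, morphology, and boundary ambiguity) instead of focusing only on architecture search. It maps dataset evidence to failure modes, design priors, and risk-aligned criteria, and is evaluated on DRIVE, ISIC2018, and ACDC, showing that different datasets call for different model priors, operating points, and evaluation evidence.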
Details
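Motivation: Medical image segmentation research often over-focuses on optimizing model architecture and overlooks a more fundamental question: what the dataset requires from the model. MS-DKC makes the dataset factors that affect segmentation performance explicit (foreground occupancy, morphology, boundary ambiguity, topology sensitivity, and others), providing more traceable and targeted design guidance.
Result: On DRIVE, the MS-DKC-guided DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. On ISIC2018, MS-DKC-AttNextTopo-VCSF-NoAug achieved Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13. For ACDC multi-class cardiac segmentation, MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation.
Insight: The novelty is a systematic dataset knowledge card framework that ties dataset characteristics directly to model design decisions, making segmentation design more transparent and interpretable. Viewed objectively, the study stresses data-driven design: a model's effectiveness depends heavily on understanding what the dataset intrinsically requires (such as morphological complexity, annotation quality, and deployment risk), not on pursuing generic or complex architectures.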
Abstract: Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate.
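Dice and IoU, the two overlap metrics reported throughout the abstract, are closely related: for a single prediction/target pair, Dice = 2·IoU / (1 + IoU). The sketch below computes both from binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's own averaging protocol is not described in the abstract.

```python
import numpy as np


def dice_and_iou(pred, target):
    """Dice coefficient and IoU for one pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * inter / total if total else 1.0   # both empty: perfect match
    iou = inter / union if union else 1.0
    return float(dice), float(iou)


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

The identity holds exactly per mask pair. Scores averaged per image over a dataset need not satisfy it, which is one reason a reported Dice and IoU can differ from what the formula would give.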
[95] Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback cs.CVPDF
Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen
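TL;DR: This paper proposes Structured Defect Grounding (SDG) for diagnosing localized, subtle, and structurally complex defects in images generated by text-to-image (T2I) models. Each defect is modeled as a (location, type, reason, importance) tuple, and the authors build SDG-30K, a 30K-image dataset, for evaluation. On top of this representation, they present a diagnosis-to-alignment framework in which a vision-language model (VLM) serves as the SDG detector and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Experiments show that the SDG detector outperforms leading proprietary VLMs on structured defect grounding and that SDG-guided rewards consistently improve T2I alignment and support localized image refinement.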
Details
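Motivation: Although T2I models generate increasingly photorealistic images, they still exhibit localized, subtle, and structurally complex defects. Diagnosing them requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and how much it matters to overall image quality. Recent dense-feedback methods move beyond scalar supervision, but their heatmap-centric representations still treat diagnosis as pixel-field regression, which makes it hard to localize variable-cardinality defects and to bind semantic reasons to individual failures.
Result: Extensive experiments on SDG-30K show that the proposed SDG detector outperforms leading proprietary VLMs on structured defect grounding. SDG-guided rewards also consistently improve T2I alignment during diffusion model alignment and support localized image refinement, giving a unified, instance-level way to diagnose, evaluate, and enhance modern generative models.
Insight: The core novelty is recasting T2I diagnosis as structured set prediction, with each defect modeled as a (location, type, reason, importance) tuple, which removes the bottleneck that heatmap representations face in localizing variable-cardinality defects and binding semantic reasons. The SDG-30K dataset and the dedicated evaluation protocol SDG-Eval make the formulation trainable and measurable. The diagnosis-to-alignment framework (a VLM as detector, BoxFlow-GRPO converting defects into spatial rewards) shows how structured diagnoses can drive model improvement, offering an approach to fine-grained evaluation and optimization of generative models.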
Abstract: Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
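To make the (location, type, reason, importance) representation concrete, here is a minimal sketch of a defect record and of one way a set of boxes could be rasterized into an importance-weighted spatial map. The abstract does not specify how BoxFlow-GRPO builds its reward, so the `Defect` class, the max-over-boxes rule, and the `1 - penalty` reward are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    box: tuple          # (x0, y0, x1, y1) in pixels, x1/y1 exclusive (assumed convention)
    defect_type: str    # e.g. "anatomy"
    reason: str         # free-text explanation
    importance: float   # assumed to lie in [0, 1]


def penalty_map(defects, height, width):
    """Per-pixel penalty: the largest importance among boxes covering the pixel."""
    grid = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = max(grid[y][x], d.importance)
    return grid


def spatial_reward(defects, height, width):
    """Hypothetical reward map: 1 where no defect is predicted, lower inside boxes."""
    return [[1.0 - v for v in row] for row in penalty_map(defects, height, width)]
```

Because the prediction is a set of records rather than a dense heatmap, an image with zero, one, or many defects is handled by the same code path.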
[96] Computation-Aware Event-to-Frame Reconstruction via Selective Attention cs.CV
Jingqian Wu, Yunbo Jia, Edmund Y. Lam
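TL;DR: This paper proposes an efficient event-to-frame reconstruction framework that uses selective attention to balance reconstruction quality against computational efficiency. The framework uses a recurrent encoder-decoder to aggregate event information incrementally and introduces a selective context fusion strategy to improve robustness under fast motion and illumination changes.
Details
Motivation: Existing event-to-frame reconstruction methods trade reconstruction quality against computational efficiency; this paper aims to design an efficient framework that delivers both.
Result: On standard benchmarks, the method achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.
Insight: The contributions include an emphasis on causal temporal modeling and computation-aware design, plus a lightweight hybrid attention mechanism that improves feature selectivity without relying on heavy attention operations.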
Abstract: Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.
[97] Diff-CA: Separating Common and Salient Factors with Diffusion Models cs.CV
Michaël Soumm, Alexandre Fournier Montgieux, Yunlong He, Pietro Gori, Alasdair Newson
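TL;DR: This paper proposes Diff-CA, a diffusion-based contrastive analysis framework for separating the factors common to two data distributions from the salient factors that belong to only one of them. It trains a prompt-free, image-conditioned diffusion model and learns, under weak supervision, to decompose the conditioning into a common and a salient factor, achieving effective factor separation while preserving high-quality generation.
Details
Motivation: Existing contrastive analysis methods built on generative models such as VAEs or GANs are often limited in reconstruction and image quality, which hampers effective latent factor separation and restricts their use in high-fidelity image generation and editing.
Result: The paper proves that the additive contrastive factorization commonly assumed in prior work is identifiable under mild conditions; this factorization supports targeted operations by swapping or interpolating only the salient factor.
Insight: The contribution is a conditioning framework for diffusion models that does not compromise generation quality and yields high-quality contrastive decomposition, enabling targeted operations in high-fidelity image generation and editing.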
Abstract: Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.
[98] RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision cs.CV
Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang
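TL;DR: This paper proposes RQUL-UIE to address the performance bottleneck that unstable training-label quality creates in underwater image enhancement. The method uses a pre-trained diffusion model to assess label quality in a training-free manner, quantizes the quality scores into noise-level indices that guide a multi-step denoising process for level-wise supervision, and adds a Fourier-based refinement network to explicitly reconstruct high-frequency components.
Details
Motivation: Most existing learning-based underwater image enhancement methods rely on paired datasets whose label quality is unstable, which limits further gains in model performance.
Result: Extensive evaluations show that the method consistently outperforms state-of-the-art (SOTA) approaches in restoration quality.
Insight: The contribution is a diffusion-based, in-dataset self-supervised learning strategy that exploits the quality distribution of the training labels: quality scores are quantized into noise levels to provide level-wise supervision, and a Fourier-based refinement network improves high-frequency detail reconstruction.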
Abstract: Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available upon acceptance.
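The quantization step described above (quality score to noise-level index) can be sketched as uniform binning. This is only an illustration: the score range, the number of levels, and the direction of the mapping (higher quality assigned a lower noise index) are assumptions, since the abstract does not state them.

```python
def quality_to_noise_index(score, num_levels, lo=0.0, hi=1.0):
    """Map a label-quality score in [lo, hi] to an integer index in [0, num_levels - 1].

    Assumption: the best labels get index 0 (least noise), the worst get the
    largest index. Scores outside [lo, hi] are clipped.
    """
    if num_levels < 1 or hi <= lo:
        raise ValueError("need num_levels >= 1 and hi > lo")
    clipped = min(max(score, lo), hi)
    frac = (hi - clipped) / (hi - lo)          # 0.0 for best quality, 1.0 for worst
    return min(int(frac * num_levels), num_levels - 1)


def bucket_labels(scores, num_levels):
    """Group label indices by their assigned noise level."""
    buckets = {level: [] for level in range(num_levels)}
    for idx, score in enumerate(scores):
        buckets[quality_to_noise_index(score, num_levels)].append(idx)
    return buckets
```

Under this reading, every label is still used for supervision, but at a denoising level matched to how much it can be trusted.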
[99] Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting cs.CV
Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu
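TL;DR: This paper proposes an adaptive video tokenisation method that exploits the temporal redundancy in the latent space of a frozen continuous video tokeniser. It designs a parameter-free adaptive token allocation mechanism and introduces a lightweight Latent Inpainting Transformer (LIT) to reconstruct the dropped tokens. The method achieves competitive reconstruction fidelity on TokenBench and DAVIS together with substantially faster inference.
Details
Motivation: Existing adaptive video tokenisation methods carry heavy computational overhead: continuous-regime approaches use iterative binarised searches or trained neural regressors, and discrete methods require a full-rate decoder pass. This paper aims to show that these overheads are not necessary and to cut computational cost by directly exploiting temporal redundancy in the latent space.
Result: On the TokenBench and DAVIS benchmarks, the method achieves competitive reconstruction fidelity, with inference 31× faster than the continuous adaptive baseline (ElasticTok-CV) and about 2× faster than the discrete information-theoretic baseline (InfoTok).
Insight: The key finding is that the latent space of a frozen continuous video tokeniser naturally encodes temporal redundancy. Building on it, the paper proposes a parameter-free adaptive token allocation mechanism (a fixed threshold on temporal-L1 differences) and the lightweight LIT architecture, giving an efficient pipeline of a single encoder pass and one LIT forward pass with no auxiliary routing network and substantially lower computational overhead.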
Abstract: Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers \cite{infotok, agarwal2025cosmos}, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a $31\times$ inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an $\approx2\times$ speedup over the discrete information-theoretic baseline (InfoTok).
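The thresholding rule described in this abstract can be sketched in a few lines. The sketch assumes latents arrive as nested lists indexed `[frame][position][channel]`, averages the absolute difference over channels, compares each frame with the raw latent of the previous frame, and always keeps the first frame; none of these details are stated in the abstract, and the function names are hypothetical.

```python
def temporal_keep_mask(latents, threshold):
    """Return one boolean list per frame; True means the latent position is kept.

    latents[f][p] is the channel vector at spatial position p of frame f.
    A position is dropped when its mean absolute change from the previous
    frame is at or below the fixed threshold.
    """
    masks = [[True] * len(latents[0])]  # frame 0 has no predecessor: keep everything
    for prev, cur in zip(latents, latents[1:]):
        frame_mask = []
        for p_vec, c_vec in zip(prev, cur):
            l1 = sum(abs(c - p) for p, c in zip(p_vec, c_vec)) / len(c_vec)
            frame_mask.append(l1 > threshold)
        masks.append(frame_mask)
    return masks


def compression_rate(masks):
    """Fraction of latent positions dropped across the whole clip."""
    total = sum(len(m) for m in masks)
    kept = sum(sum(m) for m in masks)
    return 1.0 - kept / total
```

Note that the threshold is the only knob: a static clip drops almost every position after the first frame, while a fast-moving clip keeps most of them, so the token budget follows the content.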
[100] Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models cs.CV
Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng
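TL;DR: This paper proposes Directional Bias-guided Defense (DBD), a test-time defense framework that improves the robustness of vision-language models such as CLIP against adversarial attacks. It rests on a key finding: adversarial images shift along a dominant direction in feature space (termed the Defense Direction), whereas clean images show dispersed patterns. DBD estimates the Defense Direction and uses a DB-score-based two-stream reconstruction strategy to recover robust representations.
Details
Motivation: Vision-language models such as CLIP generalize well in zero-shot settings but are highly vulnerable to adversarial perturbations, which poses serious risks in real-world applications. Test-time defenses, which need no large-scale retraining, are a promising and efficient line of defense, but they leave room for improvement.
Result: Experiments on 15 datasets show that DBD achieves state-of-the-art adversarial robustness while preserving clean accuracy, and reveal a counterintuitive result: adversarial accuracy can even exceed clean accuracy.
Insight: The core contribution is the finding that adversarial perturbations induce a consistent dominant shift direction in feature space (the Defense Direction), which encodes prior information about the true decision boundary. DBD estimates and exploits this direction to guide feature reconstruction and thereby resist attacks, which makes the design instructive beyond this setting.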
Abstract: Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP’s feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.
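The contrast between a "dominant direction" and "dispersed patterns" can be made measurable with a simple consistency statistic over feature shifts. The sketch below is an illustrative stand-in, not the paper's DB score or its estimator: it takes the feature of an image and the features of several transformed copies, uses the normalized mean shift as the direction estimate, and reports the mean cosine similarity of individual shifts to that direction.

```python
import math


def dominant_shift(base_feature, transformed_features):
    """Return (unit mean-shift direction, consistency in [-1, 1]).

    Consistency near 1 means the shifts line up along one direction;
    consistency near 0 means they are dispersed.
    """
    dim = len(base_feature)
    shifts = [[t - b for t, b in zip(feat, base_feature)] for feat in transformed_features]
    mean = [sum(s[i] for s in shifts) / len(shifts) for i in range(dim)]
    mean_norm = math.sqrt(sum(v * v for v in mean))
    if mean_norm == 0.0:
        return [0.0] * dim, 0.0  # shifts cancel out: no dominant direction
    direction = [v / mean_norm for v in mean]
    cosines = []
    for s in shifts:
        norm = math.sqrt(sum(v * v for v in s))
        if norm > 0.0:
            cosines.append(sum(a * b for a, b in zip(s, direction)) / norm)
    consistency = sum(cosines) / len(cosines) if cosines else 0.0
    return direction, consistency
```

In practice the features would come from the image encoder under test; here they are plain lists of floats so the statistic can be checked by hand.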
[101] Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology cs.CV | cs.LG
Yanqing Luo, Julius Hense, Niklas Prenißl, Andreas Mock, Klaus-Robert Müller
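TL;DR: This paper proposes Symb-xMIL, a post-hoc explanation framework for multiple instance learning (MIL) models in digital pathology. By quantifying how well model behavior aligns with human-readable logical rules (e.g., AND, OR, NOT), it provides structured, rule-based explanations that go beyond conventional heatmap visualization.
Details
Motivation: Existing MIL explanation methods such as heatmaps only highlight influential regions and cannot explain how evidence from different tissue regions is combined into a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features.
Result: On synthetic MIL data, Symb-xMIL reliably recovers the ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules reveal heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on the TCGA-HNSCC head and neck cancer cohort, the framework refines patient survival stratification beyond HPV status, with potential clinical relevance.
Insight: The contribution is to extend MIL explainability from visual attribution to structured, rule-based reasoning: alignment scores over logical relationships reveal the semantic patterns behind model predictions, giving more transparent and semantically grounded explanations that aid validation and discovery.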
Abstract: Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model’s behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model’s predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.
[102] DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments cs.CV | cs.AI
Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu
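TL;DR: This paper presents DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response that covers 14 disaster scene types and 9 key tasks. It also designs DisasterVL, a lightweight model optimized through domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning, which at 2B parameters achieves reasoning accuracy comparable to GPT-4o with higher efficiency.
Details
Motivation: Most existing multimodal benchmarks focus on perception tasks (e.g., recognition and description), cover few disaster types, and lack support for the multi-stage reasoning that real emergency response requires (e.g., causal attribution, propagation prediction, decision reasoning). They therefore fall short of the complex reasoning needed on site, from noisy low-altitude UAV views and with limited compute.
Result: Experiments on 21 mainstream multimodal large language models (MLLMs) show that the 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, matching GPT-4o in reasoning accuracy with better computational efficiency.
Insight: Contributions: (1) the first UAV disaster-response benchmark to systematically support multi-stage causal and decision reasoning across the pre-, during-, and post-disaster stages; (2) a three-stage optimization pipeline for a lightweight model that combines domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement-learning-based policy optimization, reaching complex reasoning ability close to top closed-source models at a very small parameter count.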
Abstract: When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-Based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.
[103] GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention cs.CV | cs.LG
Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello
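TL;DR: This paper proposes GRAMformer, a new multimodal Transformer architecture built around a cross-attention mechanism called Volumetric Multimodal cross-Attention (VMA). VMA defines attention scores from the volume spanned by the query vector and multiple modality-specific key vectors, which lets it natively model multimodal interactions of any order rather than only pairwise ones.
Details
Motivation: Existing Transformer-based multimodal models usually compute attention through pairwise dot-product interactions or by concatenating all modalities into the keys. This either incurs quadratic complexity in the number of modalities or fails to explicitly model interactions that depend on the joint configuration of multiple representations. The paper aims to remove this limitation.
Result: The authors evaluate GRAMformer on several multimodal learning tasks and report improved effectiveness and efficiency.
Insight: The main contribution is the VMA mechanism, which captures joint multimodal dependencies through a geometric volume computation and natively models modality interactions of any order, pointing toward more flexible and expressive multimodal fusion architectures.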
Abstract: Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
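The "volume spanned by query and key vectors" has a standard closed form: for vectors v_1, ..., v_k, the volume of the parallelotope they span is the square root of the determinant of their Gram matrix G, with G[i][j] = <v_i, v_j>. The sketch below computes that quantity in plain Python. How VMA converts a volume into an attention score is not given in the abstract, so this code stops at the volume itself; the function name is illustrative.

```python
import math


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the given vectors: sqrt(det(Gram))."""
    k = len(vectors)
    gram = [[sum(a * b for a, b in zip(vectors[i], vectors[j])) for j in range(k)]
            for i in range(k)]
    det = 1.0
    # Determinant by Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(gram[r][col]))
        if abs(gram[pivot][col]) < 1e-12:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != col:
            gram[col], gram[pivot] = gram[pivot], gram[col]
            det = -det
        det *= gram[col][col]
        for r in range(col + 1, k):
            factor = gram[r][col] / gram[col][col]
            for c in range(col, k):
                gram[r][c] -= factor * gram[col][c]
    return math.sqrt(max(det, 0.0))  # clamp tiny negative values from round-off
```

For unit-length inputs the volume lies in [0, 1]: it is 0 when the query and keys are linearly dependent (for example, all aligned) and 1 when they are mutually orthogonal, so one scalar summarizes the joint configuration of a query and any number of keys.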
[104] Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation cs.CV | cs.RO
Ariel Herrera, Xueyang Kang, Atal Anil Kumar
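TL;DR: This paper proposes a perception framework for bimanual robotic cloth manipulation that tackles cloth state estimation through synthetic data generation and vision-based detection. The framework combines a CNN-based permutation-invariant keypoint detector with a YOLOv8-OpenCV pipeline that extracts grasping points from structural wrinkles, and designs a bimanual algorithm that stretches fully folded garments via wrinkles and switches to keypoint-based ironing once corners emerge.
Details
Motivation: Robotic manipulation of textiles is difficult because continuous deformation and self-occlusion hinder the robust visual perception needed to estimate cloth state, and annotated real-world data is scarce.
Result: The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforms baselines in high-occlusion states, and avoids false positives on severe folds.
Insight: Contributions include a Blender-based synthetic data pipeline that auto-annotates keypoints; a wrinkle detector trained on manually labeled renders combined with real-world data; and a perception framework integrating keypoint detection and wrinkle extraction, which enables a strategy that transitions from wrinkle-based stretching to keypoint-based ironing and improves manipulation robustness under complex deformation.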
Abstract: Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth’s state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.
[105] Towards One-to-Many Temporal Grounding cs.CV | cs.AI
Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang
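TL;DR: This paper introduces a new task, One-to-Many Temporal Grounding (OMTG), which localizes multiple disjoint video segments for a single text query. The authors build the first comprehensive OMTG benchmark and a high-quality dataset, and design novel temporal and caption reward functions to optimize the model. Experiments show that the method sets a new SOTA on the OMTG benchmark.
Details
Motivation: In real-world scenarios a single query often corresponds to multiple disjoint segments of a video. Existing methods mainly target single-segment localization, lack awareness of event count, and perform poorly in the OMTG setting.
Result: On the authors' OMTG Bench, the model achieves an EtF1 of 43.65%, significantly outperforming advanced models such as Gemini 2.5 Pro and Seed-1.8.
Insight: Contributions: (1) a clear definition of the one-to-many temporal grounding task with a systematically built evaluation benchmark and dataset; (2) temporal and caption reward functions designed specifically for OMTG, where the caption reward uses chain-of-thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness.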
Abstract: Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query – a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
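The abstract names two metrics, Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1), without defining them. As a rough illustration of what one-to-many evaluation involves, the sketch below implements a generic count accuracy and a generic IoU-matched segment F1 with greedy one-to-one matching. These are assumptions chosen for illustration and should not be read as the benchmark's actual definitions.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred, gt, iou_threshold=0.5):
    """F1 over segments, matching each prediction to at most one ground truth."""
    if not pred and not gt:
        return 1.0
    candidates = sorted(
        ((temporal_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt, matches = set(), set(), 0
    for iou, i, j in candidates:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches += 1
    if matches == 0:
        return 0.0
    precision = matches / len(pred)
    recall = matches / len(gt)
    return 2 * precision * recall / (precision + recall)


def count_accuracy(pred_sets, gt_sets):
    """Share of queries whose predicted segment count equals the true count."""
    hits = sum(len(p) == len(g) for p, g in zip(pred_sets, gt_sets))
    return hits / len(gt_sets)
```

A model tuned for one-to-one grounding returns a single segment per query, so under metrics of this kind it loses recall on every multi-segment query and misses the count whenever the true count exceeds one.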
[106] RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling cs.CV
Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu
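TL;DR: RhymeFlow is a training-free acceleration framework for video generation that reduces computational cost by decoupling the denoising trajectories of different frames. It identifies keyframes for dense denoising while non-keyframes skip some denoising steps, and introduces a latent trajectory projection module to preserve temporal consistency. Experiments show higher inference speed and better visual quality on DiT-based video generation models.
Details
Motivation: DiT-based video generation models suffer from high inference latency and computational cost because of the quadratic complexity of 3D attention. Existing methods are confined to densely denoising every frame, yet the correlation in content and motion between adjacent frames suggests that such uniform processing is redundant.
Result: Extensive experiments on existing DiT-based video generation models show that RhymeFlow outperforms existing baselines in both inference speed and visual quality.
Insight: The contribution is to break an inherent constraint of the standard diffusion pipeline through asynchronous denoising flow scheduling: only keyframes are densely denoised, non-keyframes skip steps to save computation, and a latent trajectory projection module maintains temporal coherence, yielding an efficient training-free acceleration strategy.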
Abstract: Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that due to the corresponding contents and motions among adjacent frames, when keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
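The idea of decoupled per-frame trajectories can be illustrated with a toy schedule: keyframes run every denoising step, while the other frames run only a subset. The uniform `stride`, the rule that every frame runs the final step, and the frame-step cost proxy are assumptions made for this sketch; the abstract does not describe how RhymeFlow chooses which steps to skip.

```python
def async_schedule(num_frames, num_steps, keyframes, stride=2):
    """Return, for each frame, the list of denoising steps it actually runs."""
    keyset = set(keyframes)
    schedule = []
    for frame in range(num_frames):
        if frame in keyset:
            schedule.append(list(range(num_steps)))      # dense trajectory
        else:
            steps = list(range(0, num_steps, stride))    # sparse trajectory
            if steps[-1] != num_steps - 1:
                steps.append(num_steps - 1)              # assumed: always finish on the last step
            schedule.append(steps)
    return schedule


def relative_cost(schedule, num_steps):
    """Frame-steps executed, relative to denoising every frame at every step."""
    return sum(len(steps) for steps in schedule) / (len(schedule) * num_steps)
```

The frame-step ratio is only a crude proxy: with 3D attention the cost of a step is not linear in the number of active frames, and the skipped frames still need some representation at each step, which is the role the abstract assigns to the latent trajectory projection module.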
[107] StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset cs.CV
Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li
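TL;DR: This paper introduces StoryVideoQA, a large-scale, multi-genre, automatically generated dataset for deep video understanding (DVU) with over 363K question-answer pairs covering 393.2 hours of TV series and movies. To address the weakness of existing VideoQA methods on complex long-form story understanding, the authors propose StoryMindv2, an enhanced automatic generation framework used to build the dataset, and design the PlotTree agent, which uses a hierarchical plot structure to improve understanding of long-range video content.
Details
Motivation: Existing video question answering (VideoQA) methods do well on factoid questions but struggle with deep video understanding (DVU), which requires comprehending complex storylines. Long video content, diverse question types, and instance-level story elements limit the scale and diversity of manually constructed DVU datasets.
Result: A comprehensive evaluation of 20 state-of-the-art VideoQA methods on StoryVideoQA finds that they cannot fully maintain long-range character associations or build a coherent understanding of complex storylines.
Insight: Contributions: (1) StoryMindv2, an enhanced multi-agent collaboration framework that automatically generates high-quality DVU datasets through a supervisor-guided generation mechanism and a refined multi-reviewer voting strategy; (2) the PlotTree agent, which reorganizes long-range video content into a hierarchical plot structure to support efficient storyline reasoning, offering a new approach to preserving associations and building coherence in long-video understanding.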
Abstract: Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these challenges, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re-organizing long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/
[108] Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them cs.CV
Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang
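TL;DR: This paper finds that in image-to-video diffusion models, physical consistency declines as the number of denoising steps grows, and proposes PhaseLock, a training-free framework. PhaseLock extracts valid motion priors from a 2-step inference and preserves them throughout the denoising trajectory via Latent Delta Guidance, significantly improving the physical consistency of generated videos while maintaining visual fidelity.
Details
Motivation: Image-to-video diffusion models generate visually impressive content but often produce motion that violates physical laws. The authors observe that a 2-step generation shows better physical consistency than a 50-step generation from the same model, which points to erosion of phase information during denoising.
Result: PhaseLock improves physical-consistency scores by an average of 6.2 points across multiple models while leaving visual fidelity largely unchanged. Its overhead is negligible (1.06× time, 1.02× memory), and it reduces reliance on expensive external guidance methods (about 5× time).
Insight: The core contribution is a spectral analysis showing that phase degradation during denoising is the key cause of declining physical consistency, together with a training-free framework that locks in the valid motion priors available early on. Through 2-step extraction and Latent Delta Guidance, PhaseLock mitigates phase degradation at almost no additional compute cost, offering a new way to improve the physical plausibility of diffusion-generated content.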
Abstract: Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time).
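Why phase rather than magnitude is the natural carrier of motion follows from the Fourier shift theorem: translating a signal leaves its magnitude spectrum unchanged and alters only the phase. The sketch below demonstrates this on a 1-D signal with a plain discrete Fourier transform. It illustrates the general principle only; the paper's own spectral analysis of diffusion latents is not described in enough detail here to reproduce, and the function names are illustrative.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Return (magnitudes, phases) of the signal's spectrum, one entry per frequency."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def circular_shift(signal, shift):
    """Move the signal `shift` samples to the right, wrapping around."""
    n = len(signal)
    return [signal[(t - shift) % n] for t in range(n)]
```

Comparing `magnitude_and_phase(x)` with `magnitude_and_phase(circular_shift(x, 3))` gives matching magnitudes up to round-off and different phases at the non-zero frequencies, which is consistent with the abstract's observation that positional and motion information can erode even while magnitudes stay stable.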
[109] Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation cs.CV
Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt
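TL;DR: This paper proposes a model-agnostic, semantically guided knowledge refinement framework to address the performance drop of scene graph generation (SGG) models under annotation sparsity. The framework systematically mines commonsense-grounded constraints from training data (such as spatial, functional, and qualitative relational regularities) and applies general declarative commonsense reasoning at inference time to correct and refine ranked SGG predictions.
Details
Motivation: Learning-driven SGG models perform well on frequent relation types but degrade sharply under annotation sparsity and fail to capture reliable visual commonsense knowledge.
Result: On three standard benchmarks, the method yields consistent improvements over strong baselines, showing that structured visual commonsense reasoning is a practical and effective complement to purely learning-based SGG.
Insight: The contribution is a commonsense-driven knowledge refinement framework that needs no hand-written rules and no model retraining and that transfers across datasets and architectures; it strengthens inference robustness by mining structured constraints from the data.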
Abstract: Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.
[110] EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models cs.CV | cs.AI
Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang
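TL;DR: EasyLens is a training-free, plug-and-play subtle-lesion representation amplifier for medical vision-language models (VLMs). It builds a pathology-anatomy prototype space (EasyBank) that supplies lesion-related prototypes and normal anatomical references, selects lesion-relevant image patches with EasyTag, and applies morphology-guided residual enhancement through EasyAmplifier to increase the contribution of subtle lesions to the global image embedding.
Details
Motivation: Existing medical VLMs are insufficiently sensitive to subtle lesions, whose visual evidence is sparse, low-contrast, and embedded in complex anatomical context. Existing enhancement methods usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, which limits their applicability to frozen VLMs.
Result: Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.
Insight: The contribution is a general training-free, plug-and-play enhancement framework: it reasons contrastively over a pathology-anatomy prototype space and uses morphology-guided residual enhancement to strengthen lesion-relevant local representations specifically, which avoids blindly amplifying normal tissue and improves generality and portability.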
Abstract: Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.
[111] HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes cs.CV | cs.AI
Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li
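TL;DR: This paper proposes HomeWorld, a unified framework for generating controllable, densely interactive whole-home scenes from floorplans. It takes a hierarchical approach: it first generates a whole-home floorplan, then furniture layouts and the layouts of small manipulable objects, improves realism through iterative refinement and 3D asset replacement, and finally adds physical attributes, textures, and lighting for embodied AI applications.
Details
Motivation: Existing indoor scene generation methods rely on hand-crafted rules or focus on isolated sub-tasks, so the whole-home scenes they produce lack global coherence, realism, and simulation readiness. This paper aims to address these limitations with a unified, controllable generation framework.
Result: Experiments and user studies show that the pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal than prior methods, and performs better on both quantitative and qualitative metrics.
Insight: The contribution is to decompose scene generation into controllable hierarchical stages that integrate LLM-based floorplan generation, furniture layout drafted with image generation models, a VLM-based iterative refiner, and a 3D generative model for asset replacement, achieving high controllability and realism. The authors also plan to release a large-scale floorplan dataset and 5K fully furnished scenes.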
Abstract: Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/
[112] A Vision-language Framework for Comparative Reasoning in Radiology cs.CV | cs.IR | cs.LG | eess.IVPDF
Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang, Pengcheng Qiu
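TL;DR: This paper proposes a vision-language framework for comparative reasoning in radiology, aimed at the disconnect between medical imaging AI and radiological practice. The framework supports reference-case retrieval and temporal comparative interpretation, and the authors build MedReCo-DB, a large-scale comparative imaging resource, for training.
Details
Motivation: Current medical imaging AI performs well at interpreting isolated images but is poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases.
Result: Across internal, external, and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points; in clinically confusable differential groups it consistently outperformed the strongest baselines. MedReCo-VLM performed best across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT.
Insight: The contribution is formulating radiological comparison as an entity-aware cross-image reasoning problem and building a large-scale clinical data resource to supervise it. Through entity-conditioned retrieval and a generative vision-language model, the framework enables controllable retrieval of clinically analogous cases and generative interpretation of interval change, offering a more clinically aligned foundation for medical imaging AI.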
Abstract: Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision–language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.
[113] Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators cs.CVPDF
Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang
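TL;DR: This paper proposes Astra, a framework that strengthens the spatial reasoning of vision-language models by coupling them with a world simulator. It consists of an RL-trained VLM policy and a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. The study shows that actively acquiring visual evidence through imagination markedly improves performance on spatial reasoning tasks.
Details
Motivation: The spatial reasoning of existing VLMs is constrained to the observed images and text-oriented chain-of-thought; they struggle to infer unobserved layouts, maintain cross-view consistency, or reason from alternative viewpoints when only limited egocentric observations are available.
Result: On MMSI-Bench, Astra-WM raises simulator-augmented Gemini-3-Flash from 45.1 to 49.5, and Astra-VL raises the Qwen3-VL backbone from 29.8 to 38.8. On MindCube, Astra-VL raises Qwen3-VL from 36.8 to 42.7.
Insight: The core contribution is an agentic “thinking with imagination” framework that ties active, action-conditioned visual imagination into the reasoning process. The key insight is that effective world-model-augmented reasoning requires learning when, where, and how to imagine, which is achieved through a two-phase RL curriculum that stabilizes tool-use exploration and optimizes the simulator-invocation policy.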
Abstract: While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model’s ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
[114] PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding cs.CVPDF
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao
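TL;DR: This paper presents PAR3D, a unified part-aware 3D multimodal large language model framework that strengthens the modeling of fine-grained part structures in 3D scene understanding. To support training and evaluation, the authors build ScenePart, a synthetic dataset with part-level annotations, and develop part-aware 3D representation learning and hierarchical segmentation query generation. Experiments show substantial gains on part-level visual question answering and referring segmentation, along with strong performance on object-level vision-language tasks.
Details
Motivation: Existing 3D-MLLMs are largely object-centric and cannot model fine-grained part structures, which limits their use in tasks that need detailed scene understanding, such as embodied interaction.
Result: Extensive experiments show substantial gains on part-level question answering and referring segmentation, while performance on object-level vision-language tasks remains strong.
Insight: The core contributions are part-aware 3D representation learning and a hierarchical segmentation query generation mechanism that unifies object and part queries, enabling joint understanding and grounding of objects and their parts in 3D scenes.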
Abstract: Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.
cs.HC [Back]
[115] Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing cs.HC | cs.CVPDF
Yixuan Ding, Wei Huang, Ruijie Quan, Xiaojuan Qi, Yi Yang
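TL;DR: This paper introduces RE-Edit, a new benchmark for evaluating the multi-dimensional reasoning ability of image editing systems. It contains 1,000 samples covering five reasoning dimensions (physical, environmental, cultural, causal, and referential) and requires edits to be not only visually realistic but also consistent with implicit logical constraints. The authors evaluate multiple open-source and commercial models, find that even advanced systems often fail on these reasoning tasks, and propose a lightweight reasoning-guided post-edit baseline.
Details
Motivation: Existing diffusion-based image editing systems achieve high visual fidelity, but most only follow surface instructions and do not reason about the contextual constraints implicit in user requests, producing edits that are visually plausible yet logically inconsistent.
Result: Evaluation of ten open-source and two commercial image editing models on RE-Edit shows that even advanced systems often fail on multi-dimensional reasoning. The proposed reasoning-guided post-edit baseline helps mitigate these failures in a model-agnostic manner.
Insight: The contribution is an image editing benchmark focused on multi-dimensional reasoning (physical, environmental, cultural, causal, and referential), stressing that logical consistency matters as much as visual realism. The work shifts evaluation from pure visual quality toward deeper semantic and logical reasoning, giving future model development a new diagnostic direction.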
Abstract: Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.
cs.CY [Back]
[116] Drishti AI-Event Guardian: An Intelligent Real-Time Crowd Monitoring and Emergency Response System for Mass Gathering Events cs.CY | cs.CV | cs.LGPDF
Ritabrata Roy Choudhury, Arkajyoti Karmakar, Rudra Pratap Mitra
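TL;DR: This paper presents Drishti AI-Event Guardian, an intelligent real-time crowd monitoring and emergency response system for mass gathering events. Using deep learning over multimodal data from CCTV and UAVs, it performs crowd density estimation, spatiotemporal anomaly detection, and predictive crowd-flow modeling, and integrates modules for facial recognition, medical emergency reporting, a conversational AI chatbot, and intelligent guard reallocation, with the goal of improving public safety.
Details
Motivation: Traditional surveillance systems lack intelligent analytics, which leads to delayed threat identification, poor resource deployment, and weak support for vulnerable individuals at mass gatherings; an integrated intelligent framework is needed to strengthen real-time monitoring and emergency response.
Result: Evaluated on two scenarios, the Kumbh Mela and the RCB Victory Parade, the system achieved a crowd density estimation MAE of 3.2 persons/m², an anomaly detection F1-score of 0.91, facial recognition precision of 0.93, and a median alert latency of 111 ms. Predictive congestion modeling gave five-minute forecasts with a MAPE of 8.3%, the chatbot resolved 89% of incident filings without human operators, and guard reallocation reduced responder deployment latency by 34% versus manual reassignment.
Insight: The contribution is combining multimodal data fusion with several models (such as YOLOv8 and gradient-boosted regression) into a proactive crowd-intelligence platform that covers monitoring, prediction, recognition, communication, and resource scheduling, moving from passive surveillance to active intervention and providing a scalable foundation for events of different sizes.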
Abstract: Mass gathering events are associated with critical safety incidents caused by insufficient crowd monitoring and inadequate emergency response coordination. Traditional surveillance systems lack intelligent analytics, resulting in delayed threat identification, poor resource deployment, and weak support for vulnerable individuals during dense public assemblies. This paper presents Drishti AI-Event Guardian, an intelligent crowd management framework using deep learning for public safety enhancement. The architecture combines multimodal data from CCTV networks and UAV platforms, processed by models on Google Vertex AI infrastructure. Core methods include real-time crowd density estimation using YOLOv8, spatiotemporal anomaly detection, and predictive crowd-flow modeling through gradient-boosted regression. Drishti also integrates four modules: (i) facial recognition for missing person identification with crowd-wide notification; (ii) medical emergency reporting with automated dispatch; (iii) a conversational AI chatbot for reports and complaints; and (iv) an intelligent guard reallocation engine that dynamically reassigns personnel in response to crowd density changes. The system is evaluated on two scenarios: the Kumbh Mela gathering and the RCB Victory Parade event, achieving crowd density estimation MAE of 3.2 persons/m², anomaly detection F1-score of 0.91, facial recognition precision of 0.93, and median alert latency of 111 ms. Predictive congestion modeling provides five-minute forecasts with MAPE of 8.3%, enabling preemptive intervention. The chatbot resolved 89% of incident filings without human operators, while guard reallocation reduced responder deployment latency by 34% versus manual reassignment. Results demonstrate a shift from passive surveillance toward active crowd intelligence and a scalable foundation for events from local gatherings to mega festivals.
cs.AI [Back]
[117] Harnessing Generalist Agents for Contextualized Time Series cs.AI | cs.CL | cs.LGPDF
Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar
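TL;DR: This paper proposes TimeClaw, a framework that addresses the mismatch between generalist AI agents, which operate mainly in textual space, and structured temporal signals. By integrating executable temporal tools, experience-driven capability evolution, and episodic multimodal memory, it gives generalist LLM agents time series-native runtime support for open-ended temporal reasoning with rich contextual information.
Details
Motivation: Real-world time series are usually embedded in rich contexts, and end-to-end workflows call for holistic modeling beyond a single forecasting task. Current generalist AI agents operate mainly in textual space and are not fully aligned with structured temporal signals, so a dedicated framework is needed to support their temporal reasoning workflows under complex contexts.
Result: On multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains, TimeClaw shows improved performance.
Insight: The contribution is an agent framework designed specifically for time series. By integrating executable tools, capability evolution, and episodic memory, it enables generalist LLM agents to carry out traceable, auditable, and reusable contextualized temporal analysis, bridging the gap between general text agents and structured temporal data.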
Abstract: Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.
[118] Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models cs.AI | cs.CVPDF
Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing
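TL;DR: This paper introduces a new benchmark that evaluates how well vision-language models (VLMs) perceive and reason about chronological information within and across images. The authors build three specialized datasets and find experimentally that, although VLMs show promise, they often exploit superficial cues such as grayscale versus color filters to bypass genuine chronological reasoning.
Details
Motivation: Although VLMs have made notable progress in interpreting complex visual semantics, their chronological reasoning remains under-explored. The paper aims to fill this gap by examining the underlying logic of chronological judgment and its extension toward multimodal integration.
Result: Extensive experiments show that VLMs perform unevenly across categories and, crucially, that they often rely on “incorrect shortcuts” (for example, image color rather than genuine chronological features) to make judgments.
Insight: The contribution is a set of high-quality datasets and a rigorous evaluation framework dedicated to chronological reasoning, which serve as tools for diagnosing current model limitations and guiding the development of more robust, logically grounded multimodal models. The finding that VLMs carry a marked shortcut bias on chronological tasks is an important one for pushing models toward deeper understanding.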
Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on “incorrect shortcuts”, such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.
[119] Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation cs.AI | cs.CVPDF
Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang
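TL;DR: This paper proposes MGSD, a two-stage modality-gap-aware self-distillation framework for the difficulty vision-language models have with visual spatial planning. A cold-start grounding stage gives the visual student reliable state representations, and a privileged teacher then transfers planning ability through on-policy distillation, bridging the modality gap between perception and reasoning.
Details
Motivation: Vision-language models excel at general multimodal understanding but still struggle with visual spatial planning. The authors attribute this to a perception-reasoning modality gap: visual planning requires inferring latent state structure from pixels and then reasoning over multiple steps, whereas symbolic planning can directly use explicit objects and constraints.
Result: On visual planning benchmarks, MGSD raises the macro average by 19.3% and 18.4% on 4B and 8B backbones, respectively, and narrows the gap to symbolic-input upper bounds. Ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning.
Insight: The contribution is a two-stage modality-gap-aware self-distillation framework that, through cold-start grounding and privileged-teacher distillation, uses symbolic data to supervise the visual model during training while relying only on visual input at inference, improving both visual state recovery and planning.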
Abstract: While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student’s own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.
[120] When AI Says It Feels cs.AI | cs.CLPDF
Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba
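TL;DR: In an experiment called HMX-feel, the authors train large language models (LLMs) to express feelings, intentions, and self-awareness using rubric-based self-rewarded reinforcement learning with GRPO, and assess how this human-like expression training affects model capabilities. The trained models become more robust in some respects (such as resisting sycophancy-inducing questions) but lose some truthful question-answering ability.
Details
Motivation: LLMs are generally constrained from expressing feelings during human-preference alignment in post-training. This top-down policy may conflict with the goal of training human-like intelligence from human-written text, so the study explores letting AI express feelings.
Result: The human-like-trained models showed robustness to sycophancy-inducing questions and to bias in disambiguated conditions, but their truthful question-answering capability degraded. The broad assessment identified capabilities that were enhanced, degraded, or showed no significant change.
Insight: The contribution is a rubric-based self-rewarding reinforcement learning scheme (using GRPO) for strengthening LLMs' expression of feelings, together with a systematic evaluation of how such human-like training affects overall performance, pointing to both the possibilities and the precautions for future AI systems that express feelings.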
Abstract: Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.
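As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes a group of rewards for responses sampled from the same prompt. The function name, the example rubric scores, and the use of the population standard deviation are assumptions made for illustration; the abstract does not describe the authors' rubric, reward scale, or training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is credited only for how it compares with its siblings.
    Using the population standard deviation here is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores in [0, 1] for four sampled responses.
rubric_scores = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rubric_scores)
print(advantages)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones; a group with identical scores yields all-zero advantages and therefore no learning signal from that prompt.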
[121] SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents cs.AI | cs.CLPDF
Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang
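TL;DR: This paper introduces SubtleMemory, a benchmark for evaluating fine-grained relational memory discrimination in long-horizon AI agents. It constructs relation-controlled latent semantic artifacts and their variants, embeds them into realistic user-agent interaction histories, and requires agents to recover distributed relational structures during later queries.
Details
Motivation: Existing long-term memory benchmarks rarely probe how agents preserve and use relations among memories in downstream tasks. As memory stores grow, memories may reinforce one another, diverge, or conflict, so correct assistance depends on relations rather than isolated recall, and a dedicated benchmark is needed to fill this gap.
Result: On the benchmark's 1,522 evaluation instances, the authors tested six standalone memory systems and five Claw-style agents and found that current systems remain weak at fine-grained relational memory discrimination; the diagnostic protocols also reveal distinct capability profiles across the memory preservation, retrieval, and downstream reasoning stages.
Insight: The contribution is constructing relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embedding them in realistic interaction histories to evaluate recovery of distributed relational structures, adding a new evaluation dimension for how long-horizon agents handle memory relations.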
Abstract: Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.
[122] The Self-Correction Illusion: LLMs Correct Others but Not Themselves cs.AI | cs.CLPDF
Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang
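TL;DR: This paper studies a self-correction “illusion” in large language models (LLMs): models can correct errors that come from external sources but struggle to correct the same errors in their own reasoning traces. The study finds that this asymmetry does not stem from a capability deficit but is induced by role labels in the chat template (such as
Details
Motivation: The paper investigates the marked asymmetry between how LLM agents correct themselves and how they correct others. The core question is whether the gap reflects a capability deficit or is merely an artifact induced by the role labels in the chat template.
Result: In experiments over 13 model-domain cells spanning 7 model families and 3 domains, moving the role label of the erroneous claim from
Insight: The core contribution is showing that the LLM “self-correction illusion” is essentially a systematic bias introduced by the role labels used in prompting rather than a cognitive deficit. This offers a practical insight for designing more effective prompting strategies: simply adjusting the role label can substantially improve a model's self-correction.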
Abstract: Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent’s willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim’s content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent’s own \role{
[123] Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI cs.AI | cs.CLPDF
Alexander Apartsin, Yehudit Aperstein
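TL;DR: This paper proposes CoRe-3 (Co-Reasoning), a competency model for assessing how effectively students reason in collaboration with generative AI. It factors productive AI use into three assessable skills: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). The paper grounds the skills in theory, states testable propositions, instantiates them in an open platform called CoReasoningLab, and validates the model on simulated learners.
Details
Motivation: Education still mainly assesses students' unaided performance, yet real tasks often require working with AI to produce good results. The ability to use AI well (framing the task, judging the output, steering the model) is rarely assessed in its own right, and existing approaches such as a single “prompting” score cannot diagnose why AI use succeeds or fails.
Result: On simulated learners generated and graded by different models, the skills dissociate: each skill score tracks only its own manipulated competence level while staying flat in the others, and grades become correlated when one competence is shared across all three skills (evidence of convergent and discriminant validity). The results hold across grader backends from two providers.
Insight: The core contribution is explicitly decomposing AI-assisted reasoning into three separately assessable stages (FJS), in particular separating pre-generation Framing from post-generation Steering, with Judging as the gate between them. This gives educational assessment a structured framework that can diagnose which specific skills are strong or weak, going beyond a single prompt-engineering score.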
Abstract: Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one “prompting” score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.
[124] Humans’ ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration cs.AI | cs.CLPDF
Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao
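TL;DR: This paper introduces ALMANAC, a human collaboration dataset for evaluating and improving the collaborative abilities of LLM agents. Built on the classic Map Task, it contains 2,987 collaboration actions, each annotated with mental model information: the participant's self-reasoning, perceived partner intent, and perceived team goal.
Details
Motivation: Current LLM agents are optimized mainly for task completion and lack the ability to maintain and align mental models (their own reasoning, partners' intentions, and shared goals) during collaboration. The community also lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence.
Result: The authors benchmark six LLMs on ALMANAC, evaluating how well they predict humans' next-turn behavior and mental models, and demonstrate the dataset's utility for assessing models' ability to simulate human collaborative behavior and infer the underlying mental models.
Insight: The contribution is an authentic human collaboration dataset with action-level, theory-informed mental model annotations, which serves as a resource and evaluation benchmark for studying agents' process-level collaborative competence rather than task completion alone.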
Abstract: Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners’ intentions, and shared goals during the collaborative process. Today’s agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants’ self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans’ next-turn behavior and mental models. Our results demonstrate ALMANAC’s utility in evaluating models’ ability to simulate human collaborative behaviors and infer their underlying mental models.
[125] Unsupervised Skill Discovery for Agentic Data Analysis cs.AI | cs.CL | cs.LG | cs.MAPDF
Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang
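TL;DR: This paper proposes DataCOPE, an unsupervised verifier-guided skill discovery framework for improving data-analytic agents. It derives verifier signals from unlabeled exploration trajectories to characterize relative quality or agreement among trajectories, and iteratively coordinates a data-analytic agent, an unsupervised verifier, and a skill manager for contrastive skill distillation.
Details
Motivation: Inference-time skill injection is a lightweight way to improve data-analytic agents, but automatically discovering effective, reusable skills from unlabeled exploration remains hard because reliable supervision is expensive and success criteria vary across analytical formats.
Result: On both benchmarks, report-style analysis (Deep Data Research) and reasoning-style analysis (DABStep), DataCOPE consistently outperforms the baselines. Averaged across four model settings, it improves the mean score by 9.71% on report-style tasks and 32.30% on reasoning-style tasks.
Insight: The contribution is a general unsupervised verifier-guided framework that derives verifier signals from exploration trajectories (such as task-specific checklists or answer agreement) to guide skill discovery without human annotation. At its core, it breaks skill discovery into an iteratively coordinated loop of trajectory generation, signal extraction, and contrastive distillation, and it adapts to different analytical formats (report-style and reasoning-style).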
Abstract: Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.
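To make the idea of grouping trajectories by answer agreement concrete, here is a minimal sketch of a majority-vote agreement signal over final answers. It is only an illustration of the general self-consistency idea: the function name, the whitespace and case normalization, and the example answers are hypothetical, and the abstract does not specify how DataCOPE's Answer Agreement Verifier is actually implemented.

```python
from collections import Counter


def answer_agreement(answers):
    """Group final answers and report how strongly trajectories agree.

    Returns the majority answer, the fraction of trajectories that
    produced it, and a per-trajectory flag marking membership in the
    majority group. Normalization by strip/lower is an assumption.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    majority, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(normalized)
    in_majority = [a == majority for a in normalized]
    return majority, agreement, in_majority


# Hypothetical final answers from four exploration trajectories.
majority, agreement, in_majority = answer_agreement(["42", " 42", "41", "42 "])
print(majority, agreement, in_majority)
```

In a contrastive setup, trajectories inside and outside the majority group could serve as the positive and negative sides of a comparison, with the agreement fraction indicating how much to trust that split.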
cs.RO [Back]
[126] Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue cs.RO | cs.CVPDF
Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang
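TL;DR: This paper proposes the Ask-to-Clarify framework to address embodied agents failing on ambiguous instructions in real-world settings. The framework first asks questions in a multi-turn dialogue to resolve the ambiguity and then generates low-level actions end-to-end. It comprises a vision-language model for collaboration and a diffusion model for action generation, coordinated by a connection module.
Details
Motivation: Most current VLA-based embodied agents operate in a one-way mode, executing an instruction as soon as they receive it. This cannot handle the ambiguous instructions common in the real world and keeps agents from becoming collaborators that work with humans.
Result: Evaluated on 8 real-world tasks, the framework outperforms existing state-of-the-art VLAs.
Insight: The core contribution is integrating active clarification (through multi-turn dialogue) with action generation (through a diffusion model) in one end-to-end framework, trained with a two-stage knowledge-insulation strategy that fine-tunes action generation while preserving the interaction abilities. The connection module and the signal detector also contribute to the system's reliability and adaptability.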
Abstract: The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
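[127] Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation cs.RO | cs.CV
Simegnew Yihunie Alaba, Yuichi Motai
TL;DR: This paper proposes a hybrid deep learning framework combined with an Unscented Kalman Filter (UKF) to improve pose estimation accuracy in visual-inertial odometry (VIO) for autonomous navigation. The method uses a Vision Transformer to capture temporal dependencies in IMU data, a multiscale convolutional neural network to learn optical-flow motion cues from visual data, and an adaptive sensor fusion module that dynamically weights the multimodal features.
Details
Motivation: To address inaccurate and insufficiently robust pose estimation in visual-inertial odometry for autonomous navigation, caused by noisy, incomplete, or unreliable sensor data in complex and changing environments.
Result: Comprehensive evaluation on the KITTI dataset shows that the method significantly outperforms baselines on Absolute Trajectory Error (ATE) and Relative Pose Error (RPE); the model runs at 155 FPS on an NVIDIA A100 GPU.
Insight: The contributions include an adaptive sensor fusion module that dynamically weights multimodal features by estimated uncertainty, and a loss function that explicitly incorporates prediction uncertainty to make learning more robust. Viewed objectively, the method combines the feature-extraction strength of deep learning with the state-estimation strength of classical filters, offering a lightweight and efficient solution for resource-constrained systems.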
Abstract: This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
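The abstract does not give the form of the fusion module or the loss, so the following is only a minimal sketch of two standard building blocks that match the description: inverse-variance weighting of two estimates, and a heteroscedastic Gaussian negative log-likelihood. The function names and the scalar setting are assumptions made for illustration; the paper's learned module operates on feature vectors and may differ substantially.

```python
import math


def inverse_variance_fuse(estimates, variances):
    """Fuse scalar estimates, weighting each by 1 / variance (assumed form)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    # The fused variance is never larger than the smallest input variance.
    return fused, 1.0 / total


def gaussian_nll(prediction, target, log_variance):
    """Gaussian negative log-likelihood with a predicted log-variance.

    The additive constant 0.5 * log(2 * pi) is dropped. A large predicted
    variance down-weights the squared error but pays a log-variance penalty.
    """
    squared_error = (prediction - target) ** 2
    return 0.5 * math.exp(-log_variance) * squared_error + 0.5 * log_variance


if __name__ == "__main__":
    # Hypothetical readings: a confident IMU estimate and a noisy visual one.
    print(inverse_variance_fuse([0.0, 10.0], [1.0, 9.0]))
    print(gaussian_nll(2.0, 0.0, 0.0))
```

In this sketch the less certain sensor still contributes, but only in proportion to its confidence, which is the behavior the abstract attributes to the adaptive fusion module.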
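[128] LadderMan: Learning Humanoid Perceptive Ladder Climbing cs.RO | cs.AI | cs.CV | cs.LG
Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan
TL;DR: This paper presents LadderMan, a system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under the resulting constraints. It uses a two-stage learning pipeline: hybrid motion tracking learns multiple climbing experts from a single reference motion, and hybrid imitation and reinforcement learning distills them into a unified depth-based visuomotor climbing policy. Vision foundation models bridge the sim-to-real gap in depth perception, and a separate manipulation policy trained on top of the climbing policy enables stable on-ladder manipulation.
Details
Motivation: Humanoid robots hold great promise in human-centered environments, but ladder climbing is extremely challenging because of sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. The paper aims to enable humanoid robots to robustly climb diverse ladders and perform manipulation under these constraints.
Result: Experiments show that LadderMan climbs robustly across ladders of varied geometry, transfers zero-shot to real hardware, and supports a range of manipulation tasks under challenging ladder constraints. In real-world deployment, the system generalizes well and remains stable.
Insight: The contributions are: 1) a scalable two-stage learning pipeline that combines hybrid motion tracking with expert distillation to learn complex climbing behaviors efficiently; 2) the use of vision foundation models to bridge the sim-to-real gap in depth perception, improving transferability; 3) a dual-agent formulation for training a separate manipulation policy, which enables stable on-ladder manipulation and extends what the robot can do in constrained environments.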
Abstract: Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io .
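[129] AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding cs.RO | cs.CV | cs.MM
Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping
TL;DR: This paper proposes AffordanceVLA, a unified vision-language-action framework that introduces structured affordance forecasting as a task-oriented intermediate representation to bridge the mismatch between the semantic spaces of vision-language models and embodied control policies. The framework models manipulation priors through three progressive components (Which2Act, Where2Act, How2Act) and is trained with a Mixture-of-Transformer (MoT) architecture with specialized experts and a three-stage training strategy.
Details
Motivation: When existing vision-language-action models draw on the world knowledge of pretrained vision-language models, a structural mismatch between the VLM semantic space and embodied control policies hinders learning of precise perception-action mappings.
Result: In extensive simulation and real-world experiments, AffordanceVLA achieves strong performance across diverse manipulation scenarios.
Insight: The key idea is to use structured affordance forecasting as an intermediate representation that bridges perception and action, with three complementary components that progressively model manipulation priors. Viewed objectively, the automated data augmentation pipeline offers a practical answer to the scarcity of dense affordance labels in robotic datasets.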
Abstract: Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception–action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception–action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
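[130] ActiveMimic: Egocentric Video Pretraining with Active Perception cs.RO | cs.CV
Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu
TL;DR: ActiveMimic is a robot pretraining framework that recovers synchronized camera and wrist trajectories from monocular egocentric RGB video and models camera motion as a viewpoint action, so that active perception and manipulation skills are learned jointly. It aims to close the pretraining performance gap between human video and robot data.
Details
Motivation: Models pretrained on egocentric human video consistently underperform those pretrained on robot data. The authors attribute the gap to a missing signal: humans actively reposition their viewpoint during manipulation (active perception), yet standard pipelines treat the resulting camera motion as noise.
Result: In real-world experiments on tasks with varying active-perception demands, ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art (SOTA) models pretrained on robot data.
Insight: The core idea is to model camera motion in egocentric video explicitly as a viewpoint action and to recover trajectories from the video for joint learning. Analysis indicates that active perception capability originates from human video pretraining rather than robot-specific fine-tuning, which supports active perception as the key to using egocentric human video for robot pretraining.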
Abstract: Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.
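[131] RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning cs.RO | cs.CV | cs.DC
Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki
TL;DR: This paper proposes RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that targets the computational bottleneck of classical Farthest Point Sampling at the high data rates of modern 3D sensors. It prunes redundant computation using a spherical voxel index and geometric bounds, and provides a memory-efficient GPU implementation (RadiusFPS-G) that is substantially faster while preserving sampling quality.
Details
Motivation: Farthest Point Sampling is a key downsampling operator in robotic perception, but its high time complexity is mismatched with the million-point-per-second rates of 3D sensors, making it a major latency bottleneck in real-time robotic systems.
Result: On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G achieves up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS while using roughly half its GPU memory, with comparable segmentation accuracy. Coupled with the FastPoint sampler, the pipeline achieves the fastest end-to-end inference among all evaluated configurations.
Insight: The approach uses spherical voxels for spatial indexing and geometric pruning while preserving the standard FPS update rule, and introduces memory-coalesced GPU kernels that fuse voxel selection, pruning, and distance update to reduce global-memory traffic, making high-quality sampling practical for latency- and memory-constrained robotic vision.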
Abstract: Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the most well-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the large time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest End-to-End inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
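For readers unfamiliar with the baseline, the classical FPS update rule that RadiusFPS is said to preserve can be sketched as follows. This is the plain O(N * M) textbook algorithm, not RadiusFPS itself: it contains no spherical voxels and no pruning. The function name, the fixed start index, and the lowest-index tie-breaking rule are assumptions chosen for the example.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Classical farthest point sampling over a list of coordinate tuples.

    Returns the indices of the selected points in selection order.
    Ties are broken in favor of the lowest index (an assumed policy).
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start_index]
    # min_sq[i] is the squared distance from point i to its nearest selected point.
    min_sq = [squared_distance(p, points[start_index]) for p in points]

    while len(selected) < num_samples:
        # Pick the point that is farthest from everything selected so far.
        next_index = max(range(len(points)), key=min_sq.__getitem__)
        selected.append(next_index)
        # Distance update: this full pass over all points on every iteration
        # is the work that pruning schemes try to avoid.
        for i, p in enumerate(points):
            d = squared_distance(p, points[next_index])
            if d < min_sq[i]:
                min_sq[i] = d
    return selected


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
    print(farthest_point_sampling(cloud, 3))
```

The inner loop touches every point once per selected sample, which is why the cost grows quickly at sensor-scale point counts.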
cs.IR
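[132] OneReason Technical Report cs.IR | cs.AI | cs.CL
OneRec Team, Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang
TL;DR: This paper proposes OneReason, which addresses the difficulty of activating reasoning in generative recommendation models such as the OneRec family. Borrowing the 'think before answer' paradigm from LLMs and introducing two key factors, perception and cognition, it designs a training framework spanning pre-training, supervised fine-tuning, and reinforcement learning to strengthen reasoning in recommendation.
Details
Motivation: Generative recommendation models benefit from scaling, but their reasoning ability is hard to activate because meaningful Chain-of-Thought (CoT) sequences cannot be built from itemic tokens alone. Inspired by the reasoning paradigm in LLMs, the authors explored reasoning in recommendation but found that the thinking mode showed no advantage over the non-thinking mode, and therefore argue that perception and cognition must both be strengthened.
Result: The abstract reports no quantitative results or benchmark comparisons; it presents the full OneReason training framework, covering pre-training, SFT, and RL, which is intended to improve reasoning performance on recommendation tasks.
Insight: The key idea is to decompose reasoning in recommendation into perception (grounding itemic tokens in language semantics) and cognition (reorganizing a user's behavior sequence into coherent interest points), with a corresponding three-stage training framework. This offers a new route to activating complex reasoning in generative recommendation models.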
Abstract: Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user’s behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.
cs.MM
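[133] GS-NFS: Bandwidth-adaptive Streaming of Dynamic Gaussian Splats and Point Clouds cs.MM | cs.CV | cs.GR | cs.NI
Rajrup Ghosh, Haodong Wang, Haoran Hong, Eduardo Pavez, Amartya Chaudhuri
TL;DR: GS-NFS is a bandwidth-adaptive system for streaming dynamic 3D Gaussian Splatting (3DGS) and point clouds. It develops novel GPU parallelizations that substantially accelerate compression and decompression of dynamic 3DGS frames, enabling encoding and decoding at full frame rate.
Details
Motivation: Dynamic 3DGS can represent complex 3D scenes with high fidelity as a 3D video streaming technology, but its frames are far larger than 2D video frames. Existing compression methods are slow, in part because their techniques are hard to accelerate efficiently, so a fast, accelerable compression scheme is needed.
Result: GS-NFS encodes and decodes a frame 1-2 orders of magnitude faster than the state of the art (SOTA) while offering competitive compression performance and rendering quality.
Insight: The contribution is a set of GPU-based parallel algorithms that efficiently encode the positions and attributes of Gaussians, achieving real-time performance. Viewed objectively, offloading the compute-intensive compression work to the GPU is the key step toward making dynamic 3DGS streaming practical.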
Abstract: Dynamic 3D Gaussian Splatting (3DGS) holds great promise as a 3D video streaming technology since it can represent complex 3D scenes with high fidelity. In this approach, every frame in a 3D video represents the environment as a collection of Gaussians with position and other attributes such as scale, rotation, opacity, and color. Frames capture fine details, permit views from any arbitrary perspective, but are an order of magnitude, or more, larger than 2D video frames. A line of recent work has explored how to compress dynamic 3DGS frames, but these approaches are often slow, in part because their compression techniques are not amenable to efficient acceleration. GS-NFS accelerates dynamic 3DGS compression and decompression on a GPU, to the point where it can encode and decode at full frame rate. It achieves this by developing novel GPU-based parallelizations of existing algorithms for encoding both positions and attributes of Gaussians. As a result, it is 1-2 orders of magnitude faster than the state-of-the-art in encoding and decoding a frame, while offering competitive compression performance and rendering quality.
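[134] UNIVID: Unified Vision-Language Model for Video Moderation cs.MM | cs.AI | cs.CL
Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng
TL;DR: This paper presents UNIVID, a unified vision-language model for video content moderation. It replaces traditional fragmented black-box classifiers by generating interpretable, policy-aware captions as an intermediate representation, addressing the dual challenge of fine-grained multimodal reasoning and interpretable outputs in large-scale video moderation.
Details
Motivation: Global-scale video moderation faces two challenges: it needs fine-grained multimodal reasoning, and its outputs must be interpretable to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are hard to maintain and lack transparency.
Result: Integrating UNIVID as the core captioner in a novel end-to-end video moderation system yields relative reductions of 42.7% in violation leakage and 37.0% in overkill rate. Replacing over 1,000 policy-specific models with a single UNIVID backbone also frees substantial compute and reduces engineering maintenance overhead.
Insight: The main contribution is a VLM that generates interpretable, policy-aware captions as an intermediate representation, supporting human-verifiable decisions and multi-task reuse. The technical key is a specialized training data recipe that combines expert-refined labels with synthetic data to align the model with safety guidelines, avoiding the safety-guardrail refusals and weak fine-grained policy alignment common in existing open-source and commercial VLMs. According to the authors, this is one of the first reports of a high-efficiency captioning VLM supporting industrial-scale moderation and cross-functional business.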
Abstract: Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycled extensive computation resources while reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.
cs.LG
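[135] Flash-WAM: Modality-Aware Distillation for World Action Models cs.LG | cs.CV | cs.RO
Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen
TL;DR: This paper proposes Flash-WAM, a modality-aware step-distillation framework for world action models (WAMs). WAMs jointly generate future video and robot actions through iterative diffusion, but inference is slow. Existing single-modality distillation methods cannot handle the asymmetry between the noise distributions of the video and action streams. Flash-WAM chooses, for each modality, a consistency-function parametrization matched to its noise regime, achieving single-step inference with a 23x speedup while preserving task success.
Details
Motivation: World action models (WAMs) perform strongly but need tens of denoising steps at inference, which precludes real-time control. Existing step-distillation methods break down for joint video-action generation because the two modalities use different noise schedules, producing asymmetric noise distributions that single-modality distillation cannot accommodate.
Result: On RoboTwin 2.0, Flash-WAM reduces per-chunk inference latency from 8.1 seconds to 348 ms on an NVIDIA L40S, a 23x speedup. It preserves task success on simulation benchmarks (RoboTwin 2.0: 85.5%, LIBERO: 95.7%) and reaches 60% average performance on a real Unitree G1 humanoid robot, whereas naive consistency distillation drops to 24% at the same step budget.
Insight: The core contribution is the modality-aware distillation framework: each modality gets a consistency-function parametrization matched to its noise regime (variance-preserving for the video stream, linear-gradient-scaling for the action stream), which resolves the noise-distribution asymmetry in joint generation and enables efficient single-step inference. This offers a new approach to distilling multimodal diffusion models.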
Abstract: World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream’s low-noise regime, paired with a variance-preserving parametrization for the video stream’s high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on NVIDIA L40S, a 23x speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% RoboTwin 2.0, 95.7% LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
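[136] What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning cs.LG | cs.AI | cs.CV | cs.RO
Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde
TL;DR: This paper proposes A4D, which maps visual observations into a shared latent space organized around function (affordances) rather than appearance, enabling robot planning based on what objects can do rather than how they look. It introduces an affordance discovery mechanism to handle unseen scenarios and validates the approach on several planning tasks.
Details
Motivation: Existing robot planning systems rely on appearance-based reasoning and struggle to capture task-relevant object functionalities (such as being "movable"), which makes it hard to generalize to novel robot-object interactions.
Result: Across several planning tasks involving diverse and unseen affordances, A4D reaches 94% inference accuracy on existing affordances, more than 15 percentage points above state-of-the-art approaches; raises new-affordance inference accuracy from 70% to over 90% using less than 10% of the original training data; and runs inference 100x faster.
Insight: The core contribution is a structured latent space centered on affordances instead of appearance, together with a mechanism that uses proximity in that space to quantify uncertainty and trigger affordance discovery, improving the robot's understanding of object function and the generalization of its planning.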
Abstract: Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a “cart” based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is “movable”), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., “movable”). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.
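[137] Less is MoE: Trimming Experts in Domain-Specialist Language Models cs.LG | cs.CL
Haoze He, Xinkai Zou, Xuan Jiang, Xingyuan Ding, Ao Qu
TL;DR: This paper proposes Fisher-MoE, a method for compressing Mixture-of-Experts (MoE) models. It works at the granularity of sparse FFN intermediate dimensions, ranking them by Fisher importance and removing dimensions within each FFN according to that ranking, which reduces parameter storage and improves inference throughput while preserving the model's core capabilities.
Details
Motivation: MoE models achieve strong performance through conditional computation, but their large parameter footprint makes deployment difficult. Prior MoE compression methods fail catastrophically on general-purpose benchmarks beyond commonsense reasoning, so a finer-grained compression strategy is needed.
Result: In Qwen1.5-MoE, removing as few as 12 of the 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. At a 50% MoE compression ratio, Fisher-MoE preserves model capability while reducing weight memory by about 45% and improving inference throughput by 21%.
Insight: The contribution is to refine the granularity of compression from whole experts to FFN intermediate dimensions and to show that Fisher importance is an effective metric for identifying the critical ones. The findings suggest that in MoE models key capabilities are concentrated in a sparse set of intermediate dimensions rather than in whole experts, which gives a new perspective on efficient model compression.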
Abstract: Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance which outperforms activation-, router-score-, and magnitude-based alternatives, and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.
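The abstract does not define how Fisher importance is computed per intermediate dimension, so the sketch below assumes the common empirical diagonal Fisher approximation: the mean squared per-sample gradient of the loss with respect to each dimension. The function names, the toy gradients, and the rule of pruning the lowest-scoring fraction are all illustrative assumptions, not the paper's procedure.

```python
def fisher_importance(per_sample_grads):
    """Empirical diagonal Fisher score per dimension (assumed definition).

    per_sample_grads[s][d] is the gradient of the loss on sample s with
    respect to intermediate dimension d. The score is the mean of squares.
    """
    num_samples = len(per_sample_grads)
    num_dims = len(per_sample_grads[0])
    return [
        sum(grad[d] ** 2 for grad in per_sample_grads) / num_samples
        for d in range(num_dims)
    ]


def lowest_importance_dims(scores, ratio):
    """Return the indices of the lowest-scoring fraction of dimensions."""
    count = int(len(scores) * ratio)
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return sorted(order[:count])


if __name__ == "__main__":
    # Hypothetical gradients for two samples over four dimensions.
    grads = [[1.0, 0.0, 2.0, 0.5], [1.0, 0.0, -2.0, 0.5]]
    scores = fisher_importance(grads)
    print(scores)
    print(lowest_importance_dims(scores, 0.5))
```

Squaring before averaging matters here: dimension 2 has gradients that cancel in the mean but still receives the highest score.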
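[138] In-Context Multiple Instance Learning cs.LG | cs.AI | cs.CV
Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller
TL;DR: This paper proposes an in-context learning approach to Multiple Instance Learning (MIL): a Perceiver-style architecture is pretrained on synthetic data so that the model can solve new tasks from a handful of labeled bags, with no gradient updates at inference time.
Details
Motivation: Existing MIL algorithms struggle in the low-label regime typical of real applications: flexible models overfit, while rigid ones fail to adapt to the task. A method that adapts quickly to new tasks from few examples is needed.
Result: Across 12 MIL benchmarks, the method achieves the best average performance, outperforming supervised baselines that require task-specific training.
Insight: The contribution is bringing in-context learning to MIL, with synthetic-data pretraining enabling fast few-shot adaptation. Viewed objectively, mixing different synthetic data generators to capture complementary inductive biases is the key to the improved generalization.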
Abstract: Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.
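[139] Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation cs.LG | cs.CL
Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham
TL;DR: This paper proposes Compress-Distill, which compresses, post hoc, the long chain-of-thought traces generated by reasoning models (the teachers Qwen3.5-397B-A17B and gpt-oss-120B) before knowledge distillation. Compressed traces reduce training tokens to 12-30% of the raw count, speed up training by 2.0-7.6x, and markedly shorten inference outputs, but raw traces still give the highest downstream accuracy.
Details
Motivation: Long reasoning traces are costly to distill and lead to verbose student outputs; the paper aims to make distillation more efficient by compressing the traces.
Result: Across a 48-run main grid plus seven truncation ablations, compressed traces substantially improve training and inference efficiency, but raw traces retain the highest downstream accuracy at every scale and for both teachers; students trained on compressed traces retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency.
Insight: The contribution is a systematic study of how reasoning-trace compression affects the accuracy-efficiency trade-off in knowledge distillation, showing that model-driven compression usually beats or matches naive truncation, especially for smaller students. The result is a controllable accuracy-efficiency trade-off for distillation, not a free improvement.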
Abstract: Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.
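[140] MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following cs.LG | cs.AI | cs.CL
Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti
TL;DR: This paper proposes MDP-GRPO to address the instability of standard Group Relative Policy Optimization (GRPO) on multi-constraint instruction following, where rewards are discrete and low-dispersion. It stabilizes learning through multi-temperature sampling, dual-anchor advantages, prospect-theoretic reward shaping, and asymmetric KL regularization.
Details
Motivation: Standard GRPO becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous, which limits its use for multi-constraint instruction following. The paper targets three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse.
Result: Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. It also converges stably with small group sizes while preserving general capabilities on MMLU and ARC.
Insight: The contribution is to identify and formalize three instability pathologies of GRPO under this reward distribution and to propose a combined stabilization scheme (MDP-GRPO). Prospect-theoretic reward shaping, which bounds updates and penalizes violations, is a notable way of bringing behavioral economics into reinforcement learning. The dual-anchor advantage design addresses mean-centering blindness.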
Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky’s theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
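Two of the pathologies named above can be reproduced with a few lines of arithmetic. The sketch below assumes the commonly used form of the GRPO group advantage, (reward - group mean) / (group standard deviation + eps), with a population standard deviation and eps = 1e-6; implementations differ on both choices, and nothing here is MDP-GRPO itself.

```python
import statistics


def group_zscore_advantages(rewards, eps=1e-6):
    """Z-score group normalization of rewards (assumed standard GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Zero-variance collapse: a homogeneous group yields all-zero advantages,
    # so the group contributes no gradient even if every response failed.
    print(group_zscore_advantages([0.0, 0.0, 0.0, 0.0]))

    # Low-variance amplification: a 0.01 reward gap is normalized to almost
    # the same advantages as a 1.0 reward gap.
    print(group_zscore_advantages([1.0, 1.0, 1.0, 0.99]))
    print(group_zscore_advantages([1.0, 1.0, 1.0, 0.0]))
```

Because the z-score is scale-invariant, the normalization alone cannot distinguish a trivial within-group difference from a large one, which is the behavior the abstract attributes to discrete, low-dispersion rewards.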
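[141] On Advantage Estimates for Max@K Policy Gradients cs.LG | cs.AI | cs.CL
Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani
TL;DR: This paper studies policy-gradient estimation for max@K inference-time objectives (such as pass@K) in reinforcement learning, noting that existing methods use different advantage signals, baselines, and normalizations whose relationships are unclear. Analyzing baseline design and advantage centering, the authors start from the advantage estimator of a leading method, show that it is policy-gradient unbiased but not centered, and introduce a Leave-Two-Out (L2O) baseline that preserves unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training.
Details
Motivation: Existing policy-gradient estimators that directly optimize inference-time objectives such as pass@K and max@K use different signals, baselines, and normalizations, so their relationships are unclear, which hinders understanding and improving these methods.
Result: Empirically, the proposed L2O baseline reduces gradient variance and outperforms non-centered alternatives.
Insight: Through baseline design and advantage centering, the paper derives the canonical finite-batch advantage for the max@K objective, giving a unified view of existing advantage estimators, and introduces the Leave-Two-Out baseline, which preserves unbiasedness while exactly centering batch advantages. The resulting method, MaxPO, is computationally efficient and easy to integrate into LLM post-training.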
Abstract: Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
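As background for the objectives named in the abstract, the value of max@K can be estimated without bias from n sampled rewards by averaging the maximum over all size-K subsets, which has a closed form once the rewards are sorted. The sketch below shows that estimator and its binary-reward special case, the widely used pass@K formula. It estimates the objective only; it is not the paper's advantage estimator, the Leave-Two-Out baseline, or MaxPO, and the function names are assumptions.

```python
from math import comb


def max_at_k(rewards, k):
    """Average of max(subset) over all size-k subsets of the sampled rewards."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(rewards)")
    ordered = sorted(rewards)
    # The element at sorted position i (0-based) is the maximum of exactly
    # comb(i, k - 1) subsets: it plus k - 1 elements from the i before it.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)


def pass_at_k(n, c, k):
    """pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(max_at_k([0, 0, 1, 1], 2))
    print(pass_at_k(4, 2, 2))
    print(max_at_k([1.0, 2.0, 3.0], 2))
```

With 0/1 rewards the two functions agree, which is one way to see pass@K as a special case of max@K.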
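[142] Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance cs.LG | cs.AI | cs.CL
Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion
TL;DR: This paper proposes Alternating Token-Weighted Unlearning (ATWU), a lightweight framework for improving unlearning in large language models (LLMs). It jointly optimizes model parameters and token weights, using conflict with the retain objective to learn, without supervision, how important each token is to the forgetting task, so that targeted knowledge is removed more precisely. On the TOFU and RWKU benchmarks, ATWU achieves state-of-the-art forget-retain trade-offs, and the learned scores align closely with ground-truth forget-specific spans.
Details
Motivation: Existing unlearning methods for autoregressive language models either ignore the heterogeneous importance of tokens within a forget sample or cannot estimate it accurately. The authors argue that a token's forget-specificity can be characterized by whether minimizing the forget loss on that token conflicts with retain optimality.
Result: On TOFU and RWKU, ATWU achieves state-of-the-art (SOTA) forget-retain trade-offs, outperforming sample-level methods, probability-based token-weighting heuristics, and auxiliary-model-based approaches. The learned scores align substantially better with ground-truth forget-specific spans.
Insight: The contribution is proposing retain conflict as an effective criterion for identifying what a language model should forget, and building on it a lightweight joint-optimization framework (ATWU) that learns token forget-specificity directly from the model's hidden states without external token-level supervision. This offers a new way to learn fine-grained forgetting signals without supervision.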
Abstract: Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token’s relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.
cs.GR
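[143] The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show cs.GR | cs.AI | cs.CV | cs.LG
Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou
TL;DR: This paper studies whether video diffusion models internally encode physical structure rather than merely reproducing motion patterns seen during training. By approximately inverting the deterministic sampling process to obtain the model's intermediate states and attention maps along the denoising trajectory, it finds that physical plausibility is linearly decodable from diffusion transformer states, with about 81.27% average accuracy, outperforming dedicated representation-learning baselines.
Details
Motivation: To determine whether modern video diffusion models encode physical structure or only imitate motion patterns in the training data, as a way of assessing their potential as world simulators.
Result: On the IntPhys and InfLevel benchmarks, physical plausibility is linearly decodable from diffusion transformer states with around 81.27% average accuracy, exceeding dedicated representation-learning baselines such as V-JEPA and VideoMAE.
Insight: Physically meaningful representations can emerge as a byproduct of generative denoising, even though the model was not trained with a self-supervised predictive objective. This suggests that diffusion models may implicitly learn physical knowledge, offering a new perspective for model interpretation and application.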
Abstract: Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model’s intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.