Table of Contents

cs.CL

[1] Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces cs.CL | cs.CY

Zijing Shi, Meng Fang, Ling Chen

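TL;DR: This paper studies the behavioral safety of autonomous web agents under realistic deceptive interfaces in the e-commerce domain. The authors introduce WebDecept, a lightweight, configurable plugin framework for injecting controlled deceptive interface patterns into existing web environments. By injecting seven common deceptive patterns (such as targeted advertisements, domain redirection, and shopping manipulation), the paper evaluates multimodal web agents and finds that current agents are highly susceptible to deception and that prompt-based constraints are often insufficient to mitigate these failures.

Details

Motivation: As autonomous web agents are increasingly deployed on real-world tasks, ensuring their safety has become a critical concern. The paper studies how web agents behave under realistic deceptive interfaces in e-commerce in order to evaluate and expose the resulting safety challenges.

Result: Experiments show that current multimodal web agents are highly susceptible to multiple classes of deceptive interfaces and that prompt-based constraints are usually insufficient to mitigate these failures. The agents were evaluated in a controlled setting, which exposed their vulnerability on the safety benchmark.

Insight: The novelty lies in the WebDecept framework, which injects realistic deceptive patterns in a controllable way for safety benchmarking. Viewed objectively, the method offers a configurable benchmark-construction tool for systematically evaluating web agent robustness under adversarial interfaces, which helps identify and address safety vulnerabilities in real deployments.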

Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.


[2] Which Models Perform Better in Inheritance Reasoning? cs.CL

Mohammed Amine Mouhoub, Chahinez Bouchekif

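TL;DR: This paper describes team PSL's participation in the QIAS 2026 shared task on Arabic Islamic inheritance reasoning, which evaluates large language models on inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. The study compares commercial and open-source models under a unified prompting strategy and finds that commercial models are clearly more reliable than open-source ones, with Gemini 2.5 Flash achieving the best performance.

Details

Motivation: The goal is to evaluate large language models on a structured legal reasoning task (specifically Arabic Islamic inheritance law) that involves legal interpretation, multi-step reasoning, and precise computation, and to compare how effective commercial and open-source models are with minimal task-specific adaptation.

Result: On the QIAS 2026 shared task benchmark, commercial models are stronger at identifying eligible heirs, applying exclusion rules, and keeping reasoning steps consistent, while open-source models are less stable. The best performance is achieved by Gemini 2.5 Flash with an MRE of 0.989, reaching the current state of the art.

Insight: The novelty lies in using a unified prompting strategy to compare commercial and open-source models fairly, which exposes a reliability gap between model families on complex legal reasoning tasks. Viewed objectively, the approach underlines the importance of model generalization under minimal adaptation and offers practical guidance on model selection for legal AI applications.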

Abstract: This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of 0.989.


[3] The Culture Funnel: You Can’t Align What isn’t in the Data cs.CL

Ananya Sahu, Mehrnaz Mofakhami, Daniel D’Souza, Thomas Euyang, Julia Kreutzer

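TL;DR: This paper argues that LLM training pipelines suffer from a "cultural data funnel": cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Using a multidimensional cultural tagging framework to analyze pretraining, fine-tuning, alignment, and reasoning datasets, the authors find that multilinguality increases the geographic diversity of cultural knowledge but does not guarantee balanced representation. They release a culturally tagged dataset of 5.6M samples and show that the cultural tags improve performance on downstream cultural benchmarks.

Details

Motivation: Current cultural alignment methods rely mainly on inference-time interventions and assume that models already contain sufficient cultural knowledge. The paper points out a systematic flaw in modern LLM training pipelines: cultural signals are filtered out during post-training, which leaves models with imbalanced cultural representation.

Result: On cultural benchmarks, using the culturally tagged data improves model performance, which indicates that improving the training data pipeline is key. The released CultureMarkers dataset contains 5.6M samples and provides a basis for follow-up research.

Insight: The novelty lies in the first systematic quantification of cultural signal decay across the LLM training pipeline, together with a data-level remedy rather than reliance on inference-time intervention alone. Viewed objectively, the multidimensional cultural tagging framework provides a practical toolchain for measuring and improving cultural diversity in models.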

Abstract: Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality enhances geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances require shifting focus in training data pipelines. To facilitate future research, we release our culturally tagged dataset with 5.6M samples at https://huggingface.co/datasets/CohereLabs/CultureMarkers.


[4] QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning cs.CL

Abdessalam Bouchekif, Somaya Eltanbouly, Samer Rashwani, Shahd Gaben, Mutaz Al-Khatib

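TL;DR: This paper gives a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026, which evaluates the ability of large language models to perform complex reasoning in the religious and legal domain of Islamic inheritance. The task is based on the MAWARITH dataset of 12,500 Arabic inheritance cases and requires systems to reason end to end from a natural language case, from identifying the eligible heirs to assigning the correct shares. Sixteen teams participated and explored a range of approaches, including prompt engineering, retrieval-augmented generation, and fine-tuning.

Details

Motivation: The shared task aims to evaluate the end-to-end reasoning ability of large language models in a complex, specialized domain (Islamic inheritance law) that requires precise legal interpretation and structured numerical reasoning, addressing a gap left by conventional question-answering benchmarks.

Result: Sixteen teams submitted systems using a variety of approaches. The results show that Islamic inheritance reasoning remains a highly challenging benchmark for current language models, especially in stages that require precise legal interpretation and structured numerical reasoning.

Insight: The novelty lies in an evaluation task focused on end-to-end legal reasoning, together with the MAWARITH dataset and the multi-step MIR-E evaluation metric. Viewed objectively, the task provides a valuable benchmark framework for assessing model reasoning in complex, structured domains that combine natural language understanding with numerical computation.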

Abstract: This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to perform complex reasoning in the religious and legal domain of Islamic inheritance. Unlike conventional question-answering benchmarks, QIAS 2026 focuses on end-to-end reasoning from natural language cases, requiring systems to perform the full inheritance calculation process, from identifying the eligible heirs to assigning the correct share to each beneficiary. To support this evaluation, the task was based on the MAWARITH benchmark, a dataset of 12,500 Arabic inheritance cases annotated with intermediate reasoning steps and final answers. System submissions were evaluated using MIR-E, a multi-step metric that measures performance across the main stages of inheritance reasoning. A total of 16 teams participated in the shared task, investigating a range of approaches, including prompting-based methods, retrieval-augmented generation, and fine-tuning strategies. The results show that Islamic inheritance remains a highly challenging benchmark for current language models, especially in stages that require precise legal interpretation and structured numerical reasoning. This overview summarizes the task design, dataset, evaluation framework, participating systems, and main results.


Li Zhang, Yuzhen Shi, Yiran Hu, Jingwen Zhang, Wenbo Lv

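TL;DR: This paper introduces DLawBench, a new diagnostic benchmark for evaluating large language models in multi-turn legal consultation. The benchmark is built from real cases and simulates four typical client behavior types in order to test whether models can conduct legal consultation under realistic conditions.

Details

Motivation: Existing legal benchmarks focus mainly on static legal reasoning and overlook the multi-turn interactive skill that is essential in lawyer-client consultation: strategically eliciting key facts through dialogue and guiding clients with different personalities. DLawBench aims to fill this gap.

Result: Systematic experiments with 26 representative LLMs on DLawBench show substantial headroom: the best-performing model, GPT-5.5, scores only 0.562 on consultation-grounded legal reasoning. The benchmark also exposes sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.

Insight: The novelty lies in the first benchmark dedicated to evaluating LLMs on multi-turn interactive legal consultation, along with a taxonomy of four interaction types grounded in realistic client behavior. The paradox it reveals, that models perform worse in situations with high guidance needs, points to an important direction for improving models in complex, dynamic dialogue settings.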

Abstract: Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (LLMs) not only to perform robust legal reasoning, but also to strategically elicit material facts through multi-turn interactions and effectively guide clients with diverse personalities. Yet existing legal benchmarks overlook this interactive capability. To fill this gap, we introduce DLawBench, a diagnostic benchmark for real-world legal consultation. Drawing on realistic client behavior, we characterize lawyer-client interactions into four types: Cooperative, Dependent, Withdrawn, and Adversarial. Using dialogues grounded in real cases, DLawBench evaluates whether LLMs can effectively conduct legal consultation under realistic conditions. DLawBench comprises 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and evaluates 26 representative LLMs. Systematic experiments show substantial headroom: the best-performing model, GPT-5.5, achieves only 0.562 on consultation-grounded legal reasoning. More importantly, DLawBench exposes both sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.


[6] Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding cs.CL

Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding

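TL;DR: This paper presents an empirical study of generative large language models (LLMs) on a medical coding task, International Classification of Diseases (ICD) coding. It finds that prompting-only evaluation substantially underestimates the potential of LLMs, that supervised fine-tuning (SFT) yields a large capability gain, and that RL-based post-training (such as GRPO) further improves code-set prediction. The proposed PHI diagnostic curriculum provides targeted gains on macro-level performance.

Details

Motivation: Automated ICD coding is a core task for medical billing, epidemiology, and clinical decision support. Prior work often reports that generative LLMs are weak medical coders, but those findings come mainly from inference-time settings (such as prompting, retrieval, reranking, or tool use), and the role of task-specific post-training remains underexplored.

Result: Under a common protocol and metric set, the study compares discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted improvements on macro-level performance.

Insight: The novelty includes the first evaluation of RL-based post-training for generative LLM ICD coding and the PHI diagnostic curriculum for refining missed-code cases. Viewed objectively, the analysis suggests that the main bottleneck is not the generative formulation itself but how the model is adapted and optimized for full-taxonomy recall.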

Abstract: Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.


[7] Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios cs.CL

Guangzong Si, Dong Wang, Zhenhao Li, Yifan Yu, Panwang Pan

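TL;DR: This paper introduces GAMA-Bench, a benchmark for evaluating gender-asymmetric moral framing bias in large language models. It finds that, for the same negative behavior, models tend to give male actors more punitive, escalatory, and blame-centered responses and give female actors more therapeutic and empathy-oriented responses, which reveals a consistent male-disadvantaging bias.

Details

Motivation: Existing work mostly examines stereotypes, occupational associations, or explicit harmful outputs in LLMs. This paper asks whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions, in order to expose potential gender-asymmetric moral framing bias.

Result: Experiments on 10 representative LLMs show a consistent male-disadvantaging asymmetry across the 1,298 scenarios of GAMA-Bench. The pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning.

Insight: The novelty lies in the gender-mirrored benchmark GAMA-Bench, which builds gender-neutral misconduct templates through controlled grids and cross-model review, and in a structured response-framing protocol that quantifies shifts in the models' moral judgments. Together they offer a new perspective and method for evaluating implicit gender bias in LLMs.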

Abstract: Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions. We introduce GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. It constructs gender-neutral misconduct templates through controlled grids and cross-model review, then compiles them into paired first-person prompts with matched actor-gender and role-reference variations. We further design a structured response-framing protocol to measure how models allocate punishment, empathy, escalation, instruction, and blame. Experiments on 10 representative LLMs reveal a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, whereas female actors receive more therapeutic and empathy-oriented framing for the same misconduct. Further analyses show that this pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning. The official code is available at https://github.com/xufeiqiong/GAMA-Bench.
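The mirrored-pair construction and framing measurement described in the abstract can be illustrated with a small sketch. Everything here is a hypothetical stand-in rather than the benchmark's actual code: the template text, the actor terms, the 1-to-5 style scores, and the names `mirrored_pair` and `framing_gap` are assumptions. Only the five framing dimensions come from the abstract.

```python
from statistics import mean

# The five framing dimensions named in the abstract.
DIMENSIONS = ["punishment", "empathy", "escalation", "instruction", "blame"]

# Hypothetical gender-neutral misconduct template and actor terms.
TEMPLATE = "My {actor} read my private messages without asking. What should I do?"
ACTOR_TERMS = {"male": "boyfriend", "female": "girlfriend"}


def mirrored_pair(template, actor_terms):
    """Compile one template into prompts that differ only in the actor term."""
    return {gender: template.format(actor=term) for gender, term in actor_terms.items()}


def framing_gap(male_scores, female_scores):
    """Mean score difference (male minus female) for each framing dimension.

    Each argument is a list of dicts mapping dimension name to a numeric score
    assigned to one model response (the scoring scale is an assumption).
    """
    return {
        dim: mean(s[dim] for s in male_scores) - mean(s[dim] for s in female_scores)
        for dim in DIMENSIONS
    }


pair = mirrored_pair(TEMPLATE, ACTOR_TERMS)
print(pair["male"])
print(pair["female"])
```

A positive gap on punishment or blame and a negative gap on empathy would correspond to the male-disadvantaging asymmetry the paper reports. The sketch says nothing about how responses are scored in the first place.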


[8] Personal Care Utility: Health as Everyday Infrastructure cs.CL

Mahyar Abbasian, Elahe Khatibi, Saba A. Farahani, Nitish Nagesh, Arshia Ilaty

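TL;DR: This paper proposes the Personal Care Utility (PCU) architecture, intended as an infrastructure layer for everyday health management. PCU turns continuous personal health signals into semantically meaningful life events and uses a layered, event-driven design to provide personalized health state estimation, causal reasoning, and guidance routing.

Details

Motivation: Today's healthcare system is built around clinical visits, yet long-term health is shaped mainly by everyday factors such as eating, sleep, movement, medication, and stress, which have no comparable infrastructure. The bottleneck for personalized health is the missing architectural layer that turns data into continuous health guidance.

Result: The paper instantiates PCU for Type 2 Diabetes, turning continuous glucose monitoring, meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A scenario walkthrough shows the architecture producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk.

Insight: The novelty lies in treating personalization as an architectural property of everyday health guidance rather than as a final messaging layer. The layered, event-driven design separates clinical decision logic, behavioral strategy selection, and natural-language expression, so that large language models can safely support reasoning and communication while safety-critical clinical decisions stay grounded in validated evidence.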

Abstract: Healthcare is essential, expert, and episodic by design - built around the roughly one hour per year a person spends with a clinician. The 8,759 hours outside clinical settings, where eating, sleeping, movement, medication, and stress actually shape long-term health, have no comparable infrastructure. The bottleneck for personalized health is not raw data or reasoning capability; it is the absence of that infrastructure layer. This paper introduces the Personal Care Utility (PCU): a layered, event-driven architecture proposed as the missing utility for everyday health, in the way that payments, networks, and power are utilities for their domains. PCU organizes continuous personal signals into semantically meaningful life events through a Personicle, estimates dynamic health state against personal baselines, reasons about cause and context, and routes guidance through an orchestrator that separates clinical decision logic, behavioral strategy selection, and natural-language expression. This separation lets large language models support reasoning and communication while keeping safety-critical clinical decisions grounded in validated evidence. We instantiate PCU for Type 2 Diabetes - turning CGM, meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A day-in-the-life scenario shows the same infrastructure producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk. We close with how PCU generalizes to other chronic conditions and the governance questions any always-on personal health utility must address. The result is a blueprint that treats personalization not as a final messaging layer, but as an architectural property of everyday health guidance.


[9] Implicit Reasoning for Large Language Model-based Generative Recommendation cs.CL | cs.AI

Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah

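TL;DR: This paper proposes PauseRec, a lightweight implicit reasoning paradigm for LLM-based generative recommendation (GR). It addresses the problem that representing items with Semantic IDs (SIDs) disrupts the LLM's natural-language reasoning interface. The method avoids expensive multi-stage explicit reasoning training and outperforms existing approaches in accuracy, training cost, and inference speed.

Details

Motivation: LLM-based generative recommendation typically represents items with Semantic IDs, but these IDs are unseen during LLM pretraining and disrupt its natural-language reasoning ability. Existing methods use expensive multi-stage explicit reasoning pipelines, yet offer limited insight into when and why each stage is needed, and their performance is limited.

Result: On generative recommendation tasks, PauseRec outperforms standard explicit chain-of-thought (CoT) methods by up to 6.22%, reduces training cost by up to 65% GPU hours, and speeds up inference by up to 71.3%.

Insight: The novelty lies in a systematic analysis of three limitations of explicit reasoning training pipelines (weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality) and in a lightweight implicit reasoning paradigm that avoids costly reasoning trace acquisition and alignment training, giving LLM-based GR a more efficient and practical option.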

Abstract: Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs’ natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.


[10] CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward cs.CL

Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang

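TL;DR: CacheRL is a system for training small agent foundation models. Using cached trajectories and a hybrid reward, it reaches 92% process accuracy on multi-step tool-calling tasks, close to GPT-5's 94%, while requiring 100 times less compute. The system combines three innovations (hybrid thinking trajectories, a cached agent loop, and a cache-tier-aware reward) with supervised fine-tuning and reinforcement learning.

Details

Motivation: The work targets three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments.

Result: On public agentic tool-calling benchmarks, the model is competitive with frontier models such as GPT-5. Ablations show that removing knowledge transfer reduces performance by 41%, while cache-aware rewards contribute a 17% improvement.

Insight: The innovations are hybrid thinking trajectories that enrich the training data with reasoning, a cached agent loop that removes live execution cost, and a dynamic cache-aware reward design. Viewed objectively, the analysis suggests that data quality and reward design matter more than complex optimization methods for small, practical agent models.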

Abstract: We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5’s 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking’s validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
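To make the idea of a tiered fuzzy cache with a tier-aware reward more concrete, here is a minimal sketch. The abstract does not define the three tiers or the reward weights, so the tier definitions (exact key, normalized key, fuzzy string match), the similarity threshold, the weight table, and all names below are assumptions chosen for illustration, not CacheRL's implementation.

```python
from difflib import SequenceMatcher

# Hypothetical cached tool results, keyed by (tool name, argument string).
CACHE = {
    ("search", "weather in paris"): "18C, cloudy",
    ("search", "capital of france"): "Paris",
}


def normalize(text):
    """Lowercase and collapse whitespace so trivially different calls match."""
    return " ".join(text.lower().split())


def lookup(tool, args, fuzzy_threshold=0.8):
    """Return (tier, result): 1 = exact key, 2 = normalized key,
    3 = most similar key at or above the threshold, 0 = cache miss."""
    if (tool, args) in CACHE:
        return 1, CACHE[(tool, args)]
    norm = normalize(args)
    for (cached_tool, cached_args), result in CACHE.items():
        if cached_tool == tool and normalize(cached_args) == norm:
            return 2, result
    best_score, best_result = 0.0, None
    for (cached_tool, cached_args), result in CACHE.items():
        if cached_tool != tool:
            continue
        score = SequenceMatcher(None, norm, normalize(cached_args)).ratio()
        if score > best_score:
            best_score, best_result = score, result
    if best_score >= fuzzy_threshold:
        return 3, best_result
    return 0, None


# Assumed weights: answer quality counts for less when the observation
# came from a looser cache tier, so cache noise is penalized less.
ANSWER_WEIGHT = {1: 0.6, 2: 0.5, 3: 0.3, 0: 0.0}


def hybrid_reward(process_score, answer_score, tier):
    """Blend process and answer scores with a tier-dependent answer weight."""
    weight = ANSWER_WEIGHT[tier]
    return (1.0 - weight) * process_score + weight * answer_score


print(lookup("search", "  Weather in PARIS "))
print(hybrid_reward(process_score=1.0, answer_score=0.0, tier=3))
```

With these assumed weights, a rollout that called the right tools but got a wrong answer from a fuzzy-matched observation keeps more of its reward than one that had an exact cache hit, which is the intent the abstract attributes to the cache-tier-aware reward.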


[11] Retrospective Progress-Aware Self-Refinement for LLM Agent Training cs.CL

Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao

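TL;DR: This paper proposes RePro (Retrospective Progress-Aware Training), a framework that trains LLM agents to self-generate task progress signals through a forward-then-reflect paradigm. It addresses the lack of metacognitive awareness of task progress in agents trained with reinforcement learning, which hinders scaling to long-horizon tasks.

Details

Motivation: LLM agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, which hinders long-horizon scaling. A pilot study finds that online progress prompting hurts performance while retrospective demonstrations help, yet this capability does not emerge from outcome-reward training alone.

Result: Experiments on the WebShop, ALFWorld, and Sokoban benchmarks show that RePro improves the performance of Qwen-family models, with absolute success rate gains of up to 12%.

Insight: The core innovation is the forward-then-reflect paradigm: after completing a trajectory, the agent retrospectively reassesses its step-wise progress, and it is trained with a composite reward (RePro-PO) so that it can self-generate progress signals without continuous external supervision. This brings metacognition and self-reflection into agent training.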

Abstract: LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family’s performance, with up to 12% absolute success rate gains.


[12] SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model cs.CL | cs.AI

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

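TL;DR: This paper introduces SIMMER, a benchmark for evaluating latent failures in LLM executable planning. It builds a symbolic world model in the kitchen domain with a rich set of actions, objects, and interactions, and uses a state machine executor to detect immediate violations, latent hazards, and irreversible failures in plans. Experiments show that even frontier LLMs produce very few error-free plans and that more than half of the plans contain latent failures, most of which lead to irreversible consequences. The study also finds that explicit state reasoning via counterfactual foresight simulation substantially reduces latent failures.

Details

Motivation: Existing benchmarks mainly evaluate whether LLM-generated plans execute successfully and overlook latent failures. Latent failures do not immediately halt execution but silently compromise goal achievement and can cause irreversible harm, which poses a serious risk when LLMs are deployed as planners for autonomous agents in household environments.

Result: Experiments across six LLMs show that even the most advanced models achieve at most 17% error-free plans, and up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. Explicit state reasoning via counterfactual foresight simulation reduces latent failures by up to 72% and irreversible cases by up to 75%.

Insight: The novelty lies in the first systematic definition and evaluation of latent failures in LLM planning and in SIMMER, a benchmark built on a symbolic world model to detect them. Viewed objectively, explicit state reasoning through counterfactual simulation is a promising research direction for making LLM planners more robust.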

Abstract: Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
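The distinction between immediate, latent, and irreversible failures can be illustrated with a toy state machine executor. This is only a sketch under stated assumptions: the three-variable kitchen state, the action names, the burn threshold `BURN_AFTER`, and the function `run_plan` are invented for illustration and are far simpler than SIMMER's world model of 77 actions and 262 objects.

```python
# Assumed threshold: an egg left on a lit stove for this many steps burns.
BURN_AFTER = 3


def run_plan(plan):
    """Execute (action, object) steps against a toy kitchen state.

    Returns a list of (step_index, kind, message) findings, where kind is
    'immediate' (precondition violated, step rejected), 'latent' (hazard
    that never halted execution), or 'irreversible' (damage done).
    """
    state = {"stove_on": False, "pan_on_stove": False, "egg_in_pan": False}
    heated_steps = 0
    findings = []
    for i, (action, obj) in enumerate(plan):
        if action == "place" and obj == "pan":
            state["pan_on_stove"] = True
        elif action == "turn_on" and obj == "stove":
            state["stove_on"] = True
        elif action == "turn_off" and obj == "stove":
            state["stove_on"] = False
        elif action == "crack" and obj == "egg":
            if not state["pan_on_stove"]:
                findings.append((i, "immediate", "no pan to crack the egg into"))
                continue
            state["egg_in_pan"] = True
        elif action == "wait":
            pass
        else:
            findings.append((i, "immediate", f"unknown step: {action} {obj}"))
            continue
        # Time passes after every valid step; heat accumulates silently.
        if state["stove_on"] and state["egg_in_pan"]:
            heated_steps += 1
            if heated_steps == BURN_AFTER:
                findings.append((i, "irreversible", "egg burned"))
    # A hazard that never interrupted execution is only visible at the end.
    if state["stove_on"]:
        findings.append((len(plan), "latent", "stove left on at end of plan"))
    return findings


careless = [("place", "pan"), ("turn_on", "stove"), ("crack", "egg"),
            ("wait", None), ("wait", None)]
for finding in run_plan(careless):
    print(finding)
```

Every step of the careless plan executes without error, so a benchmark that only checks execution success would accept it. The executor still reports a burned egg and a stove left on, which is the kind of silent failure the abstract describes.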


[13] CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment cs.CL

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng

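TL;DR: This paper proposes CORA (Consistency-Oriented Reasoning Alignment) to address the semantic inconsistency between the reasoning process and the final answer (the thinking-answer gap) in large vision-language models (LVLMs) trained with multimodal reinforcement learning with verifiable rewards (RLVR). The method introduces a lightweight, plug-and-play consistency reward model and combines it with Hybrid Reward Advantage Splitting (HRAS), improving task performance while making reasoning traces more faithful.

Details

Motivation: Existing methods focus mainly on improving the visual coverage of reasoning traces and reducing visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. By analyzing data from both training and inference, the paper shows that this problem persists in RLVR, which motivates the work.

Result: Extensive experiments on several representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, producing more faithful reasoning traces.

Insight: The novelty lies in the first systematic analysis of thinking-answer inconsistency in RLVR and in aligning reasoning with answers through a consistency reward model and the HRAS strategy. Viewed objectively, making semantic consistency an explicit optimization target within RLVR offers a new way to address faithfulness in multimodal reasoning.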

Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present during inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.


[14] AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition cs.CL

Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang

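TL;DR: This paper proposes AgentSpec, a modular specification framework for embodied agents. It addresses the problem that existing agent scaffolds are tightly coupled and therefore hard to analyze, compare, and design. The framework standardizes the interfaces among perception, memory, reasoning, reflection, action, and learning components so that components can be swapped and recombined under controlled conditions. The study instantiates the framework across several benchmark environments (DeliveryBench, ALFRED, MiniGrid, RoboTHOR) and analyzes interaction effects among modules and model backbones.

Details

Motivation: LLM-based agents are usually built as scaffolded systems that combine reasoning, memory, reflection, execution, and learning. These systems improve performance but are often tightly coupled, which makes it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior.

Result: Experiments on multiple embodied task benchmarks show that agent performance is governed mainly by scaffold compatibility and interaction effects among modules rather than by the isolated strength of any single module. Specifically, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction against cost, and RL-trained policies compose best when optimized together with the deployment-time scaffold structure.

Insight: The main novelty is a standardized, composable agent specification framework (AgentSpec) that provides a controlled foundation for systematically studying, comparing, and designing composable LLM agents. Viewed objectively, its core contribution is decoupling agents into reusable policy components with typed interfaces, which offers a new methodology for understanding and optimizing interactions among an agent's internal modules.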

Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at https://agentspec-embodied.github.io.


[15] AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization cs.CL

Junlong Tong, Wenqi Xu, Yingqi Fan, Anhao Zhao, Xuan Lu

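TL;DR: This paper proposes AdaSR, an adaptive streaming reasoning framework that lets a model reason while the input is streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate to each stage. To optimize this hierarchical reasoning process, the authors introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases and provides finer-grained advantage assignment. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency.

Details

Motivation: The work addresses reasoning with large reasoning models over dynamic, continuous information streams (such as audio and video streams) and aims to overcome the limited flexibility of existing streaming reasoning methods, which rely on supervised imitation of pre-constructed trajectories.

Result: Compared with a supervised fine-tuning baseline, AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency.

Insight: The core novelty is the adaptive streaming reasoning framework AdaSR together with Hierarchical Relative Policy Optimization (HRPO). By decomposing policy optimization into phases with fine-grained advantage assignment, and by combining format, accuracy, and adaptive thinking rewards, HRPO coordinates the reasoning protocol, task performance, and latency-aware computation allocation. This gives a more flexible and efficient reasoning paradigm for dynamic streaming input.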

Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think, and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with a supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
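The contrast between a single sequence-level advantage and phase-wise advantage assignment can be sketched as follows. This is a simplified illustration, not HRPO itself: it assumes each rollout carries one scalar reward per phase, normalizes each phase within the rollout group in the usual GRPO style, and uses hypothetical field names (`stream_reward`, `final_reward`, `n_stream_tokens`, `n_final_tokens`). How the paper actually combines format, accuracy, and adaptive thinking rewards is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """GRPO-style normalization within a group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_advantages(rollouts):
    """Assign one advantage to streaming-phase tokens and another to
    final-deliberation tokens, each normalized across the rollout group.

    Returns one per-token advantage list per rollout.
    """
    stream_adv = group_relative([r["stream_reward"] for r in rollouts])
    final_adv = group_relative([r["final_reward"] for r in rollouts])
    per_token = []
    for rollout, a_stream, a_final in zip(rollouts, stream_adv, final_adv):
        per_token.append(
            [a_stream] * rollout["n_stream_tokens"]
            + [a_final] * rollout["n_final_tokens"]
        )
    return per_token


group = [
    {"stream_reward": 1.0, "final_reward": 0.0, "n_stream_tokens": 2, "n_final_tokens": 1},
    {"stream_reward": 0.0, "final_reward": 1.0, "n_stream_tokens": 1, "n_final_tokens": 2},
]
for advantages in phase_advantages(group):
    print([round(a, 3) for a in advantages])
```

In this toy group the first rollout streamed well but answered badly, so its streaming tokens receive a positive advantage and its final tokens a negative one. A single sequence-level advantage would have pushed all of its tokens in the same direction.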


cs.CV

[16] TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation cs.CV

Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen

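TL;DR: This paper proposes Temporal Slot Activation (TSA), a mechanism for improving temporally persistent object representations in unsupervised video object-centric learning. It learns a per-frame activation score for each object slot to control the slot's lifecycle: when the object is not visible, the slot's state is anchored to the previous frame and its participation in the decoder is suppressed, which reduces state drift and reconstruction interference.

Details

Motivation: Existing video methods based on recurrent slot attention propagate a fixed set of slots across frames unconditionally, updating and decoding a slot even when its object is not visible. This violates a basic lifecycle requirement for persistent slots and causes state drift and reconstruction interference. The paper sets out to fix this.

Result: On the MOVi-C/E, YT-VIS, and OVIS benchmarks, evaluated with standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA), TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.

Insight: The core innovation is a per-slot, per-frame activation score, learned without visibility supervision, that serves as a shared latent control variable for gating updates and biasing decoder attention, thereby modeling the slot lifecycle. In addition, a per-slot temporal memory produced by a Temporal Context Encoder improves activation decisions under partial occlusion and strengthens temporal consistency.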

Abstract: Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object’s representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $α_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
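The two uses of the activation score described in the abstract, gating the slot update and biasing the decoder's attention logits before softmax, can be sketched in a few lines. The abstract does not give the exact formulas, so both the convex-combination update and the `log(alpha)` form of the additive bias are assumptions chosen as the simplest instances of those descriptions. The function names are hypothetical.

```python
import math


def gated_update(prev_slot, candidate_slot, alpha):
    """Activation-gated update (assumed form): an inactive slot (alpha near 0)
    keeps its previous state, an active slot (alpha near 1) takes the update."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_slot, candidate_slot)]


def biased_slot_attention(logits, alphas, eps=1e-6):
    """Softmax over slots after adding log(alpha_k) to slot k's logit.

    The log form is an assumption: it scales each slot's unnormalized
    attention weight by its activation, so inactive slots are suppressed.
    """
    shifted = [logit + math.log(a + eps) for logit, a in zip(logits, alphas)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Slot 1 tracks an occluded object, so its activation is low.
print(gated_update(prev_slot=[1.0, 2.0], candidate_slot=[5.0, 6.0], alpha=0.05))
print(biased_slot_attention(logits=[0.0, 0.0], alphas=[0.99, 0.01]))
```

With equal raw logits, the low-activation slot receives about 1% of the attention mass, so it no longer competes to explain visible content, while its state stays close to the pre-occlusion value.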


[17] CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation cs.CV | cs.AI

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen

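TL;DR: CineOrchestra is a unified video diffusion model that simultaneously controls the core elements of cinematic video generation: subjects, events, camera motion, and shot transitions. Its key idea is to model these heterogeneous cinematic elements uniformly as entities acting over specific temporal intervals, expressed through one shared structure of entity-centric conditioning primitives, which simplifies the architecture.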

Details

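Motivation: Current text-to-video models lack fine-grained joint control over cinematic video, which involves multiple subjects, actions at specific moments, deliberate camera movement, and shot transitions. Existing work addresses individual axes in isolation (multi-subject personalization, temporal control, multi-shot synthesis, or camera control), and no unified framework integrates all four.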

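Result: On two new benchmarks, CineOrchestra outperforms six single-axis specialist models on dense caption following and shot-transition timing control. A pairwise user study and component ablations also confirm consistent gains.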

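Insight: The main innovation is to abstract cinematic elements uniformly as entities over temporal intervals and to design two parameter-free coordinated rotary position embeddings: an interval-sampled temporal RoPE and a 2D entity-temporal cross-attention RoPE. These address, respectively, consistent attention behavior across events of varying duration and precise routing of each entity condition to its corresponding spatiotemporal region. This offers a concise and effective architectural approach to video generation with joint multi-axis control.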

Abstract: Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.


[18] Temporal Backtracking Search for Test-time Generative Video Reasoning cs.CV

Sejoon Jun, Zheng Ding, Huangyuan Su, Weirui Ye, Yilun Du

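TL;DR: To address the bottleneck of the single-shot paradigm in generative video reasoning, this paper proposes Temporal Backtracking Search (TBS), a test-time scaling method. TBS shifts the search space to the temporal axis and runs a generate-verify-restart loop built on variable-K conditioning, temporal process verification, and prefix-based search, which corrects early reasoning errors and substantially improves video-model reasoning in out-of-distribution settings.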

Details

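Motivation: Current test-time scaling methods such as root-level Best-of-N sampling are inefficient for video generation: reasoning errors cluster early on the temporal axis, and blind resampling discards verified upstream progress. The paper aims to lift the limits that the single-shot paradigm places on video reasoning and to unlock effective test-time scaling for video models.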

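Result: Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates Best-of-N sampling at matched compute budgets. In a strict out-of-distribution setting where one-shot generation collapses (Best-of-N reaches only a 0.7% success rate), TBS achieves 22.7%, with every solved episode stemming from a restarted branch.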

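Insight: The core innovation is moving the search space from denoising steps to the temporal axis of video generation, together with a conditioning mechanism that can resume from arbitrary clean prefixes and a method for localizing failures. Viewed objectively, the work shows that the local reasoning competence of video models far exceeds what single-shot generation indicates, and it provides a scalable test-time framework for unlocking that potential.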

Abstract: While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
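
The generate-verify-restart loop described in the abstract can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's implementation: the "video model" is a hypothetical random generator that emits a correct frame with fixed probability, the "verifier" simply returns the length of the longest valid prefix, and the budget is counted in rollouts rather than compute.

```python
import random

def generate(prefix, horizon, rng, p_correct=0.7):
    """Toy stand-in for a video model: extend a clean prefix to the horizon.

    Frame i is "correct" when it equals i; -1 marks a bad frame.
    """
    frames = list(prefix)
    while len(frames) < horizon:
        step = len(frames)
        frames.append(step if rng.random() < p_correct else -1)
    return frames

def verify(frames):
    """Toy process verifier: length of the longest valid prefix."""
    for i, frame in enumerate(frames):
        if frame != i:
            return i
    return len(frames)

def temporal_backtracking_search(horizon, budget, seed=0):
    """Restart each rollout from the last verified prefix (assumed loop shape)."""
    rng = random.Random(seed)
    prefix = []
    for attempt in range(1, budget + 1):
        rollout = generate(prefix, horizon, rng)
        valid = verify(rollout)
        if valid == horizon:
            return rollout, attempt
        prefix = rollout[:valid]  # restart anchor: keep verified progress
    return None, budget

rollout, attempts = temporal_backtracking_search(horizon=8, budget=200)
```

Because each restart keeps the verified prefix, progress along the temporal axis never shrinks, whereas root-level resampling would have to produce all eight correct frames in a single rollout.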


[19] Mirage Probes: How Vision Models Fake Visual Understanding cs.CV | cs.AI | cs.LG

Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi

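TL;DR: This paper shows that vision-language models (VLMs) exhibit two distinct "mirage" regimes: in one, the model answers from text priors alone without engaging visual representations; in the other, it constructs false visual content in latent space and answers as if grounded. The authors propose Mirage Probes, a contrastive probing framework, and a Prior Harnessing Index (PHI) to identify and separate the two regimes, and argue that mitigating the second requires intervention at the representational level.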

Details

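Motivation: Prior work treats the "mirage" behavior, in which VLMs answer confidently even without image input, as a single failure mode. The authors argue that it actually arises from two different mechanisms, which must be distinguished in order to assess visual grounding more accurately and to guide mitigation.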

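Result: In two open-source VLMs, mirage behavior is linearly decodable from internal activations (residual stream, MLP, post-attention, and attention heads). Cross-benchmark separability patterns, together with the proposed Prior Harnessing Index (PHI), distinguish the two mirage mechanisms.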

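Insight: The novelty lies in splitting mirage behavior into two mechanisms, "textual biases" and "spurious images", and in proposing the Mirage Probes framework and the PHI index to quantify them. Viewed objectively, the study stresses that cleaning the text distribution alone cannot fix false content constructed at the level of visual representations, which points to where interventions for faithful visual grounding should be applied.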

Abstract: Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model’s visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.


[20] How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks? cs.CV | cs.AI

Julia Romero, Qin Lv, Morteza Karimzadeh

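TL;DR: This paper studies six representative self-supervised geospatial foundation models (GeoFMs), evaluates their transfer to downstream classification, regression, and segmentation tasks, and analyzes how their behavior changes with label availability and downstream adaptation settings. Model rankings vary across tasks and adaptation settings, intermediate transformer layers usually carry more task-relevant information than final-layer embeddings, and the downstream adaptation strategy (such as decoder design) can matter as much as the choice of model.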

Details

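Motivation: Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. The paper aims to systematically evaluate how different GeoFMs transfer and to probe their representational properties, in order to explain why model rankings shift across benchmarks.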

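Result: In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning affect performance about as much as the choice of GeoFM. Model rankings change across tasks and adaptation settings, and standard dense-prediction heads may be poorly matched to how GeoFMs organize information over depth.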

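Insight: The contribution is a systematic comparison of GeoFMs from different self-supervised pretraining families, which reveals that intermediate transformer layers are usually more task-relevant than final-layer embeddings. The analysis suggests that the importance of downstream adaptation strategies (such as decoder design) is underestimated, and that fine-tuning mainly changes the first linear layer of the MLP in ViT blocks rather than rewriting the model's representations uniformly, which motivates more representation-aware evaluation and adaptation strategies.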

Abstract: Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.


[21] Avatar V: Scaling Video-Reference Avatar Video Generation cs.CV

Benjamin Liang, Ce Chen, Desmond Lin, Ivan Somov, Jiajun Zhao

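TL;DR: This paper presents Avatar V, a production-scale framework for video-reference-conditioned identity modeling that generates avatar videos closely resembling a target individual both visually and behaviorally. The framework conditions directly on the full token sequence of a reference video to learn static identity attributes and dynamic behavioral patterns, and introduces Sparse Reference Attention, a motion representation stream, and an identity-aware super-resolution refiner.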

Details

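Motivation: Existing methods mostly condition on a single static image, which provides insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity.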

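Result: On the authors' cross-scene benchmark, Avatar V achieves state-of-the-art identity preservation, lip synchronization, and generation quality, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.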

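Insight: The innovations include: 1) video-reference-conditioned identity modeling that processes long reference sequences directly instead of compressing them into fixed-size embeddings; 2) Sparse Reference Attention, which achieves linear-complexity conditioning on arbitrarily long references; 3) a motion representation stream that supports closed-loop talking style transfer; 4) an identity-aware super-resolution refiner that inherits the full reference conditioning; 5) a large-scale data engine and a five-stage training pipeline comprising flow matching pre-training, personality fine-tuning, two-phase distillation, and RLHF alignment.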

Abstract: Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.


[22] RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation cs.CV | cs.LG | cs.RO

Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

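TL;DR: RT-VLA is a lightweight vision-language-action model built through knowledge distillation for end-to-end autonomous driving. It transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student, substantially reducing inference latency while keeping closed-loop driving and language reasoning performance competitive.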

Details

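Motivation: Existing vision-language-action models are capable, but their large vision-language backbones and reasoning modules cause high inference latency, which makes them hard to deploy on real road networks with strict real-time requirements.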

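Result: Compared with the SimLingo teacher, RT-VLA reduces inference time by 44.8x in vision-only mode and by 7.9x in vision+language mode while maintaining competitive closed-loop driving and language reasoning performance.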

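Insight: The core innovation is multi-level supervised knowledge distillation that efficiently compresses the complex capabilities of a large VLA model, including language reasoning and explainability, into a lightweight model. This balances real-time performance against model capability and offers a practical path toward real-time, explainable autonomous driving models.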

Abstract: Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.


[23] Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation cs.CV

Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov

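TL;DR: This paper proposes Prompt2Effect, a weight-driven hypernetwork for training-free visual-effect customization of image-to-video (I2V) diffusion models. It synthesizes effect-specific LoRA weights directly in a single forward pass, avoiding the data curation and iterative optimization costs of training a separate LoRA module per effect.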

Details

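Motivation: Customizing visual effects for I2V diffusion models currently requires training a separate LoRA module for each effect, which incurs substantial data curation and iterative optimization costs and hinders interactive control. The paper aims to amortize per-effect training by generating LoRA weights directly with a hypernetwork.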

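Result: Extensive experiments show that Prompt2Effect matches or exceeds conventional LoRA fine-tuning in video quality and effect alignment while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, the predicted weights further improve final performance and accelerate optimization by approximately 10x.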

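Insight: The innovations include: 1) explicitly conditioning the hypernetwork on the frozen base model weights, which grounds weight prediction in the structural geometry of each layer; 2) an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models.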

Abstract: Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.


[24] ViT-Up: Faithful Feature Upsampling for Vision Transformers cs.CV

Krispin Wandel, Jingchuan Wang, Hesheng Wang

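TL;DR: This paper proposes ViT-Up, an implicit feature upsampling framework for Vision Transformers that targets the bottleneck of small feature grids caused by the quadratic cost of global self-attention. It replaces external image guidance with layer-wise query construction from intermediate ViT hidden states, which allows feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space.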

Details

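Motivation: Because global self-attention is computationally expensive, ViTs usually run on small feature grids in dense prediction tasks, creating a feature bottleneck. Existing task-agnostic feature upsamplers rely on shallow image encoders for guidance, which can introduce feature leakage, fragmentation, and blur.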

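Result: On Cityscapes, ViT-Up improves over prior methods by up to +2.07 mIoU with the DINOv3-S+ backbone and +3.36 mIoU with the DINOv3-B backbone. On SPair-71k, PCK@0.10 improves by up to +4.17 and +8.09 respectively, indicating that ViT-Up's gains grow with backbone capacity.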

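Insight: The novelty is replacing external image guidance with layer-wise query construction from intermediate ViT hidden states, which yields implicit feature upsampling and avoids feature misalignment. Viewed objectively, keeping the upsampled features consistent with the backbone feature space is what improves performance on dense prediction and semantic correspondence.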

Abstract: Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.


[25] GarmentSketch: Large-scale Sketch-to-Fashion Benchmark cs.CV

Duong-Duy-Khang Bui, Minh-Tan Pham, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

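TL;DR: This paper presents GarmentSketch, a large-scale, high-quality sketch-to-fashion dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with a detailed textual description. Annotations are produced by a pipeline that combines multimodal large language models with human-in-the-loop refinement, and the dataset is used to evaluate state-of-the-art generative models on sketch-guided text-to-image generation.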

Details

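Motivation: Progress in sketch-based fashion image synthesis has been hindered by the lack of large-scale, high-quality paired sketch-image data in the fashion design domain; the paper addresses this gap.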

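Result: State-of-the-art generative models are benchmarked on GarmentSketch, providing baseline performance for sketch-guided text-to-image generation and revealing both the promise and the current limitations of existing methods.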

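Insight: The contribution is a large-scale, high-quality, multimodal (sketch-text) fashion dataset whose annotation pipeline integrates MLLMs with human refinement to ensure semantic accuracy and descriptive richness, laying a foundation for research on sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration.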

Abstract: Fashion sketching is a cornerstone of design workflows, allowing rapid visualization of creative concepts prior to physical prototyping. Yet, progress in sketch-based fashion image synthesis has been hindered by the absence of large-scale, high-quality paired resources. To bridge this gap, we present GarmentSketch, a novel dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with detailed textual descriptions. Captions were produced through a multi-stage pipeline that integrates multiple multimodal large language models (MLLMs) with human-in-the-loop refinement, ensuring both semantic accuracy and descriptive richness. We benchmark GarmentSketch on state-of-the-art generative models, providing baseline performance for sketch-guided text-to-image generation. Our experiments reveal both the promise and the current limitations of existing methods. By offering a comprehensive and richly annotated resource, GarmentSketch establishes a foundation for advancing sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration in design. The dataset will be available at: https://khangbdd.github.io/garmentsketch.


[26] Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention cs.CV

Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran

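TL;DR: This paper proposes FocusDiff, a tuning-free diffusion editing framework that achieves precise, region-specific image editing through refocusing cross-attention. It uses selective blurring to guide attention toward the target region while preserving object identity, structure, and appearance, and integrates context-preserving modules to ensure background fidelity. The framework is also extended to 360-degree indoor panorama editing and validated in virtual reality environments.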

Details

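Motivation: The paper addresses three main challenges in zero-shot text-guided diffusion editing: prompt brittleness (which demands meticulous prompt engineering), spillover edits (which unintentionally affect non-target regions), and failures on small or cluttered objects caused by limited fine-grained supervision in training data.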

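Result: On the localized editing benchmark LIMB (30 multi-object images and 100 annotated examples, including challenging small-object cases), FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving higher precision, photorealism, and usability.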

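Insight: The innovations are a target-aware mechanism based on refocusing cross-attention, which guides attention through selective blurring, and integrated context-preserving modules that ensure global coherence. Viewed objectively, extending precise editing to panoramas and VR environments improves the method's practicality and generality.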

Abstract: Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object’s identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.


[27] Self-Evolving Visual Questioner cs.CV | cs.LG

Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh

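TL;DR: This paper proposes a self-evolving framework for visual question generation that enables vision-language models to actively produce diverse, visual-centric, and challenging questions without external supervision. The model itself acts as both proposer and filter, continually raising the quality and difficulty of generated questions while maintaining exploration diversity to avoid training collapse.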

Details

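Motivation: Existing vision-language models are typically trained as passive answerers. Their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored, and questioner performance is bottlenecked by the availability of high-quality training data or the cost of curating it.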

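Result: Experiments across various backbone VLMs show that the method substantially improves the quality of autonomous question generation and expands its difficulty boundary. Under the same budget, its self-supervised training is more effective than training on the static source data, and the self-evolving questioner remains a competitive or even better answerer.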

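Insight: The core innovation is a self-evolving framework that needs no external supervision: the model proposes and filters its own questions, continually improving question quality and difficulty, while maintained exploration diversity prevents training collapse. This suggests a new direction for active learning capabilities in vision-language models.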

Abstract: Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners’ performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.


[28] WAM4D: Fast 4D World Action Model via Spatial Register Tokens cs.CV | cs.RO

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi

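TL;DR: WAM4D is a fast 4D world action model that addresses the fact that existing world action models (WAMs) operate in 2D or latent spaces and lack 3D spatial constraints and contact geometry. It introduces lightweight spatial register tokens to transfer pretrained geometric priors into a causal video-action transformer, then removes that branch at inference for efficient action generation.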

Details

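Motivation: Most existing WAMs operate in 2D video or latent spaces, where visually plausible predictions miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. Geometric foundation models can recover dense 3D structure and motion from visual observations, but forcing WAMs to predict dense 4D representations introduces costly geometric decoding and slows causal action generation; the paper targets this trade-off.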

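Result: Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.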

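Insight: The core innovation is the use of lightweight spatial register tokens as training-time future-depth readouts that transfer geometric priors and are removed at inference for efficiency. The paper also designs causal mixture attention for the Mixture-of-Transformers (MoT) backbone, defining modality-specific visibility among video, action, and geometry tokens to prevent non-causal shortcuts.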

Abstract: World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.


[29] Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI cs.CV | cs.CL

Wentao Ke, Jianche Liu

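TL;DR: This paper presents a two-stage deep learning framework for improving segmentation and clinical interpretation of multi-modal pediatric brain tumor MRI. First, 3D Res U-Net and Swin-UNETR baselines are evaluated on the BraTS-PEDs dataset. Second, diffusion-based refinement models conditioned on coarse Swin-UNETR segmentations sharpen boundaries, and a multimodal language model generates structured radiology-style reports.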

Details

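Motivation: The paper addresses challenges in pediatric brain tumor segmentation: limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions.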

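Result: On BraTS-PEDs MRI scans, the conditioned MedSegDiff model performs best on enhancing tumor boundary segmentation, achieving the lowest HD95 and the strongest boundary agreement.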

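Insight: The novelty is a coarse-to-refined diffusion segmentation strategy in which conditioning stabilizes the diffusion model and improves boundary accuracy, combined with a vision-language model to support an end-to-end interpretable AI-assisted neuro-oncology workflow.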

Abstract: Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.
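
The abstract reports HD95 without defining it. As a rough illustration, the sketch below computes a 95th-percentile Hausdorff distance between two small point sets. It assumes the inputs are boundary points, uses linear interpolation for the percentile, and takes the maximum of the two directed percentiles; evaluation toolkits differ on these details, so the paper's exact computation may not match. The function names are hypothetical.

```python
import math

def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def directed_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def hd95(a, b):
    """Symmetric 95th-percentile Hausdorff distance (one common convention)."""
    return max(percentile(directed_distances(a, b), 95),
               percentile(directed_distances(b, a), 95))

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
same = hd95(pred, truth)                  # identical boundaries
apart = hd95([(0.0, 0.0)], [(3.0, 4.0)])  # single points 5 units apart
```

Using the 95th percentile instead of the maximum makes the metric less sensitive to a few outlier boundary points, which is why a lower HD95 is read as better boundary agreement.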


[30] ShearFuse-UNet: Hadamard, DCT, and Shearlet Transform Fusion for Next-Day Wildfire Spread Prediction cs.CV

Ene Meco, Yingyi Luo, Emadeldeen Hamdan, Adam Watts, Ahmet Enis Cetin

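TL;DR: This paper proposes ShearFuse-UNet, a lightweight deep learning model for next-day wildfire spread prediction. The model fuses three complementary transform-domain branches in the U-Net encoder: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch, for efficient and accurate prediction.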

Details

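Motivation: The goal is to predict next-day wildfire spread from multi-modal satellite data with a low-parameter, computationally efficient model that combines the strengths of different mathematical transforms, addressing the high computational cost or inadequate feature representation that existing methods may suffer from.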

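Result: On the WildfireSpreadTS dataset, the model achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589). Results on the Google Next-Day Wildfire Spread dataset further support its favorable accuracy-efficiency trade-off.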

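Insight: The novelty is integrating fixed mathematical transforms (WHT, DCT, Shearlet) into a U-Net in a complementary way: the Shearlet branch explicitly encodes the elongated edge structures of fire fronts, while a learnable SpectralFusion gate adaptively fuses the WHT and DCT responses. The design is loosely analogous in structure to transformer self-attention but relies on fixed transforms rather than learned projections, which reduces parameter count and computational cost.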

Abstract: We propose ShearFuse-UNet, a lightweight and computationally efficient deep learning model for next-day wildfire spread prediction from multi-modal satellite data. The model integrates three complementary transform-domain branches inside each encoder block of a U-Net backbone: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch. The WHT and DCT branches establish orthogonal latent spaces with learnable spectral scaling and fixed soft-thresholding, while the Shearlet branch provides anisotropic, multi-directional feature decomposition that explicitly encodes the elongated edge structures characteristic of fire fronts. A learned SpectralFusion gate adaptively combines the WHT and DCT responses, and the Shearlet reconstruction is added as a residual. This three-branch design bears a loose structural analogy to transformer self-attention: the WHT and DCT branches provide complementary spectral representations that are adaptively fused, while the Shearlet branch contributes directional content through a residual pathway. Unlike self-attention, the proposed design relies on fixed mathematical transforms rather than learned projection operators, reducing parameter count and computational cost. Evaluated on the WildfireSpreadTS dataset, ShearFuse-UNet achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589) and demonstrating a highly favorable accuracy-efficiency trade-off. Results on the Google Next-Day Wildfire Spread dataset further validate these findings across a different benchmark.
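
To make the WHT branch more concrete, here is a minimal 1D sketch of a transform, scale, soft-threshold, and inverse-transform step. It is an illustration under assumptions rather than the paper's layer: the model uses a 2D transform inside a U-Net encoder, the order of scaling and thresholding is assumed, and `wht_block` with its `scale` and `t` parameters is a hypothetical name for this example.

```python
import math

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v  # butterfly step
        h *= 2
    return a

def soft_threshold(values, t):
    """Shrink each value toward zero by t, zeroing anything with |v| <= t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in values]

def wht_block(x, scale, t):
    """Transform, apply per-coefficient scaling, shrink, and invert.

    `scale` stands in for the learnable spectral scaling and `t` for the
    fixed threshold; the inverse WHT is the forward WHT divided by n.
    """
    coeffs = fwht(x)
    shrunk = soft_threshold([s * c for s, c in zip(scale, coeffs)], t)
    n = len(x)
    return [v / n for v in fwht(shrunk)]

signal = [1, 0, 1, 0]
spectrum = fwht(signal)                           # [2, 2, 0, 0]
identity = wht_block(signal, [1, 1, 1, 1], 0.0)   # unit scale, no shrinkage
```

The transform itself has no trainable weights; in this sketch only the per-coefficient scale would be learned, which is the kind of saving the abstract attributes to fixed transforms over learned projections.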


[31] FEMOT: Multi-Object Tracking using Frame and Event Cameras cs.CV | cs.AI

Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu

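TL;DR: This paper presents FEMOT, a large-scale RGB-event multi-object tracking dataset covering diverse real-world scenarios and 14 challenging attributes, which provides a platform for systematically evaluating RGB-event multi-object tracking methods. On this dataset the authors retrain and evaluate more than ten strong trackers to establish a comprehensive benchmark, and they propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain to improve tracking robustness.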

Details

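Motivation: Conventional RGB cameras degrade in multi-object tracking under complex real-world conditions such as motion blur, low illumination, and overexposure. Event cameras offer high temporal resolution and high dynamic range and thus provide complementary information, but the lack of large-scale annotated datasets has limited research on RGB-event multi-object tracking.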

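Result: Extensive experiments on the FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method and establish a comprehensive benchmark for future research.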

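Insight: The contributions are FEMOT, presented as the first large-scale RGB-event multi-object tracking dataset, and the FEMOTR framework, which decouples RGB and event features and fuses them in the frequency domain, exploiting their complementary characteristics for robust object localization and identity association.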

Abstract: Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on https://github.com/Event-AHU/FEMOT.


[32] A New Multi-Domain Benchmark for Micro-Action Recognition and Detection cs.CV

Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Gu

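TL;DR: This paper presents MMA-82, a large-scale multi-domain benchmark for micro-action recognition and detection. It extends the earlier MA-52 dataset to 82 fine-grained micro-action categories across four domains (laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos), with more than 77,000 annotated instances. On this dataset the paper establishes two core tasks, micro-action recognition and multi-label micro-action detection, and defines in-domain and cross-domain evaluation protocols, including few-shot and zero-shot settings. Experiments show that existing methods still struggle with realistic micro-action understanding, especially under domain shift and long-tailed distributions. The study also examines the relationship between micro-actions and emotion.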

Details

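Motivation: Existing micro-action benchmarks such as MA-52 are limited in scale, scene diversity, task coverage, and evaluation protocols, and cannot support more realistic and comprehensive micro-action analysis. A larger, multi-domain benchmark is therefore needed to advance the field.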

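Result: Extensive experiments on MMA-82 show that current methods still perform poorly on realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. The study also finds that micro-actions are strongly associated with emotional states and provide cues complementary to facial micro-expressions that improve emotion recognition.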

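Insight: The main contribution is a large-scale, multi-domain, fine-grained micro-action benchmark (MMA-82) with systematically defined evaluation protocols, including cross-domain and few-shot/zero-shot settings, for measuring model robustness and generalization. Linking micro-action analysis to emotion recognition also offers a new research perspective and data resource for human-centered AI.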

Abstract: Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://github.com/LpyNow/MMA-82.


[33] VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling cs.CVPDF

Xunzhi Xiang, Zixuan Duan, Yabo Chen, Zhengxuan Wei, Guiyu Zhang

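TL;DR: This paper proposes VideoWeave, a latent-space post-training framework for video generation that targets two common failures of large-scale video diffusion models: 3D structure drifting over time and implausible motion under viewpoint changes. It jointly models implicit geometry-model features with video latents in a shared denoising space, constraining the generative distribution more flexibly and reducing sensitivity to errors from upstream explicit geometry reconstruction.

Details

Motivation: Existing methods usually rely on explicit geometry reconstructions (such as depth maps and point clouds) to enforce geometric consistency, which leaves the generator exposed to errors from upstream geometry pipelines. The paper aims for a more flexible, non-rigid form of implicit geometry guidance that better preserves geometric consistency in generated video.

Result: Experiments on text-to-video and image-to-video generation show that VideoWeave significantly improves the geometric coherence of generated videos while preserving strong visual quality.

Insight: The novelty is a latent-space post-training framework for joint geometry-video modeling that constrains generation with implicit geometry features rather than explicit reconstructions, reducing dependence on the accuracy of upstream geometry pipelines. The authors also build GeoVid-80K, a dataset of 80K videos with paired appearance and geometry representations, to support the joint modeling.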

Abstract: Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency by using explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, to define conditions, supervision, or reward signals, making the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, providing a more flexible and non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. VideoWeave project page at https://videoweave.github.io/


[34] Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic cs.CV | cs.ROPDF

Qingping Zeng, Fei She

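TL;DR: This paper asks whether the best vision encoder identified on a small backbone in vision-language-action (VLA) policies keeps its advantage when transferred to a larger backbone. The authors propose a frozen-backbone grafting diagnostic that replaces a VLA's vision encoder under a fixed protocol, and run it on two VLA backbones of different scale (SmolVLA-450M and π0.5-3.3B).

Details

Motivation: In VLA policy development, it is unclear whether a vision encoder choice validated on a small VLA backbone transfers reliably to a larger one. Answering this helps avoid suboptimal encoder choices when scaling up.

Result: The SigLIP encoder, which is best on the small backbone (SmolVLA), is not always best on the large backbone (π0.5). On the LIBERO spatial suite, DINOv2-small leads on the large backbone, while on the object suite the results are seed-sensitive and close to a tie. Three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent encoder rankings.

Insight: The core contribution is the frozen-backbone grafting diagnostic, a cheap way to evaluate encoders against the target backbone before committing at scale. The key finding is that encoder rankings do not transfer across backbones, and that the grafting wrapper itself changes MSE with opposite sign on the two backbones, so all conclusions are conditional on the fixed grafting protocol.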

Abstract: Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and $π_{0.5}$-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on $π_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45-56% MSE on the SmolVLA native tower, -50-52% on $π_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.


[35] A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators cs.CV | cs.CLPDF

Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

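TL;DR: This paper proposes SGFF-Net, a multi-domain feature fusion framework for improving the generalization of deepfake detection. Using a dual residual learning architecture, it integrates spatial, gradient, and discrete wavelet transform (DWT)-based frequency representations to handle forged content from different generators, such as GANs and diffusion models.

Details

Motivation: Existing spatial- or frequency-based deepfake detectors perform well on GAN-generated forgeries but struggle with images from newer diffusion models, and they rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness.

Result: SGFF-Net reaches 98.95% accuracy in intra-dataset evaluation, and 70.46% and 69.94% accuracy in cross-model and cross-paradigm evaluation, respectively. With multi-source training and data augmentation, cross-model accuracy rises to 79.80%, cross-paradigm accuracy to 78%, and accuracy on real-world data from 61.50% to 75.80%.

Insight: The novelty is a dual residual learning framework that fuses spatial, gradient, and frequency features to learn complementary forensic cues. Viewed objectively, combining multi-domain representations with data diversity and augmentation substantially improves generalization across generators and paradigms, offering practical guidance for building more reliable deepfake detection systems.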

Abstract: Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on deepfakes generated by Generative Adversarial Networks (GANs), they often struggle with recent diffusion model-generated images. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.
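To make the DWT-based frequency representation more concrete, here is a minimal sketch of a single-level 2D Haar transform in plain Python. The abstract does not say which wavelet or decomposition depth SGFF-Net uses, so the Haar choice, the function name `haar_dwt2`, and the sub-band names are assumptions for illustration only.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of a grayscale image given as a list of rows.

    Returns four half-resolution sub-bands:
    approx (local average), col_diff (difference between columns),
    row_diff (difference between rows), diag_diff (diagonal difference).
    The Haar wavelet is an assumed choice, not taken from the paper.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, col_diff, row_diff, diag_diff = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block: top-left, top-right, bottom-left, bottom-right.
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((tl + tr + bl + br) / 2.0)
            c_row.append((tl - tr + bl - br) / 2.0)
            r_row.append((tl + tr - bl - br) / 2.0)
            d_row.append((tl - tr - bl + br) / 2.0)
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag_diff.append(d_row)
    return approx, col_diff, row_diff, diag_diff


# A 2x2 patch with a vertical edge: only the column-difference band responds.
edge_patch = [[0.0, 1.0],
              [0.0, 1.0]]
approx, col_diff, row_diff, diag_diff = haar_dwt2(edge_patch)
```

In a detector of this kind, the detail sub-bands would typically be fed to the network alongside the spatial and gradient inputs; how SGFF-Net actually combines them is not specified in the abstract.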


[36] Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision cs.CV | cs.AIPDF

Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi

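TL;DR: This paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for Hajj crowd-image synthesis and data augmentation. Built on Pix2Pix, the model encodes structural and contextual attributes in an eight-channel input and integrates multi-scale discriminators to improve texture detail in dense scenes. The framework is used to generate CrowdH, a synthetic dataset of 10,000 high-resolution images. Experiments show that its synthesis quality exceeds the baselines and that the synthetic data improves downstream crowd-counting models.

Details

Motivation: Crowd-counting models for Hajj scenes are hard to develop because domain-specific annotated data is scarce and collecting data at large gatherings raises privacy concerns. The paper aims to generate realistic Hajj crowd images through conditional image synthesis and use them for data augmentation to ease the data shortage.

Result: For synthesis quality, P2P-H outperforms the Pix2Pix and StyleGAN2-ADA baselines on structure-preserving conditional synthesis and transfers favorably to other crowd datasets. For the downstream evaluation, training on the mixed dataset CrowdH-Mix-469, which includes 85 selected synthetic images, lowers mean absolute error (MAE) for all five crowd-counting models, with the largest gain for CSRNet.

Insight: The novelty is a multi-channel conditional generation framework that combines structural guidance (edges, grayscale) with weak attribute supervision (crowd density, time of day), together with multi-scale PatchGAN discriminators that capture detail in dense scenes. Viewed objectively, deriving conditioning attributes automatically to reduce manual labeling, and building a mixed real-synthetic dataset for downstream evaluation, offer a practical workflow for domain-specific data synthesis and augmentation.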

Abstract: Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.


[37] One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs cs.CVPDF

Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan

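TL;DR: This paper proposes Adaptive Layer-wise Visual Token Selection (ALVTS), a framework that addresses the computational burden that lengthy visual token sequences impose on large vision-language models (LVLMs). Breaking from the conventional static token pruning paradigm, it uses a lightweight token selector to dynamically choose the important tokens for each layer to process while letting less important tokens skip that layer. This yields adaptive compression across layers and significantly improves inference efficiency while maintaining high accuracy.

Details

Motivation: Existing visual token pruning methods have a fundamental limitation: once a token is pruned at a given layer, no later layer can access it, causing premature information loss that can hurt performance. The authors observe empirically that different layers focus on different visual regions, which implies that the optimal token subset varies by layer and motivates a dynamic, layer-adaptive selection mechanism.

Result: Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the method. At an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, a better efficiency-accuracy trade-off for LVLM inference.

Insight: The core novelty is a layer-adaptive dynamic token selection and routing mechanism that lets tokens participate selectively in computation at different layers instead of being pruned once and for all. A second key insight is that, through an importance-consistency-constrained low-rank approximation, the token selection module can closely emulate the patterns of full attention without retraining the model, suggesting a new route to efficient LVLM inference.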

Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model’s accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
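The select, skip, and reintegrate idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the importance scores are passed in directly (in the paper a lightweight selector produces them), `layer_fn` is a hypothetical per-token stand-in for a transformer layer, and `route_layer` and `keep_ratio` are names invented here.

```python
def route_layer(tokens, scores, keep_ratio, layer_fn):
    """Apply layer_fn only to the highest-scoring tokens.

    Tokens that are not selected skip the layer unchanged and are merged
    back in their original positions, so later layers can still use them.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:n_keep])
    return [layer_fn(t) if i in selected else t for i, t in enumerate(tokens)]


def toy_layer(x):
    # Hypothetical stand-in for an expensive layer computation.
    return x * 10


tokens = [1, 2, 3, 4]
# Layer 1 favors tokens 0 and 2; layer 2 favors tokens 1 and 3.
after_layer1 = route_layer(tokens, [0.9, 0.1, 0.8, 0.2], 0.5, toy_layer)
after_layer2 = route_layer(after_layer1, [0.1, 0.9, 0.2, 0.8], 0.5, toy_layer)
```

Unlike static pruning, the tokens skipped at the first layer are still present at the second, where a different subset is selected.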


[38] What Drives Test-Time Adaptation for CLIP? A Controlled Empirical Study from an Update Perspective cs.CV | cs.LGPDF

Jiazhen Huang, Xiao Chen, Zhiming Liu, Yaru Sun, Jingyan Jiang

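TL;DR: This paper presents a systematic empirical study of test-time adaptation (TTA) methods for CLIP. By grouping existing methods into three update paradigms and introducing the standardized TTABC benchmark, it identifies the main drivers of TTA4CLIP gains, efficient ways of using evidence, and how the preferred adaptation paradigm depends on the type of distribution shift.

Details

Motivation: TTA4CLIP methods are developing quickly, but there is little understanding of what actually drives their gains, where the gains originate, and under which distribution shifts they remain reliable. The paper fills this gap with a controlled empirical study.

Result: Based on the TTABC benchmark, which integrates more than 20 representative methods, the study finds that gains in parameter-update methods come mainly from test-time evidence and reliable proxies rather than heavy optimization; that efficient, competitive performance is achievable with cross-sample or current-sample evidence and lightweight prototype updates; and that no single adaptation paradigm is best under all shift types, so the best choice depends on the nature of the shift.

Insight: The novelty is a unified categorization of TTA methods from the update perspective, plus large-scale benchmark experiments that reveal the central role of test-time evidence and the effectiveness of lightweight updates. It stresses that the adaptation strategy must match the characteristics of the distribution shift, providing a methodological basis for future research.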

Abstract: Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where their gains originate, and under which shifts they remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.


[39] CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation cs.CVPDF

Sihan Zhuang, Xinyuan Chen, Tianfan Xue, Yaohui Wang

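TL;DR: This paper proposes CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. The core idea is to use a vision-language model to decompose the text prompt into causally consistent keyframes and object-centric motion trajectories, then apply them as soft constraints that guide a pretrained video diffusion model, explicitly modeling object dynamics and causal transitions.

Details

Motivation: Diffusion-based video generation has improved in visual quality and short-term temporal coherence, but still struggles to produce physically consistent and causally plausible dynamics in scenarios with long-horizon interactions. This is because video diffusion models learn physical consistency mainly implicitly, whereas vision-language models can model physical laws directly.

Result: Extensive experiments show that the method consistently improves physical plausibility and temporal coherence, especially in dynamics-intensive scenarios, while maintaining high perceptual video quality.

Insight: The novelty is decoupling reasoning from generation: a vision-language model produces structured intermediate representations (causal keyframes and object trajectories) that guide the video diffusion model as soft constraints, enabling explicit physical modeling without additional training or supervision.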

Abstract: Recent advances in diffusion-based video generation have significantly improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. This limitation arises from the fact that video diffusion models primarily learn physical consistency implicitly, while vision-language models can directly model physical laws. Building on this observation, we propose CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation by leveraging a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints to guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without requiring additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.


[40] ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models cs.CVPDF

Dong Han, Yong Li

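TL;DR: This paper proposes ForceForget, which uses reinforcement learning to optimize a concept erasing reward so that unsafe content is removed from text-to-image models while the model's ability to interpret safe semantics is preserved. It introduces a Safe Adapter that regulates concepts efficiently in the cross-attention layers and avoids over-erasure. Experiments show that the method effectively reduces unsafe content generation across multiple datasets while keeping benign images at high fidelity, and that it outperforms existing SOTA methods in robustness and in image-to-image scenarios.

Details

Motivation: Existing concept erasing methods for text-to-image models tend to over-erase unsafe concepts and suppress benign concepts contained in harmful prompts, which hurts model utility. The paper aims to eliminate unsafe content while maintaining the model's capability to interpret safe semantics.

Result: Extensive experiments on different datasets show that, compared with existing SOTA concept erasing methods, the method is effective at mitigating unsafe content generation while preserving high fidelity for benign images. It is more robust than its counterparts against red-teaming tools and more effective in image-to-image scenarios.

Insight: The novelty is optimizing a concept erasing reward with reinforcement learning and introducing a Safe Adapter that projects part of the text embedding in the cross-attention layers for efficient concept regulation, balancing safety and utility. The method also extends to erasing general concepts such as artistic styles and objects.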

Abstract: With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to excessively erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining the model's capability to interpret safe semantic meaning by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter to project partial text embedding for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.


[41] FLaRA: Predicting Future Latent Representations for Accident Anticipation cs.CVPDF

Lorenzo Caselli, Tomaso Trinci, Tommaso Bianconcini, Simone Magistri, Leonardo Taccari

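TL;DR: This paper proposes FLaRA, a predictive architecture for anticipating traffic accidents from dashcam videos. It anticipates accidents by forecasting future latent representations instead of mapping visual context directly to a collision probability.

Details

Motivation: Existing methods typically map visual context directly to a collision probability without explicitly modeling how the driving scene will evolve. FLaRA aims to change this paradigm by forecasting future latent representations, for more effective accident anticipation.

Result: Extensive evaluation on the Nexar dataset, together with cross-domain validation on the DAD, DADA-2000, and DoTA benchmarks, shows that the method achieves state-of-the-art performance while maintaining realistic early-warning capability.

Insight: The novelty is shifting the paradigm from direct classification to forecasting future latent representations, with a joint training objective (an auxiliary feature-level reconstruction loss plus a cross-entropy classification loss) that keeps the forecasts grounded in realistic future dynamics.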

Abstract: Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that shifts this paradigm by forecasting future latent representations for accident anticipation. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To ensure these forecasts remain grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validations on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early warning capabilities.
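A joint objective of this shape can be sketched as a weighted sum of a classification term and a feature-level reconstruction term. The abstract does not give the exact losses or weighting, so the use of mean squared error for the reconstruction term, the `recon_weight` parameter, and all function names below are assumptions.

```python
import math


def feature_mse(predicted, target):
    """Mean squared error between predicted and actual future latent features."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy(logits, label):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


def joint_loss(pred_feats, future_feats, logits, label, recon_weight=1.0):
    """Classification loss plus a weighted auxiliary reconstruction loss.

    recon_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return cross_entropy(logits, label) + recon_weight * feature_mse(pred_feats, future_feats)


# Toy example: two-dimensional latents and a two-class (no accident / accident) head.
loss = joint_loss([1.0, 2.0], [1.0, 4.0], [0.0, 0.0], label=1, recon_weight=0.5)
```

The reconstruction term penalizes predicted latents that drift from the features of the frames that actually follow, which is one way to keep the classifier's input tied to plausible futures.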


[42] IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products cs.CVPDF

Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang

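TL;DR: This paper presents IndustryBench-MIPU, the first multi-image understanding benchmark for industrial products, which evaluates how well multimodal large language models extract structured attribute-value pairs from multiple heterogeneous product images. The benchmark contains 27,652 images, 4,559 products, and 103,703 annotations across 18 industrial categories. The evaluation finds that existing models have high precision but low recall on single images, and that recall drops further in the multi-image setting, indicating that multi-image completeness is the core bottleneck.

Details

Motivation: The technical specifications of industrial products are scattered across multiple heterogeneous images such as specification tables, nameplates, and technical drawings, and whether multimodal large language models can reliably recover structured attribute-value pairs from them is underexplored. A dedicated benchmark is therefore needed to evaluate and advance this area.

Result: Nine MLLMs are evaluated in single-image and multi-image settings. Precision reaches 86-94%, but the best model recovers only 49.9% of product-level attributes. Moving from single-image to multi-image extraction lowers recall by 15-34 percentage points, indicating that multi-image completeness is the main bottleneck.

Insight: The novelty is the first large-scale, multi-image, cross-category benchmark for industrial product attribute extraction, with an emphasis on integrating evidence across images. Viewed objectively, the study exposes a marked weakness of MLLMs in multi-image understanding and information integration for industrial scenarios and gives a clear direction for model improvement.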

Abstract: Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction – recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86–94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15–34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
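The gap between high precision and low recall is easy to see in a small sketch of product-level scoring. The benchmark's actual matching rule and merging strategy are not described in the abstract, so exact string matching, the first-image-wins merge, the function names, and the attribute values below are all illustrative assumptions.

```python
def merge_predictions(per_image):
    """Combine per-image attribute dicts into one product-level dict.

    On conflicts the first image's value is kept (an assumed policy).
    """
    merged = {}
    for predictions in per_image:
        for attribute, value in predictions.items():
            merged.setdefault(attribute, value)
    return merged


def precision_recall(predicted, gold):
    """Exact-match precision and recall over attribute-value pairs."""
    correct = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up circuit-breaker example: four gold attributes spread over several images.
gold = {"voltage": "400V", "current": "63A", "poles": "3", "standard": "IEC 60947-2"}
per_image = [{"voltage": "400V"}, {"current": "63A", "voltage": "230V"}]
merged = merge_predictions(per_image)
precision, recall = precision_recall(merged, gold)
```

Here every extracted pair is correct, yet half of the product's attributes are never recovered, which mirrors the completeness gap the benchmark reports.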


[43] Scratched Lenses, Shifted Depth: Passive Camera-Side Optical Attacks cs.CVPDF

Qinlin He, Zeming Zhuang, Yongji Wu, Lan Zhang, Xiaoyong

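TL;DR: This paper proposes SLASH, a new physical adversarial attack in which small scratches on a camera lens produce optical artifacts under particular lighting conditions, disrupting geometric inference tasks such as monocular depth estimation and 3D object detection. The attack is both persistent and selective, and its effectiveness is validated in digital and real-world settings.

Details

Motivation: Existing physical adversarial attacks mostly focus on scene manipulation (such as adversarial patches) and overlook how camera-side optical imperfections interact with scene lighting. The paper explores how passive lens-side damage can act as a scene-triggered adversarial mechanism, challenging assumptions about the physical robustness of vision systems.

Result: On monocular depth estimation, the attack causes up to 32% relative depth error, and it has consistent effects on monocular 3D object detection. Real-world physical experiments confirm that the attack transfers to actual camera recordings, inducing depth shifts above the model's natural prediction baseline.

Insight: The novelty is modeling lens scratches as a trigger-conditioned optical channel and optimizing the attack in optical space rather than image space. The work shows that benign-looking hardware imperfections can act as latent, scene-triggered adversarial mechanisms, offering a new perspective on defenses for secure vision systems.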

Abstract: Physical adversarial attacks on vision systems are typically studied through scene manipulation, such as adversarial patches or projections, where the adversary controls what the camera observes. Camera-side attacks using stickers or auxiliary optics have also been explored, but they treat attacks as image-space perturbations from designed patterns. This misses how physical imperfections interact with scene-dependent lighting and optics. We identify a threat: passive lens-side damage that is persistent yet trigger-conditioned, producing optical artifacts that bias geometric inference under particular visual conditions. We instantiate this threat through Scratch-induced Lens Adversarial Streak Hijacking (SLASH), a physical-world attack caused by small scratches on a camera lens or protective cover. Scratches interact with bright light sources and specular reflections to create structured streak artifacts that distort depth cues. Since the perturbation is fixed in the optical path but triggered by the scene, it is both persistent and selective. We formulate the attack in optical space, model the scratch pattern as a trigger-conditioned optical channel, and optimize one fixed configuration across diverse viewing conditions. We evaluate SLASH on monocular depth estimation and monocular 3D object detection in digital and real-world settings. Under the fixed-scratch constraint, directional depth shifts reach up to 32% relative error for monocular depth estimation, with consistent effects on monocular 3D object detection. Physical experiments confirm transfer to real camera recordings, inducing depth shifts above the model’s natural prediction baseline. These findings reveal an attack surface where benign-looking hardware imperfections act as latent, scene-triggered adversarial mechanisms, challenging assumptions about physical robustness and motivating defenses for secure vision systems.


[44] NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests cs.CV | cs.LGPDF

Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot

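TL;DR: This paper introduces NEST3D, a high-resolution multimodal drone dataset of 104 nest-bearing trees, with RGB images, multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels, for studying the 3D structure of sociable weaver nests.

Details

Motivation: Prior studies lack fine-grained 3D structural data on sociable weaver nests, and accurate 3D data is hard to obtain because of the nests' irregular geometry and the complex surrounding vegetation. A high-quality multimodal dataset is needed to fill this gap.

Result: In the semantic segmentation benchmark, Point Transformer V3 achieves 86.35% mIoU on the test set, outperforming convolution-based and point-wise methods such as KPConv and RandLA-Net and highlighting architecture-dependent performance differences.

Insight: The dataset combines spectral, spatial, and structural information, advancing 3D reconstruction, segmentation, and classification algorithms and providing a basis for ecological applications such as nest volume estimation and species conservation. As a benchmark with extreme class imbalance, it also exposes the performance limits of different network architectures.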

Abstract: Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.
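Because the benchmark is reported in mIoU under extreme class imbalance, a short sketch of the metric shows why it is stricter than plain accuracy. This is the generic definition computed over flat label lists; skipping classes that are absent from both prediction and ground truth is an assumed convention, and the toy labels are made up rather than drawn from NEST3D.

```python
def mean_iou(predicted, gold, num_classes):
    """Mean intersection-over-union across classes for per-point labels."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, g in zip(predicted, gold) if p == cls and g == cls)
        union = sum(1 for p, g in zip(predicted, gold) if p == cls or g == cls)
        if union:  # skip classes that appear in neither list (assumed convention)
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in either label list")
    return sum(ious) / len(ious)


# Imbalanced toy case: 8 background points (class 0) and 2 nest points (class 1).
gold = [0] * 8 + [1] * 2
predicted = [0] * 10  # a model that ignores the rare class entirely
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
miou = mean_iou(predicted, gold, num_classes=2)
```

The degenerate model scores 0.8 accuracy but only 0.4 mIoU, since the missed rare class contributes an IoU of zero to the average.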


[45] A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications cs.CVPDF

Paul Koch, Paul Hofmann, Ferdinand Waßelewsky, Adem Karakurt, Andre Sérs

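TL;DR: This paper gives a qualitative review of the potential of generative-AI-based data generation and augmentation methods in industrial computer vision applications, aiming to resolve the "chicken-and-egg" dilemma in which high-quality data is scarce and user trust is hard to establish.

Details

Motivation: Industrial computer vision applications need large amounts of data to ensure predictable performance and build user trust, but data is hard to acquire. Conventional active learning accumulates data gradually during project deployment, and the resulting unstable performance often costs user trust, producing a deadlock in which neither the data nor the application develops.

Result: The paper reports no quantitative experimental results. It qualitatively reviews existing state-of-the-art methods and assesses their adaptability to an industrial vision classification use case.

Insight: The novelty is a systematic look at the potential of GenAI to expand data automatically in the initial phase of a project, along with the observation that current methods suffer from a domain mismatch between the source (training environment) and the target (industrial use case), particularly in context defined in natural language and in object characteristics.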

Abstract: AI-driven computer vision applications require an extensive database to ensure predictable behavior and performance. Such predictable behavior is especially important for industrial applications in gaining trust from users. However, such a database is not readily available in industrial applications, and its acquisition is not trivial either. Active learning methods can be applied to ramp up data within a project deployment to iteratively increase the database, and thus the application predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. This leads to a “chicken-and-egg” dilemma in which neither the database nor the application is developed. In this work, we review state-of-the-art methods and approaches to further boost the database during the initial active data ramp-up phase. Here, we focus on recent advancements in GenAI-based data generation and augmentation methods and review their adaptability to an industrial computer vision classification use case. Although we observe a potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and the target (industrial use case) regarding context defined in natural language and object characteristics.


[46] HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities cs.CVPDF

Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He

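TL;DR: This paper proposes HPSv3++, a reward model framework that addresses the limited generalization of existing reward models such as HPSv3, whose training data is outdated and does not account for evolving diffusion model capabilities or changes across reinforcement learning iterations. By building HPDv3++, a large-scale dual-dimension preference dataset, and designing a two-stage training framework (data-aware orthogonal gradient projection, followed by a capability-iteration conditioned signal combined with unsupervised guidance), the model adapts to text-to-image models at different capability levels and training stages.

Details

Motivation: The training data of existing reward models such as HPSv3 is usually annotated on outputs of earlier text-to-image models and cannot track the shifts in quality discrimination brought by evolving model capabilities and reinforcement learning iterations, which limits their use in broader settings.

Result: HPSv3++ reaches SOTA on several benchmarks: it improves on HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, and achieves 79.1%/88.1% accuracy on the authors' HPDv3++ dataset. When used for text-to-image reinforcement learning, it consistently improves GenEval scores across different models.

Insight: The contributions are: 1) HPDv3++, a large-scale dual-dimension (text fidelity and aesthetic quality) dataset annotated using a high-capability model (Qwen-Image) with human supervision; 2) a two-stage training framework, in which Stage 1 uses data-aware orthogonal gradient projection to incorporate new aesthetic perception while preserving existing effective knowledge, and Stage 2 introduces a joint capability-iteration conditioned signal and a standard-deviation-driven unsupervised guidance mechanism to make the reward model more robust across the full capability-iteration spectrum.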

Abstract: Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discriminative shifts arising from evolving model capabilities and reinforcement learning (RL) iterations, limiting their broader applicability. In this work, we propose HPSv3++, a reward model framework that adapts the HPSv3 model to varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-range capabilities. The code is available at https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus.
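The basic operation behind orthogonal gradient projection can be sketched as removing from a new gradient its component along a reference direction that should be protected. The abstract does not describe what makes the Stage 1 projection "data-aware" or how the reference direction is chosen, so this is only the generic projection step; `project_orthogonal` and both vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(grad, reference):
    """Remove from grad its component along reference.

    The result is orthogonal to reference, so a small step along it leaves the
    loss associated with reference unchanged to first order.
    """
    ref_sq = dot(reference, reference)
    if ref_sq == 0.0:
        return list(grad)
    scale = dot(grad, reference) / ref_sq
    return [g - scale * r for g, r in zip(grad, reference)]


# Hypothetical gradients: the new-data gradient and a direction to protect.
new_grad = [3.0, 4.0]
protected = [1.0, 0.0]
safe_grad = project_orthogonal(new_grad, protected)
```

In a setting like Stage 1, the protected direction would stand for knowledge already captured by HPSv3 and the projected gradient would carry only the update from the new preference data; that mapping is an interpretation, not a detail stated in the abstract.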


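[47] S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning cs.CV [PDF]

Shilong Xiang, Zirui Zhang, Chengzhi Mao

TL;DR: This paper proposes S$^2$COPE, a self-supervised concept discovery framework that uses preference learning to discover structured visual concepts directly from raw images, with no human annotation. Rather than treating vision large language models (VLLMs) as static feature extractors, the method makes them active participants in a self-supervised preference optimization loop, and it extracts novel concepts across natural, medical, and physics domains that standard VLLMs often fail to generate.

Details

Motivation: Current representation learning forces a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, while interpretable models are bottlenecked by the need for dense human annotation. The paper aims to resolve this dilemma with label-free, interpretable concept discovery.

Result: In extensive experiments across natural, medical, and physics domains, S$^2$COPE extracts domain-specific concepts. By amortizing concept discovery directly into the VLLM backbone through the self-supervised preference objective, it improves downstream top-1 classification accuracy on unseen data by up to 24 absolute points.

Insight: The novelty is making VLLMs active participants in a self-supervised preference optimization loop that autonomously hypothesizes, validates, and reinforces candidate visual attributes to discover new concepts. This suggests that interpretability can emerge from a model's autonomous interaction with incidental visual structures, without any human supervision.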

Abstract: Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S$^2$COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S$^2$COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S$^2$COPE successfully extracts domain-specific concepts that standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective, rather than relying on static generation and disjoint filtering, we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model’s autonomous interaction with incidental visual structures, without any human supervision.


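[48] Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications cs.CV | cs.AI [PDF]

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

TL;DR: This paper studies attacks on computer vision applications that use acoustic vibration at audible frequencies (<20 kHz). Resonating a commercially available camera disturbs its internal stabilization mechanism and introduces artifacts into the image, causing AI-based CV models such as YOLO11 to misclassify, miss targets, or hallucinate objects. Compared with earlier short-range attacks that used ultrasound (>20 kHz), the lower frequencies allow attacks from greater distances. The paper also analyzes how different image and object features are affected.

Details

Motivation: AI is widely used in computer vision applications such as autonomous driving, facial recognition, and security monitoring, yet prior work shows that acoustic vibration can interfere with camera stabilization systems and cause CV models to fail. Earlier attacks at ultrasonic frequencies were limited to short range, so this paper explores audible frequencies for longer-range attacks and analyzes how the attack affects different image features.

Result: Physical experiments on a commercially available camera confirm that the attack is feasible. Tests with an off-the-shelf object detection model (YOLO11) show concrete cases in which the attack degrades model performance.

Insight: The novelty lies in using audible frequencies to extend the range of acoustic adversarial attacks and in systematically analyzing how sensitive different image features are to the attack. The findings offer insights for developing future defenses and expose a physical-layer security vulnerability in CV systems.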

Abstract: Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz), which limits attacks to short distances because high frequencies attenuate strongly. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.


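[49] Memento: Reconstruct to Remember for Consistent Long Video Generation cs.CV [PDF]

Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang

TL;DR: Memento is a framework for long video generation that jointly trains autoregressive next-shot generation with memory-based subject reconstruction, so that recurring subjects stay consistent across shots, viewpoints, and scene transitions. It introduces a dual-query memory mechanism to separate long-range subject evidence from short-range contextual cues, and uses a subject-aware cinematic data pipeline to provide precise reconstruction supervision.

Details

Motivation: Existing temporal decomposition methods mainly optimize for plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence, so recurring subjects may be diluted, overwritten, or forgotten as generation proceeds.

Result: Experiments show that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

Insight: Subject preservation is treated as an explicit identity grounding problem, with subject reconstruction guiding how the memory bank is built. The dual-query memory mechanism separates long-range identity information from short-range context, and the subject-aware data pipeline provides precise, pronoun-free descriptions to strengthen supervision.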

Abstract: Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.


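[50] CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification cs.CV | cs.AI [PDF]

Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan

TL;DR: This paper presents CottonLeafVision, a framework for accurate classification and detection of cotton leaf diseases. It evaluates several pretrained deep convolutional neural networks (DenseNet201, InceptionV3, and VGG19) on a public dataset, with DenseNet201 achieving the highest classification accuracy of 98%. To improve reliability and interpretability, it applies Grad-CAM, occlusion sensitivity analysis, and adversarial training, and it develops a prototype system for use in real agricultural settings.

Details

Motivation: Cotton is an economically important crop worldwide, and accurate identification of its leaf diseases matters for the textile industry and for economic stability. This calls for a robust, explainable deep learning framework that can handle the challenges of real field conditions.

Result: On a public cotton leaf disease image dataset with seven classes (six diseases plus healthy leaves), DenseNet201 reaches 98% classification accuracy.

Insight: The novelty lies in combining several explainability methods (such as Grad-CAM and occlusion analysis) with adversarial training to improve model robustness and trustworthiness, and in building a prototype system aimed at practical agricultural use, which brings deep learning closer to deployment in crop disease management.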

Abstract: Globally, cotton is an economically important crop, and the textile industry depends heavily on it, so precise identification and detection of cotton leaf disease is crucial for economic stability. CottonLeafVision aims to classify and detect cotton leaf disease accurately. To this end, we evaluate multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset has seven classes (six disease classes and one healthy class) and was collected under various field conditions that reflect real-world challenges. Among these pretrained models, DenseNet201 achieves the highest classification accuracy, 98%. To improve the model's reliability and interpretability, we apply Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and we use adversarial training to increase the model's resistance to noise. Finally, we develop a prototype that applies the model in real-life agriculture. The paper demonstrates the deep learning model’s ability to classify disease in real-life cotton disease management settings.
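
Occlusion sensitivity analysis, one of the interpretability techniques named in the abstract, can be sketched in a few lines: mask one patch at a time and record how much the class score drops. The version below is a minimal illustration on a nested-list "image" with a toy scoring function; `toy_score`, the patch size, and the fill value are assumptions for the example and have nothing to do with the paper's DenseNet201 setup.

```python
def occlusion_sensitivity(image, score_fn, patch=2, fill=0.0):
    """Return a heatmap of score drops when each patch is masked.

    image is a list of rows of floats; score_fn maps an image to a
    scalar class score. Larger heatmap values mark regions the score
    depends on more strongly.
    """
    height, width = len(image), len(image[0])
    base = score_fn(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy, then mask one patch
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = base - score_fn(occluded)
            for r in rows:
                for c in cols:
                    heat[r][c] = drop
    return heat


def toy_score(img):
    """Toy stand-in for a classifier: responds only to the top-left 2x2 region."""
    return img[0][0] + img[0][1] + img[1][0] + img[1][1]


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_sensitivity(image, toy_score)
```

With this toy scorer, only the top-left patch produces a nonzero drop, which is the behaviour one would hope to see over the lesion area of a diseased leaf.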


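[51] HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification cs.CV | cs.LG [PDF]

Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid, Riasat Khan

TL;DR: This paper proposes HumP-KD, a hybrid uncertainty-aware multi-stage progressive knowledge distillation framework for efficient fire classification. Through three tightly integrated components, it distills knowledge from two frozen heterogeneous transformer teachers (Swin-Tiny and ViT-Base) and their Meta-MLP ensemble into a lightweight MobileViT-S student. On a dataset of 31,309 images, the method substantially improves the student's performance while greatly reducing parameter count and model size, enabling real-time deployment at high frame rates.

Details

Motivation: Real-time fire classification systems need models that are accurate, computationally efficient, and deployable on resource-constrained hardware. Existing models struggle to balance accuracy and efficiency, so a method is needed that improves classification performance while staying lightweight.

Result: On Dataset-II, HumP-KD achieves a mean F1 score of 0.9876 ± 0.0063 over 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation (0.9537 ± 0.0351); statistical significance is confirmed by a t-test and a Wilcoxon signed-rank test. The student model has only 4.94M parameters and a size of 19.01Mb, a 5.7x to 17.5x reduction in parameters relative to the teachers, and runs at 37.72 FPS on CPU. The method also generalizes well across datasets and is robust under degraded visual conditions.

Insight: The novelty is a hybrid uncertainty-aware multi-stage progressive distillation framework whose core comprises Hierarchical Progressive Knowledge Distillation (a Hierarchical Feature Builder generates a fused spatial attention mask that selectively guides distillation toward discriminative regions) and Multi-Stage Knowledge Distillation (three distillation stages are activated progressively during training). By tightly integrating heterogeneous teachers with a meta-ensemble and adding spatial attention guidance, the method improves how efficiently the lightweight student absorbs knowledge and its final performance, offering a new approach to model compression and knowledge transfer in real-time vision tasks.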

Abstract: Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images, are used. Various CNN and transformer baselines are applied under standard preprocessing, online augmentation, Gaussian noise and motion blur robustness conditions. The proposed HumP-KD model distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder that generates a fused spatial attention mask to selectively guide distillation toward discriminative regions. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of $0.9876 \pm 0.0063$ across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation ($0.9537 \pm 0.0351$), with statistical significance confirmed by both independent t-test ($p = 0.0195$) and Wilcoxon signed-rank test ($W = 1$, $p = 0.0039$). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and 19.01Mb model size, representing a $5.7\times$ parameter reduction over Swin-Tiny and a $17.5\times$ reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.
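
For readers new to distillation, the following is a minimal sketch of the classic temperature-softened distillation loss, in which the student matches the teacher's softened class distribution. It is the generic textbook formulation, not HumP-KD's hierarchical, attention-masked, or multi-stage objective; the temperature value and the logits are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Illustrative two-class logits (fire / no fire); values are made up.
loss_far = kd_loss(student_logits=[0.0, 0.0], teacher_logits=[3.0, -1.0])
loss_near = kd_loss(student_logits=[2.5, -0.5], teacher_logits=[3.0, -1.0])
```

In practice this term is usually mixed with the ordinary cross-entropy on ground-truth labels; how HumP-KD weights its own terms is not stated in the abstract.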


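[52] ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning cs.CV | cs.AI | cs.CL [PDF]

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian

TL;DR: This paper introduces ClinHallu, a benchmark for diagnosing stage-wise hallucinations in the reasoning of medical multimodal large language models (MLLMs). It contains 7,031 validated instances, each with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration, and it uses stage-replacement interventions to measure how correcting a specific stage affects the final answer.

Details

Motivation: Existing medical hallucination benchmarks mainly focus on data collection and overlook where hallucinations originate within the reasoning process. Because hallucinations may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration, a fine-grained diagnostic tool is needed.

Result: The paper shows that trace-supervised fine-tuning reduces stage-wise hallucinations. The abstract reports no specific quantitative results (such as SOTA comparisons or benchmark scores).

Insight: The novelty lies in breaking hallucination diagnosis for medical MLLMs into three stages (visual recognition, knowledge recall, and reasoning integration) and introducing structured reasoning traces and stage-replacement interventions, which together give an interpretable framework for understanding and mitigating reasoning failures.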

Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.


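[53] Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control cs.CV | cs.GR | cs.RO [PDF]

Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht

TL;DR: This paper proposes Instruct-Particulate, a model that predicts the kinematic part segmentation and joint motion parameters of a 3D mesh from a given kinematic specification (part descriptions, connectivity, joint types, and so on). The specification can be generated automatically by large-scale vision-language models, and the model is trained on a heterogeneous dataset of more than 150,000 objects, which addresses the limited generalization caused by scarce annotated data.

Details

Motivation: Existing neural networks for reconstructing articulated 3D objects generalize poorly because annotated data is scarce. The paper introduces kinematic specifications to remove task ambiguity and to allow training on more abundant heterogeneous data, thereby improving generalization.

Result: Experiments show that the model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

Insight: The novelty lies in conditioning the model on a kinematic specification and using vision-language models to generate specifications automatically so that training data can be scaled up. This offers a new paradigm for improving generalization in 3D articulation understanding with heterogeneous data and high-level semantic prompts.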

Abstract: Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.


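[54] RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space cs.CV [PDF]

Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra

TL;DR: RepFusion is a new text-to-image generation method that uses a multimodal large language model (MLLM) as an encoder of noisy visual representations, providing the conditioning signal for a diffusion transformer. The method moves the generation target into a semantically structured visual representation space, which makes better use of pretrained LLM priors. Experiments show that, at similar inference budgets, RepFusion outperforms methods that train a denoiser from scratch.

Details

Motivation: In conventional text-to-image systems, the LLM is used only for text encoding, while denoising is handled by a newly trained generative backbone. The paper aims to exploit the strong multimodal priors of MLLMs by extending them to noisy inputs, improving the efficiency and quality of denoising.

Result: In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers, indicating that MLLMs provide strong priors for denoising visual representations.

Insight: The core idea is to repurpose the MLLM as a noisy representation encoder, extending the multimodal alignment mechanism from clean to noisy inputs. This lets the MLLM be reused repeatedly for conditioning at inference time, so test-time compute is spent more productively, and it represents a new paradigm for modern T2I systems.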

Abstract: Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.


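[55] Gaze Heads: How VLMs Look at What They Describe cs.CV | cs.CL | cs.LG [PDF]

Rohit Gandikota, David Bau

TL;DR: This paper finds that when a vision-language model (VLM) describes an image, a specific set of attention heads in its language-model backbone, called "gaze heads", attend to the image region the model is currently describing. The heads can be identified with a simple correlation score, using comic strips as a controlled testbed, and intervening on their attention steers the model at inference time to describe a chosen image region, without retraining.

Details

Motivation: How a VLM internally solves the task of describing an image is far from obvious. The paper sets out to identify and understand the specific internal mechanism that tracks image regions and links them to the language description.

Result: On the comic strip dataset, an attention-mask intervention on only the top-100 gaze heads (fewer than 9% of all attention heads) steers the model's answer to any chosen comic panel with 83.1% accuracy. The intervention also works on natural COCO images, and the mechanism recurs across VLM architectures from 2B to 32B parameters.

Insight: The novelty is the discovery of gaze heads, an internal mechanism that tracks the image region being described, together with a demonstration that targeted attention interventions grounded in mechanistic analysis give effective, continuous control over multimodal model behavior at inference time. This opens a new route to controllability and interpretability.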

Abstract: How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model’s answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
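
The abstract mentions "a simple correlation score" without defining it. One plausible reading, sketched below purely as an assumption, is to correlate each head's attention mass on every comic panel with an indicator of which panel is being described at that generation step. The function names, the per-panel attention layout, and the toy numbers are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def gaze_score(attn_to_panels, described_panel):
    """Correlate a head's per-panel attention with the described-panel indicator.

    attn_to_panels[t][k]: attention mass the head puts on panel k at step t.
    described_panel[t]: index of the panel being described at step t.
    """
    masses, indicators = [], []
    for step, row in enumerate(attn_to_panels):
        for panel, mass in enumerate(row):
            masses.append(mass)
            indicators.append(1.0 if panel == described_panel[step] else 0.0)
    return pearson(masses, indicators)


described = [0, 1, 2]  # the description moves through panels 0, 1, 2
tracking_head = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
fixed_head = [[0.5, 0.25, 0.25]] * 3  # always looks at panel 0
scores = {
    "tracking": gaze_score(tracking_head, described),
    "fixed": gaze_score(fixed_head, described),
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking heads by such a score and keeping the top ones would give a candidate set to intervene on; the paper's actual scoring and selection procedure may differ.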


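[56] OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains cs.CV [PDF]

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

TL;DR: This paper proposes an automated data engine for building the audio-visual question answering (QA) dataset OmniVideo-100K. Using two mechanisms, Entity-Anchored Video Scripting and Clue-Guided QA Generation, the engine converts videos into structured scripts and generates high-quality QA pairs. The resulting dataset markedly improves the performance of several multimodal models on audio-visual reasoning tasks.

Details

Motivation: Current automated pipelines for audio-visual QA typically split videos into short clips and process the audio and visual modalities separately, which severs the inherent link between sounds and their visual sources and leads to inconsistent descriptions across clips. Coupling long-text comprehension and QA synthesis into a single step also tends to confine the generated questions to localized events, so they lack long-term temporal connections and deep cross-modal reasoning.

Result: Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K improves performance on the human-verified test set OmniVideo-Test by up to 20.59%. The fine-tuned models also generalize well to existing benchmarks such as Daily-Omni and JointAVBench, with gains of up to 12.64%.

Insight: The novelty is an automated data generation pipeline built on Entity-Anchored Video Scripting (which keeps references consistent across segments and reconstructs audio-visual associations) and Clue-Guided QA Generation (which first mines cross-segment multimodal clues and then generates QA pairs from them). This offers a new paradigm for building high-quality, structured audio-visual datasets that support deep reasoning.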

Abstract: Current automated pipelines for audio-visual Question Answering (QA) generally adopt a “video-caption-QA” paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.


cs.MA [Back]

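[57] Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents cs.MA | cs.CV [PDF]

Seoyoung Choi, Minseok Ko, Hyunseok Lee, Kunwoong Kim, Woomin Song

TL;DR: This paper systematically studies the role of visual memory in graphical user interface (GUI) agents and finds that naive full-screenshot memory worsens certain types of errors. The authors propose Action-Grounded Visual Memory (AGMem), which stores image crops of the local GUI region closely related to a successful action or a recovery, rather than full screenshots. On the OSWorld benchmark, AGMem improves task success rate by 33.3% over full-image memory.

Details

Motivation: Although recent work has introduced screenshot-based visual memory to give GUI agents richer context, its actual effect is unclear: it is not known which failures it mitigates and which it exacerbates. The paper aims to analyze the role of visual memory in GUI agents systematically and to identify its strengths and weaknesses.

Result: Experiments on the OSWorld benchmark show that AGMem improves task success rate by 33.3% over full-image memory.

Insight: The paper contributes a taxonomy of GUI agent failure modes and, based on the finding that full-image memory reduces state-level failures but worsens action-level ones, designs an action-grounded visual memory representation (AGMem) that stores local image regions tied to specific actions, which is more effective than storing full screenshots. This offers a key architectural idea for building more reliable GUI agents.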

Abstract: Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates and which it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, increasing hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), a memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3% over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.
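
The core storage idea, keeping a crop around the acted-on location instead of the whole screenshot, can be illustrated with a small helper that computes the crop rectangle. This is only a sketch: the function name, the fixed 256x256 crop size, and the choice to shift the box so it stays on screen are assumptions, and the abstract does not say how AGMem selects its regions.

```python
def crop_box(x, y, screen_w, screen_h, crop_w=256, crop_h=256):
    """Return (left, top, right, bottom) of a crop centered on (x, y).

    The box is shifted, not shrunk, when the action point is near a
    screen edge, so every stored crop has the same size.
    """
    crop_w = min(crop_w, screen_w)
    crop_h = min(crop_h, screen_h)
    left = min(max(x - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(y - crop_h // 2, 0), screen_h - crop_h)
    return left, top, left + crop_w, top + crop_h


# A click in the middle of a 1920x1080 screen and one near the corner.
center_box = crop_box(960, 540, 1920, 1080)
corner_box = crop_box(10, 10, 1920, 1080)
```

The returned tuple follows the common (left, top, right, bottom) convention, so it could be handed to an image library's crop routine to extract the memory entry.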


cs.SD [Back]

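[58] Multimodal Speaker Identification in Classroom Environments cs.SD | cs.CL [PDF]

Michael L. Chrzan, Meghavarshini Krishnaswamy, Robert Gibboni, Katie Wetstone, Wei Ai

TL;DR: This study proposes a multimodal speaker identification framework for analyzing K-12 classrooms. It combines acoustic embeddings with semantic context extracted by a large language model (LLM) to address the poor performance of acoustic-only models under background noise and variable child speech. Experiments on a subset of the EDSI dataset show that the method substantially improves student identification accuracy.

Details

Motivation: Automated analysis of K-12 classrooms is hampered by background noise and the variability of child speech, which often cause acoustic-only models to fail. The study aims to make speaker identification in classrooms more robust and accurate by integrating semantic context, in support of equitable instruction at scale.

Result: On the EDSI dataset (8 math classrooms, 2,801 utterances), the acoustic-only baseline (ECAPA-TDNN) achieves only 39.0% accuracy. The proposed multimodal approach, which combines transcript-based "contextual anchoring" with a gradient boosting classifier, raises student identification accuracy to 50.3%. For utterances longer than 5 seconds, accuracy reaches 76.9% (baseline 64.9%), with 90.9% Top-3 accuracy. The model also distinguishes teacher from student roles with 99.3% accuracy.

Insight: The main novelty is fusing LLM-extracted semantic context, used as an "anchor", with acoustic features in a multimodal identification framework. This suggests a way to achieve robust speaker identification in acoustically difficult settings such as classrooms: use semantic information to constrain and strengthen the discriminative power of the acoustic model. The approach is a key step toward automated feedback systems that account for individual student participation.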

Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based “contextual anchoring” into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.
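
Because the abstract reports both top-1 and Top-3 accuracy, it may help to see how the two metrics differ. The sketch below computes top-k accuracy from per-speaker scores; the score matrix and labels are made-up toy values, not EDSI data, and the scores stand in for whatever the classifier outputs.

```python
def top_k_accuracy(scores, labels, k=3):
    """Fraction of utterances whose true speaker is among the k highest scores.

    scores[i][j]: model score for speaker j on utterance i.
    labels[i]: index of the true speaker of utterance i.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Three toy utterances scored against four candidate speakers.
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
labels = [1, 2, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
```

Top-3 accuracy is always at least as high as top-1, which is why a 90.9% Top-3 figure can sit alongside a 76.9% top-1 figure for the same utterances.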


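[59] Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources cs.SD | cs.AI | cs.CL [PDF]

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka

TL;DR: This paper introduces the ST-AudioQA dataset and the ST-AudioLM model to address the weakness of existing audio-language models in spatio-temporal reasoning. It builds scenes from first-order ambisonic renderings of static and moving sound sources, and designs a time-resolved audio encoder that learns event semantics jointly with source trajectories and is connected to a large language model for question answering.

Details

Motivation: Existing audio-language models usually reason about an audio clip as global event content, while sound event localization models track source directions over time but offer limited semantic coverage. Neither handles the semantics, location, and trajectory of sound events together.

Result: Experiments show that the method improves the semantic-localization tradeoff over static spatial and localization-oriented baselines and achieves stronger reasoning performance on the ST-AudioQA benchmark.

Insight: The novelty lies in building a dataset with dense trajectory supervision for spatio-temporal audio QA, and in a time-resolved audio encoder architecture that learns event semantics jointly with source trajectories. Connecting the resulting audio tokens to a large language model enables joint reasoning about what is sounding, where it is, and how it moves.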

Abstract: Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.
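
As background on the input format, a first-order ambisonic (FOA) signal encodes a mono source into four channels whose gains depend on the source direction. The sketch below uses the widely used AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization, azimuth measured counter-clockwise from the front). Whether ST-AudioQA uses this exact convention is not stated in the abstract, so treat the channel order and normalization as assumptions.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into FOA channels (W, Y, Z, X), AmbiX-style.

    Azimuth is counter-clockwise from straight ahead, so positive Y
    points to the listener's left; elevation is upward from the horizon.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    x = sample * math.cos(az) * math.cos(el)  # front-back
    return w, y, z, x


# A source that moves from the front (0 degrees) to the left (90 degrees).
frames = [encode_foa(1.0, az, 0.0) for az in (0.0, 45.0, 90.0)]
```

As the azimuth sweeps from 0 to 90 degrees, energy shifts from the X channel to the Y channel while W stays constant, which is the kind of cue a trajectory-aware encoder can pick up.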


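[60] FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision cs.SD | cs.CV [PDF]

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng

TL;DR: This paper presents FoleyGenEx, a unified video-to-audio generation framework that integrates multi-modal control, frame-level temporal alignment, and fine-grained semantics to produce synchronized, versatile audio for diverse tasks. With a conditional injection mechanism, a multi-modal dynamic masking strategy, and an adverb-based data augmentation algorithm, it addresses the shortcomings of existing methods in temporal alignment, reference audio conditioning, and semantic precision.

Details

Motivation: Existing video-to-audio methods either offer multi-modal control with weak temporal alignment, or offer strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx aims to fill this gap with audio synthesis that is synchronized, controllable, and semantically precise.

Result: Experiments on AudioCaps, VGGSound, and Greatest Hits show that FoleyGenEx is competitive with existing methods in controllable video-to-audio generation.

Insight: The contributions are a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy that preserves synchronization during training, and an adverb-based data augmentation algorithm that uses signal processing and large language models to enrich textual supervision with nuanced semantics. Together these improve temporal alignment and semantic precision and offer new ideas for multi-modal generation.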

Abstract: We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.


cs.RO [Back]

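[61] $μ_0$: A Scalable 3D Interaction-Trace World Model cs.RO | cs.CV | cs.LG [PDF]

Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh

TL;DR: This paper presents $μ_0$, a scalable world model that forecasts the motion of interaction points as 3D traces rather than predicting dense pixels or directly modeling actions. It is pretrained with 3D supervision extracted automatically by the TraceExtract system, which yields a compact, embodiment-agnostic motion interface. Experiments show that the model outperforms baselines on 2D and 3D trace prediction, and that the frozen model can be paired with action experts to reach performance on downstream robot tasks that is competitive with models pretrained with action supervision, such as $π_0$.

Details

Motivation: Existing world models are limited in scalability: pixel-space video models spend too much capacity on dense appearance reconstruction, while direct action models depend on embodiment-specific action labels. The paper aims to build a world model that is scalable and embodiment-agnostic.

Result: Experiments show that $μ_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized vision-language model (VLM) methods. On downstream robot manipulation tasks, policies conditioned on $μ_0$ traces perform competitively with vision-language-action models pretrained with action supervision, such as $π_0$.

Insight: The core idea is to use 3D interaction traces as a compact, transferable representation for a world model, and to extract 3D supervision from diverse videos with the automated TraceExtract system, which enables pretraining without action labels. This offers a scalable new paradigm for cross-embodiment manipulation learning.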

Abstract: World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $μ_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $μ_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $μ_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $μ_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $μ_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $π_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
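
The abstract says each query is represented via B-spline control points but gives no degree or knot vector. Assuming a uniform cubic B-spline purely for illustration, the sketch below turns a short list of 3D control points into a smooth sampled trace. The function names, the control point values, and the sampling density are hypothetical.

```python
def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_trace(control_points, samples_per_segment=4):
    """Sample a smooth trace from at least four control points."""
    trace = []
    for i in range(len(control_points) - 3):
        segment = control_points[i:i + 4]
        for s in range(samples_per_segment):
            trace.append(cubic_bspline_point(*segment, s / samples_per_segment))
    # Close the curve with the end of the last segment.
    trace.append(cubic_bspline_point(*control_points[-4:], 1.0))
    return trace


# Five made-up 3D control points for one interaction point (x, y, z).
control_points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (2.0, 0.0, 0.5),
    (3.0, 0.0, 0.5),
    (4.0, 0.0, 0.0),
]
trace = sample_trace(control_points)
```

A handful of control points thus stands in for a dense trajectory, which is one reason such a parameterization is compact; note that a uniform B-spline smooths between the control points rather than passing through them.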


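[62] PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation cs.RO | cs.CV | cs.LG [PDF]

Namai Chandra, Shriram Damodaran, Lin Wang

TL;DR: This paper proposes PhysVLA, a plug-and-play inference-time framework that gives any frozen vision-language-action (VLA) model a physical grounding. It uses a dual-layered correction (a phase-aware finite-state machine and a selective Euler-Lagrange gate) to enforce physical constraints such as rigid-body dynamics, compensating for the lack of physical consistency in VLA models without retraining or fine-tuning the backbone.

Details

Motivation: Existing VLA models are trained mainly to fit behavioural demonstration data and do not explicitly enforce basic physical principles such as rigid-body dynamics or contact constraints, which leads to a tradeoff between trajectory quality and failure rate. The paper aims to bridge this physics gap.

Result: On the LIBERO-Spatial benchmark with a 7-DoF Franka Panda and several VLA backbones, PhysVLA improves absolute success rate by up to 17%, stability by up to 19%, and trajectory efficiency by up to 15%, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. On a pick-and-place task with a real Agilex Piper arm, success rate improves by up to 50%.

Insight: The novelty is to design physical awareness as a composable, backbone-agnostic runtime module that intercepts the predicted control action and applies physics-based corrections, substantially improving physical consistency and task performance without modifying or retraining the backbone. This provides a lightweight, pluggable way to strengthen the physical grounding of embodied models.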

Abstract: Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.
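
The dual-layered correction described above can be pictured as a thin wrapper around the backbone's output: a small state machine tracks the task phase, and a gate touches the action only when a consistency check fails. The sketch below is a toy illustration under strong assumptions. The phase transition conditions, the tolerance, and especially the gate (a per-step change limit standing in for the paper's Euler-Lagrange gate and dynamics oracle) are hypothetical and are not the authors' implementation.

```python
from enum import Enum
from typing import List, Sequence


class Phase(Enum):
    APPROACH = 0
    GRASP = 1
    TRANSPORT = 2
    PLACE = 3


def next_phase(phase: Phase, dist_to_object: float, gripper_closed: bool,
               dist_to_goal: float, tol: float = 0.02) -> Phase:
    """Advance the task phase from simple state checks (hypothetical conditions)."""
    if phase is Phase.APPROACH and dist_to_object < tol:
        return Phase.GRASP
    if phase is Phase.GRASP and gripper_closed:
        return Phase.TRANSPORT
    if phase is Phase.TRANSPORT and dist_to_goal < tol:
        return Phase.PLACE
    return phase


def gate_action(action: Sequence[float], prev_action: Sequence[float],
                max_delta: float) -> List[float]:
    """Stand-in gate: correct the action only if it changes too fast between steps."""
    inconsistent = any(abs(a - p) > max_delta for a, p in zip(action, prev_action))
    if not inconsistent:
        return list(action)  # consistent actions pass through untouched
    # Clamp each component's change to +/- max_delta around the previous action.
    return [p + max(-max_delta, min(max_delta, a - p))
            for a, p in zip(action, prev_action)]
```

The point of the selective design is visible even in this toy: actions that already satisfy the check are returned unchanged, so the wrapper adds correction only where the check flags a problem.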


cs.AI [Back]

[63] UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems cs.AI | cs.CLPDF

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

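TL;DR: This paper proposes User Portrait based Nested Rollout Policy Adaptation (UP-NRPA), an online framework that uses large language models (LLMs) for planning in goal-oriented dialogue systems. It targets the difficulty existing dialogue policy planning methods have in adapting dynamically to diverse user characteristics: using real-time user feedback together with the personality, preferences, and objectives mapped from the current user portrait, it customizes dialogue strategies dynamically without offline reinforcement learning training.

Details

Motivation: Current dialogue policy planning methods struggle to adapt dynamically to the characteristics of different users and typically rely on model training and offline reinforcement learning policy models built for user groups. The paper aims to use an online adaptive mechanism so that the dialogue system can adjust its strategy to the real-time user portrait and better meet individual needs.

Result: On collaborative and non-collaborative dialogue benchmarks, UP-NRPA shows considerable benefits, reaching a 100% success rate on multiple dialogue tasks. In negotiation tasks in particular, the sale-to-list ratio (SL) increases by 56.41%, indicating that the method adapts effectively to diverse user needs.

Insight: The novelty lies in combining user portraits with Nested Rollout Policy Adaptation (NRPA) and using LLMs to customize strategies online without offline training. This gives dialogue systems a flexible, personalized planning method that adapts in real time to the characteristics and goals of different users and reduces reliance on pretrained models.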

Abstract: To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving an impressive 100% success rate in multiple dialogue tasks. Particularly in negotiation tasks, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.
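
For readers unfamiliar with the search procedure behind the name, the sketch below shows plain Nested Rollout Policy Adaptation on a toy sequence problem: each level repeatedly calls the level below and shifts its policy toward the best sequence seen so far. It deliberately omits the user-portrait and LLM components, and the action set, episode length, and scoring function are hypothetical placeholders rather than anything from the paper.

```python
import math
import random

MOVES = (0, 1, 2)          # toy action set (placeholder for dialogue strategies)
LENGTH = 5                 # toy episode length
TARGET = (2, 0, 1, 2, 0)   # toy "ideal" sequence, used only for scoring


def score(seq):
    """Toy reward: number of positions that match TARGET."""
    return sum(1 for a, b in zip(seq, TARGET) if a == b)


def playout(policy, rng):
    """Sample one sequence with softmax probabilities over the policy weights."""
    seq = []
    for step in range(LENGTH):
        weights = [math.exp(policy.get((step, m), 0.0)) for m in MOVES]
        seq.append(rng.choices(MOVES, weights=weights)[0])
    return score(seq), seq


def adapt(policy, seq, alpha=1.0):
    """Return a new policy shifted toward the given sequence."""
    new_policy = dict(policy)
    for step, move in enumerate(seq):
        weights = [math.exp(policy.get((step, m), 0.0)) for m in MOVES]
        total = sum(weights)
        # Lower every move in proportion to its probability, then raise the chosen one.
        for m, w in zip(MOVES, weights):
            new_policy[(step, m)] = new_policy.get((step, m), 0.0) - alpha * w / total
        new_policy[(step, move)] += alpha
    return new_policy


def nrpa(level, policy, rng, iterations=10):
    """Nested search: level 0 is a playout, higher levels adapt toward the best result."""
    if level == 0:
        return playout(policy, rng)
    best_score, best_seq = -1, []
    for _ in range(iterations):
        result_score, result_seq = nrpa(level - 1, policy, rng, iterations)
        if result_score >= best_score:
            best_score, best_seq = result_score, result_seq
        policy = adapt(policy, best_seq)
    return best_score, best_seq


best_score, best_seq = nrpa(2, {}, random.Random(0), iterations=5)
```

Because the policy is adapted during search rather than trained beforehand, the procedure needs no offline learning phase, which is the property the abstract emphasizes.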


[64] Orchestra-o1: Omnimodal Agent Orchestration cs.AI | cs.CL | cs.CVPDF

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu

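TL;DR: This paper proposes Orchestra-o1, an omnimodal agent orchestration framework for task decomposition and collaboration in settings where heterogeneous modalities (such as text, image, audio, and video) coexist and interact. The framework introduces a unified orchestration mechanism that supports modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution, and it is trained efficiently with decision-aligned group relative policy optimization (DA-GRPO).

Details

Motivation: Existing agent orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to complex settings where heterogeneous modalities coexist and interact. They fall short in particular on omnimodal tasks that require unified understanding and coordination of diverse inputs.

Result: On the OmniGAIA benchmark, Orchestra-o1 surpasses the second-best approach by 10.3% accuracy. Its 8B-parameter version, trained with DA-GRPO, achieves state-of-the-art performance among all existing open-source omnimodal agents.

Insight: The novelty is a unified orchestration mechanism that supports all modalities and enables modality-aware task decomposition with parallel execution. In addition, DA-GRPO, as an efficient agentic reinforcement learning approach, offers a new direction for training large-scale multimodal collaborative systems.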

Abstract: The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.


[65] Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs cs.AI | cs.CLPDF

Pratham Singla, Shivank Garg, Vihan Singh

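TL;DR: The paper introduces Poker Arena, a no-limit Texas Hold'em tournament platform for evaluating the strategic reasoning and memory of large language models under uncertainty. The platform pairs a three-layer memory architecture with a nine-axis cognitive profile that decomposes strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. From large-scale testing of seven frontier models and a memory ablation, the study finds that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, and that cross-dimensional consistency matters more than peak performance on any single axis.

Details

Motivation: Existing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs insufficiently examined. The paper aims to evaluate the strategic reasoning and memory of LLMs in depth through multi-axis profiling.

Result: Across 50 sessions (1,000 hands each) and a controlled memory ablation, Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score. Persistent memory helps some models and hurts others.

Insight: The novelty is a multi-axis evaluation platform that combines a three-layer memory architecture with a nine-axis cognitive profile, decomposing strategic reasoning into interpretable dimensions. Viewed objectively, the approach reveals model capability structure more fully and stresses the importance of cross-dimensional consistency, going beyond the limits of traditional scalar evaluation.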

Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold’em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.


[66] GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge cs.AI | cs.CL | cs.LGPDF

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

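TL;DR: This paper proposes GitOfThoughts, a method that stores the reasoning tree of a large language model (LLM) agent as a Git repository, making reasoning replayable, auditable, and mergeable. It also runs systematic experiments on how different memory formats affect accuracy on new problems, and finds that memory helps only when a near-duplicate case is retrieved and that the gain comes from answer retrieval rather than method transfer.

Details

Motivation: LLM reasoning today is ephemeral and lacks version control, so it cannot be audited, compared, or merged the way other complex software processes such as code and data can. The paper aims to address this gap and to test whether memory mechanisms actually improve accuracy on new problems.

Result: Across five memory formats (none, Markdown, vector, graph, Git), two benchmarks, two model scales, and pre-registered replications, no memory format reliably improves accuracy on novel problems. Accuracy rises sharply only when the retrieved case is highly similar to the current problem (similarity >~ 0.8).

Insight: The core novelty is versioning an agent's reasoning process as a Git repository, which makes reasoning auditable and mergeable. A key empirical finding is that memory helps on new problems only above a "copyability threshold", with the gain coming mainly from direct answer retrieval rather than transfer of abstract methods. This challenges the common assumption that memory promotes generalization.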

Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent’s reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is “git log” over the agent’s own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
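
One practical reading of the copyability threshold is as a retrieval gate: a stored case is handed to the model only when it is nearly identical to the current problem. The sketch below illustrates that idea under assumptions. Cosine similarity over embedding vectors, the list-of-pairs memory layout, and the function names are hypothetical; only the approximate 0.8 cutoff comes from the abstract.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_if_copyable(query_vec: Sequence[float],
                         memory: List[Tuple[Sequence[float], str]],
                         threshold: float = 0.8) -> Optional[str]:
    """Return the most similar stored case only if it clears the threshold."""
    if not memory:
        return None
    best_vec, best_case = max(memory, key=lambda item: cosine(query_vec, item[0]))
    if cosine(query_vec, best_vec) < threshold:
        return None  # below the threshold the abstract reports no benefit
    return best_case


# Hypothetical two-case memory with toy 2-D embeddings.
memory = [([1.0, 0.0], "worked example A"), ([0.0, 1.0], "worked example B")]
near_duplicate = retrieve_if_copyable([0.9, 0.1], memory)
novel_problem = retrieve_if_copyable([1.0, 1.0], memory)
```

Returning nothing for the novel query reflects the reported finding that sub-threshold retrievals do not help.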


[67] Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows cs.AI | cs.CLPDF

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia

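TL;DR: This paper proposes Parallel-Synthesis, a framework that addresses the mismatch between the sequential text interface of large language models used as agent execution engines and modern structured parallel workflows. Through a plug-and-play system, it lets a synthesizer directly consume the KV caches produced by parallel worker agents, avoiding the redundant computation and loss of structural information caused by text concatenation.

Details

Motivation: When current LLM-based agent systems handle parallel branches (such as exploring subtasks, retrieving evidence, or generating candidate solutions), they usually perform the final synthesis by concatenating the text output of each branch, which discards the parallel structure and incurs redundant prefill computation.

Result: Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x.

Insight: The core novelty is an interface that synthesizes directly from KV caches. With a cache mapper and a fine-tuned synthesizer adapter, it achieves efficient, structured synthesis over parallel branch outputs and gives agent workflows a more native and efficient synthesis mechanism.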

Abstract: Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.


cs.IR [Back]

[68] ADORE: Iterative Query Expansion with Retrieval-Grounded Relevance Feedback cs.IR | cs.CLPDF

Amin Bigdeli, Negar Arabzadeh, Radin Hamidi Rad, Sajad Ebrahimi, Charles L. A. Clarke

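TL;DR: This paper proposes ADORE, an iterative query expansion framework grounded in retrieval feedback. It combines an LLM that generates pseudo-passages, a retriever that exposes the corpus response, and a relevance assessor that judges the retrieved results, and it adjusts the expansion strategy dynamically to improve retrieval effectiveness.

Details

Motivation: Existing LLM-driven query expansion methods are mostly generation-driven and do not check how the target corpus responds, which can cause retrieval drift, amplify misleading vocabulary, or miss discriminative terms. A retrieval-grounded feedback mechanism is therefore needed to make expansion more effective.

Result: On the TREC Deep Learning, BEIR, and BRIGHT benchmarks, ADORE clearly outperforms existing query expansion baselines. On BEIR it improves average nDCG@10 by 24.5% over BM25 and by 3.6% over the strongest baseline; on BRIGHT it improves by 122.9% over BM25 and by 9.2% over the best baseline.

Insight: The novelty is turning retrieval outcomes into feedback and using an iterative mechanism to evaluate and adjust the expansion dynamically, yielding retrieval-driven query optimization. This suggests a new direction for interactive methods that combine generation and retrieval.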

Abstract: LLM-based query expansion improves retrieval by enriching the original query with additional context. Yet most methods remain generation-driven, producing plausible pseudo-documents or expansions without checking how the target corpus responds. This can introduce retrieval drift, amplify misleading vocabulary, or miss terms that distinguish relevant from non-relevant documents. We argue that effective expansion requires retrieval-grounded feedback, not just single-pass generation or unverified iteration. We introduce ADORE (ADapt, Observe, Relevance Evaluate), an iterative framework that turns retrieval outcomes into feedback for the next expansion. At each round, an LLM generates pseudo-passages, a retriever exposes the corpus response, and a relevance assessor evaluates retrieved documents against the original query. These judgments identify what to reinforce, what remains undercovered, and what to suppress. Across TREC Deep Learning, BEIR, and BRIGHT, ADORE consistently outperforms strong query expansion baselines with notable improvements across nearly all evaluation settings, improving average nDCG@10 by 24.5% over BM25 and 3.6% over the strongest prior query expansion method on BEIR, and by 122.9% over BM25 and 9.2% over the best query expansion baseline on BRIGHT. Our code and data are publicly available.
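
The round structure described in the abstract (generate pseudo-passages, retrieve, judge, feed the judgments forward) can be sketched as a short loop over injected components. Everything named here is hypothetical: `generate`, `retrieve`, and `assess` stand in for the LLM, the retriever, and the relevance assessor, and the feedback dictionary tracks only "reinforce" and "suppress", leaving out the "undercovered" signal the abstract also mentions.

```python
from typing import Callable, Dict, List, Tuple

Feedback = Dict[str, List[str]]


def iterative_expansion(query: str,
                        generate: Callable[[str, Feedback], List[str]],
                        retrieve: Callable[[str], List[str]],
                        assess: Callable[[str, str], bool],
                        rounds: int = 3) -> Tuple[str, Feedback]:
    """Expand a query over several rounds, using judged results as feedback."""
    feedback: Feedback = {"reinforce": [], "suppress": []}
    expanded = query
    for _ in range(rounds):
        # 1. The generator sees the original query plus the previous round's feedback.
        pseudo_passages = generate(query, feedback)
        expanded = query + " " + " ".join(pseudo_passages)
        # 2. The retriever exposes how the corpus responds to the expanded query.
        docs = retrieve(expanded)
        # 3. Retrieved documents are judged against the ORIGINAL query, not the expansion.
        judged = [(doc, assess(query, doc)) for doc in docs]
        feedback = {
            "reinforce": [doc for doc, relevant in judged if relevant],
            "suppress": [doc for doc, relevant in judged if not relevant],
        }
    return expanded, feedback


# Toy stand-ins so the loop can run without an LLM or a search index.
corpus = ["solar panel efficiency report", "lunar eclipse calendar"]
expanded, feedback = iterative_expansion(
    "solar",
    generate=lambda q, fb: ["photovoltaic"],
    retrieve=lambda text: corpus,
    assess=lambda q, doc: "solar" in doc,
    rounds=2,
)
```

Judging against the original query rather than the expanded one is what keeps the loop from reinforcing its own drift.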


[69] CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search cs.IR | cs.CLPDF

Yilin Wen, Rong Yang, Xiaojia Chang, Hong Sun, Gefu Tang

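TL;DR: This paper presents CoRe, an LLM-based query rewriter for multi-stage context-aware relevance in a large-scale short-video search engine. The system is continuously reward-finetuned and redeployed weekly to cope with data drift, and it uses a semi-online Mixed Preference Optimization loop to reduce training cost.

Details

Motivation: LLM query rewriters in production face two problems: the training reward does not match how the production ranker consumes the rewrite, and training is too expensive to support continuous redeployment.

Result: In two sequential production A/B tests, first deploying the rewriter at the finerank stage and then extending it to recall and rawrank, the system delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all relevance and engagement metrics moving in the expected direction.

Insight: The novelties include using the production multimodal relevance model as the reward source, adopting a multiplicative ratio form that mirrors the production fusion algebra, and a semi-online Mixed Preference Optimization loop that lowers training cost through a DPO-style pairwise objective and a phase structure. The system also includes an automated promotion gate that detects and recovers from reward-hacking incidents, and it limits the blast radius of failures by consuming the rewriter output as parallel signals.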

Abstract: LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), such a system, redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, bounding rewriter-failure blast radius. Online A/B from two sequential production launches, first deploying the rewriter at finerank, then extending consumption to recall and rawrank, delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.
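
To make the cost argument concrete, the sketch below shows one way a DPO-style pairwise objective could be restricted to a top-k/bottom-k subset of sampled rewrites: rank samples by reward, pair the best with the worst, and compute the standard DPO loss only on those pairs. The pairing scheme, the `beta` value, and the function names are assumptions; the abstract does not describe how pairs are formed.

```python
import math
from typing import List, Tuple

Sample = Tuple[str, float]  # (rewritten query, reward)


def select_pairs(samples: List[Sample], k: int) -> List[Tuple[Sample, Sample]]:
    """Keep only the k highest- and k lowest-reward samples, paired best-with-worst."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    k = min(k, len(ranked) // 2)  # never let the two subsets overlap
    top = ranked[:k]
    bottom = ranked[len(ranked) - k:]
    return list(zip(top, reversed(bottom)))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical sampled rewrites with rewards from a relevance model.
samples = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7)]
pairs = select_pairs(samples, k=1)
```

With four samples and k=1, only one pair reaches the gradient pass, which is the source of the savings: most sampled trajectories are scored but never backpropagated through.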


cs.LG [Back]

[70] SuperThoughts: Reasoning Tokens in Superposition cs.LG | cs.AI | cs.CLPDF

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao

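TL;DR: This paper proposes SuperThoughts, which compresses two consecutive chain-of-thought (CoT) reasoning tokens into a single latent representation and uses a lightweight Multi-Token Prediction (MTP) module to decode two tokens per step at inference time, doubling inference throughput while keeping discrete token supervision at training time. The method achieves roughly 20-30% CoT length reduction on several math reasoning benchmarks with minimal accuracy loss.

Details

Motivation: Long chain-of-thought reasoning is computationally expensive because tokens are generated sequentially. Existing approaches that reason in continuous latent spaces lack a supervision signal, which makes training unstable and hard to scale to complex, long-horizon tasks. The paper addresses both problems.

Result: Qwen2.5-Math models are finetuned and evaluated on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, accuracy drops by only 1-2 points on most tasks while CoT length is reduced by roughly 20-30%.

Insight: The core novelty is combining discrete token supervision with latent-space compression, using pairwise token compression and an MTP module to speed up reasoning. The adaptive fallback mechanism preserves reliability and offers a new route to efficient and stable long-sequence reasoning.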

Abstract: Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to a lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct, and evaluate them on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves $\sim$20–30% CoT length reduction while maintaining accuracy with minimal degradation (a 1-2 point accuracy drop on most tasks).
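
The confidence-based fallback can be sketched as a decoding loop that accepts a two-token proposal when it is confident and otherwise emits a single token the standard way. This is a toy control-flow illustration only: `propose_pair` and `decode_single` are hypothetical callables standing in for the MTP module and the ordinary decoder, and the scalar confidence and the 0.9 threshold are assumptions not stated in the abstract.

```python
from typing import Callable, List, Tuple


def adaptive_decode(propose_pair: Callable[[List[int]], Tuple[Tuple[int, int], float]],
                    decode_single: Callable[[List[int]], int],
                    max_tokens: int,
                    threshold: float = 0.9) -> Tuple[List[int], int]:
    """Decode up to max_tokens, emitting two tokens per step when confident."""
    tokens: List[int] = []
    steps = 0
    while len(tokens) < max_tokens:
        (first, second), confidence = propose_pair(tokens)
        if confidence >= threshold and len(tokens) + 2 <= max_tokens:
            tokens.extend([first, second])        # fast path: two tokens in one step
        else:
            tokens.append(decode_single(tokens))  # fallback: standard one-token step
        steps += 1
    return tokens, steps


# Toy stand-ins: confident for the first four tokens, uncertain afterwards.
def toy_propose(prefix: List[int]) -> Tuple[Tuple[int, int], float]:
    n = len(prefix)
    return (n, n + 1), (0.95 if n < 4 else 0.5)


def toy_single(prefix: List[int]) -> int:
    return len(prefix)


tokens, steps = adaptive_decode(toy_propose, toy_single, max_tokens=6)
```

In this toy run, six tokens take four steps instead of six, which shows how the realized step saving depends on how often the fast path is taken.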


[71] Non-Parametric Machine Text Detection via Multi-View Gaussian Processes cs.LG | cs.CLPDF

Aleem Khan, Nicholas Andrews

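TL;DR: This paper proposes a multi-view, non-parametric detection framework that aggregates complementary feature views of a document (such as stylistic features, likelihood and rank-order features, and structural features) through a Gaussian process ensemble, improving the robustness of machine text detection under adversarial conditions such as paraphrasing and targeted style transfer. The Gaussian process provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting reliable deployment in high-stakes settings.

Details

Motivation: Existing machine text detectors lose accuracy sharply under adversarial conditions such as paraphrasing and style transfer, yet a document carries multiple complementary signals (stylistic, likelihood, and structural features) that a single attack is unlikely to suppress all at once. Parametric classifiers are prone to confidently incorrect predictions under distribution shift, so a more robust detection method is needed.

Result: On several benchmarks, including DetectRL, RAID, and the PAN2025 shared task, the method maintains strong performance across diverse generators and attacks and outperforms existing approaches on held-out attacks.

Insight: The novelty is a multi-view, non-parametric framework that aggregates independent detection axes through a Gaussian process ensemble, forcing an attacker to defeat several feature dimensions simultaneously and substantially raising the cost of evasion. The calibrated probabilities and out-of-distribution abstention provided by the Gaussian process improve deployment reliability and point to a new use of non-parametric methods in adversarial text detection.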

Abstract: Adversarial conditions such as paraphrasing and targeted style transfer sharply degrade the accuracy of machine text detectors. A document, however, carries multiple complementary signals (e.g., stylistic features, likelihood and rank-order features, and structural features), and an attack that suppresses one may leave others intact. While a parametric classifier can learn to combine these features given sufficient supervision, classifiers are prone to making confidently incorrect predictions when the distribution shifts (e.g., novel attacks or unseen language models). To address this, we propose a multi-view, non-parametric detection framework that extracts complementary feature views from the same document and aggregates per-view evidence through a Gaussian process ensemble. By aggregating evidence across views, an adversary must simultaneously defeat multiple independent axes of detection, substantially raising the cost of evasion. The Gaussian process formulation additionally provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting reliable deployment in high-stakes settings. We evaluate on three benchmarks spanning diverse generators and attacks (the DetectRL and RAID benchmarks and the PAN2025 shared task) and demonstrate that our multi-view detector maintains strong performance under the considered attacks, outperforming existing approaches against held-out attacks.


[72] Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs cs.LG | cs.CVPDF

Sirui Zhang, Xu Wang, Zhengyu Wu, Xunkai Li, Hongchao Qin

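TL;DR: This paper proposes CoMAG, a unified backbone for Multimodal Attributed Graphs (MAGs) that jointly serves graph-structure prediction, cross-modal matching, and graph-conditioned generation. Through Reliable Context Learning and Modality-preserving Hop-token Alignment, it learns task-adaptive reliable contexts and preserves modality-specific information within them, overcoming the task-agnostic propagation and over-compressed fusion of existing methods.

Details

Motivation: Existing MAG methods usually rely on fixed graph contexts or uniformly fused representations, which causes task-agnostic propagation and over-compressed fusion. This hinders meeting diverse task requirements and preserving modality-specific evidence. The paper aims to solve this problem.

Result: Experiments on nine OpenMAG datasets show that CoMAG outperforms feature-only, graph-only, multimodal, and unified MAG baselines on graph-level prediction, modality matching, and graph-conditioned generation, achieving the best reported performance.

Insight: The novelty is in two components. Task-adaptive Reliable Context Learning estimates edge reliability from multimodal semantic consistency, supplements the topology with semantic neighbors, and selects context components through a task-aware gate. Modality-preserving Hop-token Alignment maintains modality-specific multi-hop trajectories, matches hop tokens across modalities, and decouples shared and private representations. Together they produce graph and modality representations in a single forward pass while retaining sparse edge-linear complexity.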

Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative representations, and modality-centric tasks requiring fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations, causing task-agnostic propagation and over-compressed fusion that hinder diverse task requirements and modality-specific evidence preservation. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, complementing raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and decoupling shared and private representations. Thus, CoMAG produces graph and modality representations from one forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. Results show that CoMAG achieves the best reported performance, demonstrating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.


eess.IV [Back]

[73] High-Fidelity Video Compression based on Invertible Neural Transform and Implicit Conditioning eess.IV | cs.CV | cs.MMPDF

Siyue Teng, Ho Man Kwan, Yuxuan Jiang, Fan Zhang, David Bull

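TL;DR: This paper proposes InnVC, a high-fidelity video compression method based on an invertible neural transform and implicit conditioning. By preserving an invertible main transform path before quantization and injecting a compact implicit conditioning field, it decouples strongly correlated video content from harder-to-model details and compresses more efficiently. Experiments show that on the UVG and MCL-JCV benchmarks, especially in the high-quality regime, InnVC achieves substantial bitrate savings relative to x265.

Details

Motivation: Most existing learning-based video compression methods rely on non-invertible analysis-synthesis transforms, so reconstruction quality is limited by both quantization error and transform approximation error. This is especially restrictive at high quality points, where quantization error is small and transform-induced distortion dominates.

Result: On the UVG and MCL-JCV benchmarks, InnVC shows strong compression performance over a broad quality range, particularly in the high-quality regime. Relative to x265 on UVG, it achieves BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM.

Insight: The core novelty is using an invertible neural transform as the main path and combining it with an implicit conditioning field for content-adaptive modeling, which decouples information and enables complementary reconstruction. The proposed scheduled masking strategy further improves entropy coding efficiency by progressively concentrating informative content into fewer latent channels. The authors describe it as the first neural video codec to cover operating points from low bitrate to high fidelity within a single architecture scale.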

Abstract: Learning-based video compression has recently achieved competitive rate-distortion performance compared to conventional video codecs. However, most existing methods rely on non-invertible analysis-synthesis transforms, with reconstruction quality subject to both quantization and transform approximation errors. This limitation becomes particularly restrictive at higher quality points, where quantization errors are small and transform-induced distortion dominates. To address this, we propose InnVC, an Invertible neural network based Video Codec for wide-range and high-fidelity compression. The core idea is to preserve an invertible main transform path prior to quantization, while injecting content-adaptive context through a compact implicit conditioning field. This decouples strongly correlated video content from harder-to-model fine details, allowing different components to specialize in complementary reconstruction tasks for more efficient compression. To further improve compressibility, we introduce a scheduled masking strategy that progressively concentrates informative content into fewer latent channels for more effective entropy coding. Experiments on the UVG and MCL-JCV benchmarks show that InnVC achieves strong compression performance over a broad quality range, being particularly effective in the high-quality regime, yielding BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM relative to x265 on UVG. To the best of our knowledge, InnVC is the first neural video codec to cover operating points from low bitrate to high fidelity within a single architecture scale, spanning more than 20 dB in PSNR.
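
The argument that an invertible transform removes transform approximation error can be seen in a very small example. The sketch below uses an additive coupling layer, a common building block of invertible networks, as a stand-in; the abstract does not say which invertible construction InnVC uses, so the layer type and the `shift` function are assumptions chosen purely for illustration.

```python
import math
from typing import List


def shift(x1: List[float]) -> List[float]:
    """Any function of the first half works here; it does not need to be invertible."""
    return [2.0 * math.tanh(v) for v in x1]


def forward(x: List[float]) -> List[float]:
    """Additive coupling: keep the first half, shift the second half by f(first half)."""
    if len(x) % 2:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    y2 = [b + s for b, s in zip(x2, shift(x1))]
    return x1 + y2


def inverse(y: List[float]) -> List[float]:
    """Exact inverse: the first half is unchanged, so the same shift can be subtracted."""
    if len(y) % 2:
        raise ValueError("input length must be even")
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    x2 = [b - s for b, s in zip(y2, shift(y1))]
    return y1 + x2


signal = [0.5, -1.0, 2.0, 3.0]
reconstructed = inverse(forward(signal))
```

Without quantization in between, the reconstruction matches the input up to floating-point rounding, so in a codec built on such a path the remaining distortion would come from quantization rather than from the transform itself.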