Table of Contents

cs.CL

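[1] How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures cs.CL | cs.AI

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

TL;DR: This paper analyzes token-level uncertainty signals in language model reasoning traces and identifies two distinguishable failure modes: “committed failure”, in which the model locks onto an incorrect reasoning path early, and “persistent uncertainty”, in which uncertainty remains throughout the trace. The signatures of these failure modes reproduce across 23 model-dataset configurations, and the study shows when uncertainty signals complement self-consistency.

Details

Motivation: To understand the mechanisms behind language model reasoning failures and, by identifying observable signatures of the failure process, to give failure detection strategies a principled basis.

Result: Across 23 model-dataset configurations, the framework’s falsifiable predictions hold in 20 cases, well above chance. The study also shows how uncertainty signals and self-consistency work together.

Insight: The paper separates two fundamentally different failure mechanisms using token-level uncertainty trajectories and defines the “commitment point” as a key diagnostic signature, yielding an actionable, quantitative framework for adapting failure detection strategies.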

Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework’s falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate that our failure-mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.
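
The commitment-point idea can be illustrated with a small sketch. It is not the authors' procedure: it assumes the uncertainty signal is a per-token entropy, that a trace is scored by the mean entropy of its first k tokens, and that detection quality is measured by AUROC. The function names (auroc, prefix_mean, commitment_point) are hypothetical.

```python
from typing import List, Sequence, Tuple

def auroc(scores_fail: Sequence[float], scores_ok: Sequence[float]) -> float:
    """Probability that a failing trace scores higher than a successful one (ties count half)."""
    wins = 0.0
    for f in scores_fail:
        for s in scores_ok:
            if f > s:
                wins += 1.0
            elif f == s:
                wins += 0.5
    return wins / (len(scores_fail) * len(scores_ok))

def prefix_mean(trace: Sequence[float], k: int) -> float:
    """Mean per-token uncertainty over the first k tokens of a trace."""
    prefix = trace[:k]
    return sum(prefix) / len(prefix)

def commitment_point(fail_traces: List[Sequence[float]],
                     ok_traces: List[Sequence[float]],
                     max_k: int) -> Tuple[int, float]:
    """Return the shortest prefix length with the best failure-detection AUROC."""
    best_k, best_auc = 1, -1.0
    for k in range(1, max_k + 1):
        auc = auroc([prefix_mean(t, k) for t in fail_traces],
                    [prefix_mean(t, k) for t in ok_traces])
        if auc > best_auc:  # strict comparison keeps the earliest best prefix
            best_k, best_auc = k, auc
    return best_k, best_auc

# Toy data (assumed): failing traces are uncertain early, then confidently wrong.
fail = [[2.0, 2.0, 0.1, 0.1], [1.8, 1.9, 0.1, 0.1]]
ok = [[0.5, 0.5, 1.5, 1.5], [0.6, 0.4, 1.6, 1.6]]
k_star, auc_star = commitment_point(fail, ok, max_k=4)
```

On this toy data the early prefix separates the two groups perfectly, while the full trace does not, which is the qualitative pattern the abstract describes for committed failures.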


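[2] Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning cs.CL

Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O’Donnell

TL;DR: This paper studies cross-lingual inconsistency in the multilingual factual recall of large language models and introduces the PolyFact dataset to evaluate and reduce it. Comparing continual pretraining, supervised fine-tuning, and GRPO-based reinforcement learning, it finds that GRPO performs best at improving cross-lingual consistency and generalization, and shows that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads.

Details

Motivation: Large language models are trained predominantly on English data. They encode substantial world knowledge but often fail to express it reliably in other languages, a problem known as cross-lingual factual inconsistency. The paper aims to study and address this problem.

Result: On PolyFact, GRPO outperforms supervised fine-tuning on both Qwen-2.5-7B and OLMo-2-1124-7B, improving cross-lingual consistency and generalization to unseen languages, while continual pretraining yields limited gains.

Insight: The paper introduces PolyFact, a large-scale parallel multilingual factual QA dataset, and uses GRPO reinforcement learning to improve cross-lingual factual recall. Its mechanistic analysis shows that GRPO promotes shared cross-lingual representations by reducing language specialization, offering a new perspective on optimizing multilingual models.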

Abstract: Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.


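[3] UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs cs.CL

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang

TL;DR: This paper introduces UnpredictaBench, a benchmark that evaluates how well large language models (LLMs) capture true underlying distributions. The benchmark contains 448 problems and introduces the KS@N metric, which uses the Kolmogorov-Smirnov statistical test to quantify how closely model outputs approximate a target distribution. Experiments show that existing models differ widely in distributional sampling ability and perform poorly overall: even simple distributional simulation remains challenging.

Details

Motivation: LLMs are increasingly used as substitutes for other entities (for example, humans in economic simulations), yet many models tend to collapse toward a single plausible answer and fail to capture the unpredictability of real systems. Existing work on improving output diversity is insufficient for simulation, which requires samples calibrated to a target distribution rather than merely varied outputs.

Result: Tests on open and proprietary models show a large spread in distributional capability. Under KS@100 (the standard metric), scores range from near 0% to over 20%, and no model exceeds 40%, indicating significant headroom in distributional sampling. Adding reasoning raises scores somewhat, but no immediate solution was found.

Insight: The paper proposes UnpredictaBench, the first benchmark dedicated to evaluating distributional randomness in LLMs, together with the general-purpose KS@N metric. It reduces the problem to sampling outcomes from individual target distributions, covering canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. This provides a necessary first evaluation tool for using LLMs as stand-ins for complex systems.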

Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model’s outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. KS@N is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
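
A minimal sketch of a KS@N-style score follows. It assumes a two-sample Kolmogorov-Smirnov test at significance level 0.05 with the standard large-sample critical value; the abstract does not state the benchmark's exact test settings, so the threshold and the function names (ks_statistic, fails_to_reject, ks_at_n) are assumptions for illustration.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple

def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

def fails_to_reject(model: Sequence[float], truth: Sequence[float],
                    alpha: float = 0.05) -> bool:
    """True when the test does not reject that both samples share a distribution.

    Uses the asymptotic critical value c(alpha) * sqrt((n + m) / (n * m)),
    an approximation that is reasonable for samples of around 100 points.
    """
    n, m = len(model), len(truth)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(model, truth) <= critical

def ks_at_n(problems: List[Tuple[Sequence[float], Sequence[float]]],
            alpha: float = 0.05) -> float:
    """Fraction of problems whose model sample passes the KS test."""
    passed = sum(fails_to_reject(m, t, alpha) for m, t in problems)
    return passed / len(problems)

# Toy check: a faithful sample passes, a sample collapsed onto one value does not.
truth = [float(i) for i in range(100)]
collapsed = [5.0] * 100
score = ks_at_n([(truth, truth), (collapsed, truth)])
```

The collapsed sample mimics the "single plausible answer" failure the abstract describes: its empirical CDF jumps from 0 to 1 at one point, so the KS statistic is large and the test rejects.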


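[4] CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures cs.CL | cs.AI

Jakub Bąba, Jarosław Chudziak

TL;DR: This paper presents CAF-Gen, an automated multi-agent system that enriches shallow argument structures into models compliant with the Carneades Argumentation Framework (CAF). The system uses an iterative Creator-Reviewer pipeline in which agents collaborate to improve the structural integrity and quality of the generated argument models. Experiments show that the approach overcomes the limitations of single-pass generation while staying closely aligned with the original annotations.

Details

Motivation: Current argument mining techniques identify basic claims and premises but struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework, including premise types, proof standards, and argument schemes.

Result: Experiments show that the iterative feedback loop improves the quality of the generated data, achieves strong alignment with the original annotations, and produces structurally richer models.

Insight: The paper introduces an iterative, multi-agent Creator-Reviewer pipeline to mitigate the structural instability typical of single-pass generative models, providing a robust methodology for the automated modeling of formal argumentation.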

Abstract: Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent’s output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.


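[5] What Do People Actually Want From AI? Mapping Preference Plurality cs.CL | cs.CY

Julia Sepúlveda Coelho, Scott A. Hale

TL;DR: By analysing 1,500 open-ended responses from 75 countries in the PRISM dataset, this paper exposes limitations of aligning large language models (LLMs) with Reinforcement Learning from Human Feedback (RLHF). People’s preferences for AI are highly plural and often conflicting: for example, respondents disagree fundamentally on what “truthfulness” means, and binary comparisons cannot capture contextual distinctions.

Details

Motivation: To address fundamental problems of current RLHF-based alignment, namely aggregating conflicting preferences, relying on unrepresentative samples, and using only binary comparisons, and to examine what people actually want from AI systems.

Result: Qualitative analysis shows that most values are mentioned by fewer than a quarter of respondents, with truthfulness the exception, and that the same term (such as truthfulness) hides very different epistemological bases (sourced claims, expert opinions, or unpopular views). Certain capabilities (such as human-like behaviour) and features (such as AI guardrails) are directly contested.

Insight: Open-ended responses reveal the concrete shape of preference plurality and challenge the current practice of flattening complex, situated preferences into a single reward model. The study underlines the need for finer-grained, context-aware preference modelling in AI alignment.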

Abstract: Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people’s preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by “truthfulness”, they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do “by default” versus “if requested”) that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users’ clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.


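[6] Signal-Driven Observation for Long-Horizon Web Agents cs.CL

Shubham Gaur, Ian Lane

TL;DR: This paper proposes Signal-Driven Observation (SDO) to address the context degradation that long-horizon web agents suffer from repeatedly reading large DOM trees (typically tens of thousands of tokens). The method decouples observation frequency from action frequency: the page is read only when specific signals fire, and only task-relevant elements are returned, which substantially compresses observation overhead.

Details

Motivation: Current web agents process raw DOM and accessibility trees at every action step, so context progressively degrades over long tasks and reasoning suffers. The authors argue that coupling observation frequency to action frequency is an architectural mistake.

Result: The abstract reports no quantitative experiments or benchmark comparisons. Building on an insight from Recursive Language Models, the paper proposes SDO as a way to make observation more efficient and outlines the open problems it introduces.

Insight: The core contribution is to treat observation compression as a core architectural decision in web agent design, introducing a dedicated sub-call and a lightweight signal detector that trigger observation dynamically and so avoid unnecessary processing. It draws on the idea that querying a document outperforms reading it wholesale.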

Abstract: Web agents operating over long horizons ingest raw DOM and accessibility trees – routinely tens of thousands of tokens – at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an architectural mistake. Drawing on the insight from Recursive Language Models that querying a document outperforms reading it wholesale, we propose Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and is re-invoked only when a lightweight signal detector fires – triggered by URL transitions, newly visible interactive elements, action failures, or exogenous browser events. We outline the open problems SDO introduces and call on the community to treat observation compression as a core architectural decision in web agent design.
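
The four triggers listed in the abstract can be sketched as a small detector. This is a hypothetical illustration, not the authors' implementation: the BrowserState fields and the should_reobserve function are assumed names, and how each signal is actually measured in a browser is left open.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class BrowserState:
    """Cheap-to-collect page summary (hypothetical fields)."""
    url: str
    interactive_ids: FrozenSet[str]   # identifiers of visible interactive elements
    last_action_failed: bool = False
    exogenous_event: bool = False     # e.g. a dialog the agent did not cause

def should_reobserve(prev: BrowserState, curr: BrowserState) -> bool:
    """Fire when any of the four signals named in the abstract is present."""
    if curr.url != prev.url:                          # URL transition
        return True
    if curr.interactive_ids - prev.interactive_ids:   # newly visible interactive elements
        return True
    if curr.last_action_failed:                       # action failure
        return True
    if curr.exogenous_event:                          # exogenous browser event
        return True
    return False

before = BrowserState("https://example.com/cart", frozenset({"checkout"}))
same = BrowserState("https://example.com/cart", frozenset({"checkout"}))
new_button = BrowserState("https://example.com/cart", frozenset({"checkout", "apply-coupon"}))
```

Only when should_reobserve returns True would the expensive full-DOM sub-call run again; otherwise the agent keeps acting on the elements it already holds.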


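[7] When to Think Deeply: Inhibitory Deliberation for LLM Reasoning cs.CL

Zhixuan He, Yue Feng

TL;DR: This paper proposes IDPR, a framework for response-conditioned inhibitory deliberation in LLM reasoning. IDPR first generates a concise intuitive answer; an inhibition controller then uses the fast answer and its associated evidence (confidence, logit margin, parseability, and generation cost) to decide whether to release that answer or suppress it and invoke slow, deliberative reasoning instead.

Details

Motivation: Deliberative reasoning in large language models is computationally costly. Invoking slow reasoning for every input is expensive and often unnecessary, so a mechanism is needed to decide when deep thinking is warranted.

Result: On a held-out test set of 5,000 mathematical reasoning examples, IDPR invokes slow reasoning on only 8.20% of examples and raises accuracy from 47.90% to 48.92%. Under the same slow-call budget, it outperforms random routing (46.76% accuracy) and the strongest confidence-based baseline (48.22% accuracy), and it achieves the highest corrective precision.

Insight: The inhibition controller conditions not only on the input but also on the fast answer itself and several evidence features. Compared with input-only routers, this identifies more precisely which fast answers benefit from slow reasoning, so the accuracy gain is obtained while slow reasoning is invoked on only a small fraction of inputs.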

Abstract: Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
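
The threshold-selection step can be sketched as follows. The sketch assumes the controller emits one inhibition score per example and that validation data records whether the fast and slow answers were correct; it picks the threshold with the best validation accuracy among those that stay within a slow-call budget. The function names and the exact meaning of an "accuracy-first" budget are assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple

def routed_accuracy(scores: Sequence[float], fast_ok: Sequence[int],
                    slow_ok: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Accuracy and slow-call rate when answers scoring >= threshold are suppressed."""
    correct = 0
    slow_calls = 0
    for s, f, sl in zip(scores, fast_ok, slow_ok):
        if s >= threshold:      # inhibit the fast answer, pay for slow reasoning
            slow_calls += 1
            correct += sl
        else:                   # release the fast answer
            correct += f
    n = len(scores)
    return correct / n, slow_calls / n

def select_threshold(scores: Sequence[float], fast_ok: Sequence[int],
                     slow_ok: Sequence[int], budget: float) -> Tuple[float, float]:
    """Best (threshold, accuracy) on validation data with slow-call rate <= budget."""
    never = float("inf")        # baseline: never invoke slow reasoning
    best = (never, routed_accuracy(scores, fast_ok, slow_ok, never)[0])
    for t in sorted(set(scores)):
        acc, rate = routed_accuracy(scores, fast_ok, slow_ok, t)
        if rate <= budget and acc > best[1]:
            best = (t, acc)
    return best

# Toy validation set (assumed): high scores mark fast answers that were wrong.
val_scores = [0.9, 0.8, 0.2, 0.1]
val_fast_ok = [0, 0, 1, 1]
val_slow_ok = [1, 1, 1, 0]
chosen = select_threshold(val_scores, val_fast_ok, val_slow_ok, budget=0.5)
```

Because the controller sees the fast answer, a high score can reflect evidence such as a low logit margin or an unparseable output, which an input-only router would not have.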


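[8] Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards cs.CL | cs.AI

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

TL;DR: This paper proposes Progress-SQL, a multi-turn reinforcement learning framework for Text-to-SQL that improves on existing methods with progressive rewards. It uses an Oracle-guided Diagnostic Tree (ODT) to abstract SQL queries into clause-level structural profiles and provide diagnostic feedback, and it combines structural alignment, lexical alignment, and execution status to define a progressive reward that measures the improvement from the initial SQL to the final SQL.

Details

Motivation: Existing reinforcement learning methods for Text-to-SQL typically optimize a one-shot reward defined over a single SQL state. Such rewards give limited guidance for iterative SQL correction and fail to capture the improvement achieved through multi-turn refinement.

Result: Experiments on BIRD, Spider, and Spider robustness variants show that the method consistently improves Text-to-SQL performance in both primary and robustness evaluations.

Insight: The paper proposes a multi-turn reinforcement learning framework whose progressive reward combines ODT-based structural alignment with lexical alignment. It adds a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL, giving denser and more robust reward signals for iterative refinement.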

Abstract: Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
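
One way to picture how the three reward terms could fit together is the sketch below. The abstract does not give the formulas, so everything here is an assumption made for illustration: the linear mixing of structural and lexical alignment, the latency term, the weights, and the function names.

```python
from typing import Optional, Sequence

def alignment(structural: float, lexical: float, w_struct: float = 0.5) -> float:
    """Blend structural and lexical alignment scores, each assumed to lie in [0, 1]."""
    return w_struct * structural + (1.0 - w_struct) * lexical

def progressive_reward(turn_scores: Sequence[float],
                       first_correct_turn: Optional[int],
                       executed_ok: Sequence[bool],
                       w_latency: float = 0.2,
                       w_exec: float = 0.1) -> float:
    """Reward a multi-turn episode by its improvement, not by a single SQL state.

    turn_scores: alignment score of the SQL produced at each turn.
    first_correct_turn: 0-based turn at which the SQL first became correct, or None.
    executed_ok: whether the SQL at each turn executed without error.
    """
    # Progressive term: improvement from the initial SQL to the final SQL.
    progress = turn_scores[-1] - turn_scores[0]
    # Latency term: larger when correctness is reached in an earlier turn.
    n_turns = len(turn_scores)
    latency = 0.0 if first_correct_turn is None else (n_turns - first_correct_turn) / n_turns
    # Execution-status term: bonus for recovering from an initially invalid query.
    recovered = (not executed_ok[0]) and executed_ok[-1]
    return progress + w_latency * latency + w_exec * (1.0 if recovered else 0.0)

episode = progressive_reward(
    turn_scores=[alignment(0.25, 0.25), alignment(0.5, 0.5), alignment(1.0, 1.0)],
    first_correct_turn=2,
    executed_ok=[False, True, True],
)
```

A one-shot reward would score only the final query; the progressive term instead credits the episode for moving from 0.25 to 1.0, which is the distinction the abstract draws.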


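[9] Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning cs.CL

Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj

TL;DR: This paper proposes Translate-R1, a reinforcement-learning policy for invoking a translation tool, aimed at the uneven performance of large language models across languages. A single trained policy lets the model decide adaptively, based on how well it understands the input, whether to call the translation tool, preserving task performance while reducing compute cost.

Details

Motivation: LLM performance differs markedly across languages, and improving low-resource languages natively through pretraining or fine-tuning usually founders on a lack of corpora. Translating every input is wasteful, while leaving the decision to the model fails because overconfident models skip the tool. Existing methods rely on hand-designed rules or external routers and do not generalize.

Result: The post-trained Qwen3-4B is further trained with RL on data covering 22 languages (in High, Low, and XLow resource tiers) and 5 domains. The proposed confidence-gated GSPO policy lifts reward over the baseline by +4.6, +23.5, and +17.5 on the three tiers respectively. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. On two synthetic languages, the policy improves by +18.7 over the overconfident baseline, and it transfers zero-shot to 9 held-out languages.

Insight: The core contribution is learning a single, general translation-invocation policy from reward alone, producing language- and domain-adaptive introspection. The method avoids manual engineering, balances task performance against compute cost through confidence gating, and generalizes well.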

Abstract: The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model’s dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.


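[10] Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces cs.CL | cs.AI | cs.LG

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

TL;DR: This paper studies how modern reasoning models handle zero-shot multi-label tasks, decomposing the process into two phases, candidate “shortlisting” and fine-grained reasoning, and builds on this a mechanistic distillation strategy that outperforms standard distillation.

Details

Motivation: To investigate the internal mechanism by which modern reasoning models achieve strong zero-shot performance on hard multi-label tasks that require selecting a few relevant options from a very large pool of candidate labels.

Result: Evidence across multiple datasets shows that the two reasoning phases (shortlisting and fine-grained reasoning) can be isolated and are complementary, and the mechanistic distillation strategy built on them consistently outperforms standard distillation.

Insight: The core contribution is to decouple a complex reasoning process mechanistically into two complementary phases and to use that understanding to guide distillation, improving its results and offering a new view of model reasoning in large output spaces.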

Abstract: Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: a broad “shortlisting” of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.


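[11] Are Large Language Models Suitable for Graph Computation? Progress and Prospects cs.CL

Yuting Zhang, Yi Han, Kai Wang, Wei Ni, Angela Bonifati

TL;DR: This paper is a comprehensive survey of large language models for graph computation. It proposes a role-based taxonomy that divides the role of LLMs into two paradigms, executors and planners, and analyzes the strengths and limitations of current methods.

Details

Motivation: Existing studies of LLMs and graphs focus mainly on graph learning, text-attributed graphs, or graph-language modeling. They lack a systematic analysis of when LLMs can reliably support graph computation and how they should be integrated, a gap this paper aims to fill.

Result: The review indicates that LLMs are promising for simple, small-scale tasks but remain unreliable for large-scale and exactness-demanding tasks.

Insight: The paper proposes a role-based classification framework (executors versus planners) that gives a structured view of what LLMs do in graph computation, and it suggests four future research directions.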

Abstract: Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.


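[12] CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification cs.CL

Chenshuo Pan, Yu Zhao, Jie Zhang, Changzai Pan, Zhenhe Wu

TL;DR: This paper proposes CRAFT, a unified counterfactual reasoning framework for tabular question answering and fact verification. The framework constructs declarative statements and their counterfactual variants, extracts evidence along both the original and counterfactual paths, and integrates it through a weighted mechanism to produce the final answer. Experiments show that CRAFT surpasses existing baselines on datasets such as WikiTQ and TabFact, with especially large gains on complex question answering, and narrows the performance gap between different backbone LLMs.

Details

Motivation: Existing table reasoning methods rely mainly on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. The paper uses counterfactual reasoning to overcome this limitation and to establish a more principled paradigm for structured reasoning tasks.

Result: On table reasoning datasets such as WikiTQ and TabFact, CRAFT consistently surpasses representative baselines, with especially large improvements on complex question answering. The framework also significantly reduces performance gaps between backbone LLMs.

Insight: The core contribution is to reformulate tabular question answering and fact verification as one general bidirectional verification process, explicitly constructing counterfactual variants and integrating evidence from both directions to make the model’s reasoning more discerning. Building counterfactual thinking systematically into the reasoning pipeline offers a novel approach to complex reasoning over structured data.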

Abstract: Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.
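
The weighted integration of the two reasoning paths might look like the following for a fact-verification case. This is a sketch under stated assumptions, not CRAFT's actual mechanism: it assumes each path yields a probability that its statement is supported by the table, treats support for the counterfactual as evidence against the original, and uses a hypothetical fixed weight.

```python
def combine_evidence(p_original: float, p_counterfactual: float,
                     weight: float = 0.6) -> float:
    """Blend direct support with the complement of counterfactual support.

    p_original: probability (assumed) that the original statement holds.
    p_counterfactual: probability (assumed) that the counterfactual variant holds.
    weight: share given to the original path; 0.6 is an arbitrary illustration.
    """
    return weight * p_original + (1.0 - weight) * (1.0 - p_counterfactual)

def verdict(p_original: float, p_counterfactual: float,
            weight: float = 0.6, threshold: float = 0.5) -> bool:
    """Accept the original statement when the combined score reaches the threshold."""
    return combine_evidence(p_original, p_counterfactual, weight) >= threshold

# Both paths agree: the statement looks supported and its counterfactual does not.
agree = combine_evidence(0.9, 0.2)
# The original path is mildly positive, but the counterfactual is strongly supported.
conflict = verdict(0.55, 0.9)
```

The second case shows why a second direction can help: single-direction reasoning would accept the statement at 0.55, whereas the counterfactual path pulls the combined score below the threshold.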


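[13] EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering cs.CL | cs.AI

Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang, Haohan Wang

TL;DR: This paper proposes EASE-TTT, a method for long-context question answering with small language models. It combines within-context retrieval with lightweight test-time training: retrieved evidence chunks are converted into a soft attention supervision target that guides how the model allocates attention over the full context, improving the generated answers.

Details

Motivation: Existing methods have limits. Within-context retrieval only exposes evidence chunks and does not adapt the model’s attention parameters, while lightweight test-time training methods such as qTTT use self-supervised objectives that cannot locate the context positions supporting the answer. A test-time adaptation method is needed that aligns with evidence and guides attention allocation.

Result: Experiments on six LongBench QA tasks with three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT.

Insight: Retrieved evidence chunks are converted into a soft attention supervision target that guides test-time adaptation of query-side parameters, so the model keeps the full context while focusing attention on relevant evidence positions. This offers an evidence-aligned approach to test-time adaptation in long-context QA.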

Abstract: Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.
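
Converting evidence chunks into a soft attention target can be sketched in plain Python. The abstract does not specify how the target or the loss is defined, so this sketch assumes mass proportional to a per-chunk relevance score, spread over the chunk's token positions, and a cross-entropy loss against the model's attention distribution. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def attention_target(seq_len: int,
                     chunks: Sequence[Tuple[int, int, float]]) -> List[float]:
    """Build a distribution over context positions from evidence chunks.

    chunks: (start, end_exclusive, relevance) triples for the selected chunks.
    Positions outside every chunk receive zero target mass.
    """
    target = [0.0] * seq_len
    for start, end, relevance in chunks:
        for pos in range(start, end):
            target[pos] += relevance
    total = sum(target)
    return [t / total for t in target]

def attention_loss(attn: Sequence[float], target: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Cross-entropy between the target and the model's attention weights."""
    return -sum(t * math.log(a + eps) for a, t in zip(attn, target))

# Six context positions; two evidence chunks, the second judged more relevant.
target = attention_target(6, [(1, 3, 1.0), (4, 5, 2.0)])
uniform = [1.0 / 6] * 6
loss_uniform = attention_loss(uniform, target)
loss_aligned = attention_loss(target, target)
```

In a test-time training loop, a loss of this kind would be minimized with respect to query-side parameters only, while the model still reads the original full context when generating the answer.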


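[14] ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning cs.CL | cs.AI | cs.LG

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva

TL;DR: This paper presents ThinkBooster, a unified framework for seamlessly scaling test-time compute (TTC) for LLM reasoning. It comprises a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, a benchmark that jointly evaluates performance and computational efficiency, and a deployable OpenAI-compatible proxy service for drop-in integration of adaptive reasoning into real applications.

Details

Motivation: Existing test-time compute scaling strategies and reasoning scorers are fragmented, evaluated under inconsistent protocols, and rarely analyzed in terms of quality-cost trade-offs. The paper addresses this with a unified framework for systematically evaluating and deploying reasoning scaling strategies.

Result: Empirical results on mathematical and coding tasks reveal the trade-offs between performance and compute for different TTC scaling strategies and scoring methods, and show that ThinkBooster provides practical gains on real-world tasks.

Insight: The main contribution is a unified framework that integrates a library, a benchmark, and a deployment service, enabling systematic evaluation and easy integration of reasoning scaling strategies. It pairs performance evaluation with compute-efficiency benchmarking and provides a visual debugger, giving research and applications practical infrastructure.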

Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
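
The "multi-sample generation and verifier-based reranking" pattern mentioned in the abstract is commonly called best-of-N. The sketch below shows that generic pattern only; it does not use or reproduce the ThinkBooster library's API, and the generate and score callables are hypothetical stand-ins for a model call and a verifier.

```python
from typing import Callable, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str, str], float],
              n: int) -> Tuple[str, float]:
    """Sample n candidates and return the one the scorer ranks highest.

    generate(prompt, seed) -> candidate answer (stand-in for an LLM call).
    score(prompt, candidate) -> verifier score, higher is better.
    Cost grows linearly with n: n generations plus n scorer calls.
    """
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score

# Toy stand-ins: the "model" proposes digits, the "verifier" prefers answers near 3.
def toy_generate(prompt: str, seed: int) -> str:
    return str(seed * 7 % 5)

def toy_score(prompt: str, candidate: str) -> float:
    return -abs(int(candidate) - 3)

answer, answer_score = best_of_n("What is 1 + 2?", toy_generate, toy_score, n=5)
```

The quality-cost trade-off the abstract highlights is visible even here: raising n can only improve the selected score, but every extra sample adds one generation and one scoring call.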


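[15] Principles of Concept Representation in Sentence Encoders cs.CL | cs.DB

Isabelle Mohr, John Dujany, Jonathan Souquet, Andre Freitas

TL;DR: This paper studies what makes a sentence encoder produce good concept representations. Through the lens of representational compositionality, it argues that an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. A controlled ablation over a large set of synonym and definition pairs identifies four principles, and two new evaluation datasets are released.

Details

Motivation: To examine the mechanisms by which sentence encoders produce effective concept representations, in particular how well encoder structure matches its supervision from the standpoint of representational compositionality, and so to explain on which concept families current encoders succeed or fail.

Result: Encoders are trained on 3.3 million pairs from WordNet and Wiktionary and evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark. Fine-tuning mainly recalibrates the latent geometry rather than expanding it; semantic signal concentrates in the final transformer layer before concept-specific training begins; hard negatives improve discrimination but not retrieval ranking; and the effectiveness of supervision depends on the composition type of the target concept.

Insight: The paper analyzes encoder capability within a representational-compositionality framework and empirically supports four principles (the role of fine-tuning, the concentration of semantic signal in one layer, the independent effect of hard negatives, and the dependence of supervision on composition type). It exposes a structural limitation of current training paradigms and contributes new evaluation benchmarks.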

Abstract: What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.


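[16] Didact: A Cross-Domain Capability Discovery System for Defence cs.CL | cs.AI

Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Thomas Marchant, Oscar Leslie

TL;DR: This paper presents Didact, a prototype cross-domain capability discovery system for defence. It integrates publicly available Australian defence reports and policy documents with a knowledge graph built from research publications, supports natural language conversations through a composite retrieval-augmented generation (RAG) pipeline, and offers an interactive Evidence Rail visualisation.

Details

Motivation: Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research and sector priorities, but the sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit.

Result: The evaluation of output quality and runtime shows that Didact is useful in practice. Although it was co-developed as an academia-industry project for the Australian context, it can be adapted to other domains where knowledge is similarly fragmented.

Insight: The main contribution is integrating public defence documents with a purpose-built knowledge graph and combining a composite RAG pipeline with interactive evidence visualisation (the Evidence Rail) to make policy workflows more explainable and auditable. Its design is adaptable across domains.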

Abstract: Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:


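[17] mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages? cs.CL | cs.CY

Yerzhan Sapenov, Jaromir Savelka

TL;DR: The paper introduces mmPISA-bench, a compact, high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). It contains 25 multiple-choice questions that require reasoning, each available in official human translations and machine translations across 43 languages. The study evaluates two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types, finding that modern LLMs reason effectively in all evaluated languages with accuracy comparable to human test-takers, although performance varies by language.

Details

Motivation: To assess whether LLMs reason equally well across languages, and to examine whether machine-translated data is adequate for large-scale multilingual reasoning evaluation, especially where official translations are unavailable.

Result: On mmPISA-bench, modern LLMs achieve accuracy comparable to human test-takers in all evaluated languages, with some cross-language variation. Machine-translated questions do not degrade accuracy relative to official human translations, suggesting that high-quality machine translation (synthetic data) is often adequate for evaluation.

Insight: The paper builds a compact, high-quality multilingual reasoning benchmark with broad language coverage (43 languages) and shows empirically that high-quality machine translation can stand in for official translations in large-scale multilingual evaluation. It also finds that LLM usage in some languages is both more expensive and less accurate, which matters for cost-benefit analysis when deploying multilingual models.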

Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages and achieve accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.


[18] TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents cs.CL | cs.AI | cs.CR | cs.LGPDF

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard

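TL;DR: This paper proposes TRACE, a framework for monitoring autonomous LLM agents that pursue hidden malicious objectives. Through a Triage-Inspect-Judge loop, it adaptively aggregates evidence across time steps to detect malicious behavior composed of individually benign-looking actions in long-horizon trajectories.

Details

Motivation: Existing methods either evaluate the complete trajectory in a single pass or partition it into independently scored windows. Both struggle to connect evidence from temporally distant actions, and so cannot reliably detect hidden malicious objectives pursued through sequences of benign actions.

Result: Across ten task domains from SHADE-Arena, evaluated against state-of-the-art baselines, TRACE achieves an aggregate F1 of 0.713 and a recall of 0.844, with the largest gains on tasks that require long-range evidence linking.

Insight: The novelty is a reasoning framework for adaptive cross-step evidence aggregation (the TIJ loop) that dynamically identifies high-signal regions, inspects them in a targeted way, and maintains accumulated evidence across reasoning steps to synthesize a trajectory-level verdict. This offers a new approach to detecting malicious behavior with long-range dependencies.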

Abstract: Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.
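
To help read the two reported metrics together, the sketch below inverts the standard F1 definition, F1 = 2PR / (P + R), to recover the precision that a given F1 and recall would imply. It assumes both aggregate numbers were computed from the same pooled counts. The abstract does not say how the aggregation was done, so if the scores were macro-averaged across domains this relationship need not hold, and the value printed here is not a figure reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denominator


# ASSUMPTION: F1 and recall come from the same pooled confusion counts.
implied_precision = precision_from_f1(0.713, 0.844)
print(round(implied_precision, 3))
```

Under that assumption, a recall well above the F1 indicates a monitor tuned to catch most sabotage at the cost of more false alarms.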


[19] From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning cs.CLPDF

Yuhang Zhou, Yixin Cao, Guangnan Ye

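TL;DR: This paper proposes a new way to evaluate LLM reasoning prefixes, shifting from local step correctness to how much a prefix raises the final solve rate. The authors define "prefix gain" as the solve-rate improvement a prefix induces in a group of lightweight student models, and use it to train a Prefix Utility Model (PUM). Across several search strategies on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets are high, or rule-based rewards are sparse.

Details

Motivation: Existing process reward models usually evaluate reasoning prefixes by local step correctness. The authors argue that correctness is only an indirect proxy; what ultimately matters is whether a prefix increases the probability of successful completion, which calls for a more direct measure of prefix utility.

Result: In experiments on mathematical reasoning with Best-of-N selection, beam search, and reinforcement learning, PUM provides a strong prefix-level supervision signal and improves performance when candidate pools are large, search budgets increase, or rule-based rewards are sparse.

Insight: The core idea is "prefix gain", which moves evaluation from local correctness to a prefix's direct contribution to final success, together with a general Prefix Utility Model trained on it. This gives a more direct supervision signal for evaluating and optimizing reasoning, particularly in complex or sparse-reward settings. Viewed objectively, it is an outcome-grounded, utility-oriented shift in evaluation that may help improve the efficiency and reliability of LLM reasoning.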

Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at https://zhiqix.github.io/pum-project-page.
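
The definition of prefix gain can be sketched as a difference of solve rates, paired with one common form of pairwise ranking loss. The abstract only says the objective is "a simple pairwise ranking objective", so the logistic (Bradley-Terry style) loss below is an assumption, and all function names and outcome lists are hypothetical illustrations rather than the authors' code or data.

```python
import math


def solve_rate(outcomes):
    """Fraction of successful completions (1 = solved, 0 = failed)."""
    return sum(outcomes) / len(outcomes)


def prefix_gain(outcomes_with_prefix, outcomes_without_prefix):
    """Solve-rate improvement when student models are conditioned on a prefix."""
    return solve_rate(outcomes_with_prefix) - solve_rate(outcomes_without_prefix)


def pairwise_ranking_loss(score_preferred, score_other):
    """ASSUMED loss form: log(1 + exp(-(s_preferred - s_other))).

    It is small when the prefix with the higher gain also gets the higher score.
    """
    return math.log1p(math.exp(-(score_preferred - score_other)))


# Hypothetical student outcomes on one problem, with and without the prefix.
gain = prefix_gain([1, 1, 0, 1], [0, 1, 0, 0])
print(gain)  # 0.5

# A utility model that ranks the pair correctly incurs the smaller loss.
loss_correct_order = pairwise_ranking_loss(2.0, 0.0)
loss_wrong_order = pairwise_ranking_loss(0.0, 2.0)
print(loss_correct_order < loss_wrong_order)  # True
```

A gain can be negative, which is the case where a locally plausible prefix steers the students away from a solution they would otherwise reach.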


[20] When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations cs.CL | cs.AI | cs.LGPDF

Mahdi Alkaeed

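TL;DR: This study systematically evaluates how sensitive general-purpose and medical-specific large language models are to prompt perturbations in clinical reasoning tasks. It finds that even minor rewording or adversarial prompts can produce inconsistent, inaccurate, or harmful clinical advice, exposing the fragility and unreliability of LLMs in high-stakes domains such as healthcare.

Details

Motivation: LLMs are increasingly used in healthcare, yet they are highly sensitive to subtle prompt perturbations, both lexical and syntactic, which poses serious risks in safety-critical clinical applications.

Result: On the MedMCQA benchmark, medical LLMs are found not to be intrinsically safe. Models show some resilience to simple lexical substitutions or paraphrasing but tend to break down under syntactic reordering or misleading contextual cues, and adversarial manipulations can lead to clinically dangerous outputs such as incorrect dosage recommendations.

Insight: The contribution is a systematic sensitivity analysis of prompt robustness in medical LLMs that distinguishes natural from adversarial perturbations. Viewed objectively, the core finding is that domain-specific models are no more reliable than general-purpose ones on critical tasks, which underlines the need for rigorous robustness evaluation before deployment.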

Abstract: Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.


[21] KIT’s Submission to Cross-Lingual Voice Cloning in IWSLT 2026 cs.CL | cs.SDPDF

Seymanur Akti, Alexander Waibel

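TL;DR: This paper describes the KIT team's system submitted to the IWSLT 2026 Cross-Lingual Voice Cloning task. The system builds on the multilingual TTS model FishAudio-S2-Pro, adds language tag prompting to strengthen language control and reduce accent leakage, applies reinforcement learning fine-tuning to improve intelligibility, and proposes a reference-conditioned lexical matching method to improve the pronunciation of domain-specific terms.

Details

Motivation: To address the central challenge of cross-lingual voice cloning: maintaining the intelligibility and naturalness of generated speech in the presence of accent variation and domain-specific vocabulary.

Result: In the IWSLT 2026 Cross-Lingual Voice Cloning task, language prompting provides the largest gains, while lexical matching yields consistent improvements on subsets with lexical overlap.

Insight: The novel elements are integrating language tag prompting into a multilingual TTS model for fine-grained language control, reinforcement learning fine-tuning for task adaptation, and reference-conditioned lexical matching for domain terms. Together these strategies improve cross-lingual voice cloning performance.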

Abstract: Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.


[22] LLM-Guided Evolution for Medical Decision Pipelines cs.CL | cs.NEPDF

Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov

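TL;DR: This paper proposes LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies, avoiding costly fine-tuning or manual prompt engineering. It formulates urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts, optimized by task-specific fitness functions.

Details

Motivation: Adapting large language models (LLMs) to clinical workflows usually requires costly fine-tuning or manual prompt and pipeline engineering. The paper explores an alternative that automatically discovers effective medical decision strategies at inference time.

Result: Evolution outperforms manually designed baselines on all three medical tasks. In triage, evolved programs raise Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, and improve safety-weighted held-out performance on MIMIC-ESI. In interactive consultation and PneumoniaMNIST image classification, evolved policies also improve the accuracy-cost frontier and improve frozen MedGemma VLMs.

Insight: The novelty lies in combining an evolutionary algorithm (MAP-Elites) with LLM guidance to search for and optimize interpretable, program-level decision strategies at inference time. The gains come from interpretable mechanisms (such as calibrated triage boundaries and targeted evidence acquisition) rather than superficial prompt rewording, suggesting a new route to automatically building medical decision pipelines.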

Abstract: Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from $77.3\%$ to $87.1\%$ and emergency recall from $0.60$ to $0.97$, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy–cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms (calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules) rather than superficial prompt rewording alone.
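
For readers unfamiliar with MAP-Elites, the following is a minimal, generic sketch of the archive loop: candidates are binned by a behaviour descriptor, and each bin keeps only its fittest candidate. It is not the authors' implementation. The toy candidate (a single threshold), the fitness function, the descriptor, and the random Gaussian mutation are all hypothetical stand-ins; in the paper the candidates are executable artifacts and the variation step is LLM-guided.

```python
import random


def map_elites(fitness, descriptor, mutate, sample_initial,
               n_initial=10, n_iterations=500, seed=0):
    """Return an archive mapping each descriptor cell to (score, candidate)."""
    rng = random.Random(seed)
    archive = {}

    def consider(candidate):
        cell = descriptor(candidate)
        score = fitness(candidate)
        # Keep a candidate only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    # Seed the archive with random candidates.
    for _ in range(n_initial):
        consider(sample_initial(rng))
    # Repeatedly pick an existing elite, vary it, and try to place the child.
    for _ in range(n_iterations):
        _, parent = rng.choice(list(archive.values()))
        consider(mutate(parent, rng))
    return archive


# Hypothetical toy problem: a candidate is one threshold in [0, 1].
def toy_fitness(x):
    return -(x - 0.7) ** 2          # best possible score is 0 at x = 0.7


def toy_descriptor(x):
    return min(int(x * 5), 4)       # five behaviour cells


def toy_mutate(x, rng):
    # Random perturbation stands in for the LLM-guided variation step.
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.1)))


def toy_initial(rng):
    return rng.random()


archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, toy_initial)
for cell in sorted(archive):
    score, candidate = archive[cell]
    print(cell, round(candidate, 3), round(score, 4))
```

Keeping one elite per cell is what distinguishes MAP-Elites from a plain best-so-far search: the archive retains diverse strategies, not only the single top scorer.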


[23] M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions cs.CLPDF

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen

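TL;DR: This paper presents M³Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation that covers cross-modal grounding and implicit information inference. Benchmarking shows that existing MLLMs and memory systems fall short in cross-modal grounding, cross-session reasoning, and the efficiency of accumulating multimodal context.

Details

Motivation: Existing benchmarks assume a human-human interaction form with sparse visuals and straightforward content. They evaluate neither an agent's reasoning over authentic multimodal file interaction nor its interpretation of concealed user information, so a more realistic benchmark is needed.

Result: Testing on M³Exam reveals clear gaps in existing models. The proposed M³Proctor method detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

Insight: The contributions are a multimodal conversational memory benchmark that is closer to real user-agent interaction, and an efficient memory method that is aware of query modality bias and processes raw visual information on demand, improving performance while substantially reducing computational overhead.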

Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.


[24] Sycophantic Praise: Evaluating Excessive Praise in Language Models cs.CLPDF

Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang

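TL;DR: This paper studies sycophantic praise in language models. It proposes a parameterized framework that measures whether praise is excessive relative to contribution quality and user ability, shows that the framework agrees with human annotations substantially better than generic LLM judges, and finds that sycophantic praise is more common in social and interpretive domains than in objective reasoning settings.

Details

Motivation: Research on sycophancy in language models has focused on excessive agreement or validation, while explicit praise and flattery have received little attention. The authors argue that sycophantic praise is a distinct alignment problem that current methods cannot reliably measure.

Result: The proposed framework substantially outperforms generic LLM judges in agreement with human annotations, and the analysis shows that sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings.

Insight: The paper treats praise calibration as an alignment challenge in its own right and introduces a parameterized evaluation framework based on contribution quality and user ability, providing a new way to detect excessive flattery by models.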

Abstract: Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment problem that cannot be reliably measured using current methods. We introduce a parameterized framework that measures whether praise is excessive relative to contribution quality and expected user ability. We show that our framework substantially outperforms generic LLM judges in agreement with human annotations, and that sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings. Together, these findings position praise calibration as a distinct alignment challenge.


[25] Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification cs.CL | cs.AIPDF

Sercan Karakaş, Yusuf Şimşek

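TL;DR: This paper studies the classification of Turkish idiomatic light verb constructions (LVCs), comparing a supervised baseline (BERTurk) with three instruction-tuned large language models under zero-shot, one-shot, and few-shot prompting. In zero-shot, LLMs handle negatives well but have low LVC recall. One-shot prompting sharply improves LVC detection but introduces model-specific prediction biases. A carefully designed few-shot prompt improves calibration and yields robust overall performance.

Details

Motivation: Turkish idiomatic LVCs often share the same surface form as fully literal verb-object combinations while functioning as partially idiomatic predicates. This makes them challenging for multiword expression processing and motivates a study of effective classification methods.

Result: Evaluation uses a manually created controlled set of 147 examples with matched negatives. The supervised baseline (BERTurk) remains competitive, while LLMs given carefully constructed demonstrations (such as GPT-OSS-20B and Qwen 2.5-14B) can match or exceed it on LVC classification. The results highlight prompt sensitivity.

Insight: The novelty is a systematic comparison of supervised learning and demonstration-based in-context learning on a metalinguistic classification task in a specific language (Turkish), with a detailed analysis of how demonstrations shift error profiles (one-shot introduces model-specific bias, few-shot improves calibration). It offers empirical evidence on the effectiveness of prompt engineering for low-resource language processing.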

Abstract: Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.


[26] How reliable are LLMs when it comes to playing dice? cs.CL | cs.AI | cs.HC | math.PRPDF

Luca Avena, Gianmarco Bet, Bernardo Busoni

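TL;DR: This paper evaluates the probabilistic reasoning of 8 frontier large language models on two datasets of discrete probability problems, one standard and one counterintuitive. Models reach an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones, show marked token bias, and are easily swayed by misleading prompts, indicating that current LLMs are not yet genuine probabilistic reasoners.

Details

Motivation: To examine how reliably large language models reason about discrete probability problems, and in particular whether they rely on heuristic thinking rather than rigorous probability calculation.

Result: Average accuracy is 0.96 on standard problems and only 0.59 on counterintuitive ones. Performance drops by over 20% when canonical formulations are replaced by disguised variants, and by up to 34% when misleading suggestions are embedded in the prompt; no model proves immune.

Insight: The study reveals systematic fragility in LLM probabilistic reasoning, especially sensitivity to counterintuitive problems and changes in formulation. Even with Chain-of-Thought prompting, models remain vulnerable to token bias and misleading information, which points to key directions for evaluating and improving their logical reasoning.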

Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.


cs.CV [Back]

[27] Applying Deep Learning for cockpit segmentation in the context of mixed reality cs.CV | cs.GRPDF

Alexandre Leles Sousa, Pedro de Oliveira Nielson, Erick Oliveira Rodrigues, Rafael Francisco dos Santos, Giovani Bernardes Vitor

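TL;DR: This paper proposes using deep learning to segment cockpit images in support of mixed reality applications. Real images are captured in the CAT793F mining truck simulator, and the U-net and DeepLabV3+ convolutional neural network architectures are applied to separate foreground from background, with the aim of facilitating the fusion of virtual and real images.

Details

Motivation: As first-person-view technologies advance, mixed reality needs to bring physical-world objects into virtual environments in real time. This requires improving the user's immersion in the simulated environment so that it comes closer to the desired reality.

Result: The segmentation models reach around 90% accuracy, and the best model was determined.

Insight: The contribution is applying the U-net and DeepLabV3+ architectures to image segmentation in a specific cockpit setting to support virtual-real fusion in mixed reality, which provides a reference for segmentation tasks in similar industrial simulation environments.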

Abstract: Computer vision is a continuously growing area. With the advance of first-person-view technologies, new development opportunities have emerged within it. Mixed reality promotes virtual environments in which objects from the physical world are shown in real time. This makes the user’s immersion in the simulated environment a concern, with the aim of bringing it ever closer to a desired reality. This paper proposes an image processing approach that segments images into foreground and background in order to facilitate the union of virtual and real images. To that end, the present work captures real images of the user with a camera in the CAT793F off-highway truck simulator and segments them with artificial intelligence techniques. The convolutional neural network architectures “U-net” and “DeepLabV3+” are applied to perform image segmentation. As a result, metrics with around 90% accuracy were obtained and the best model was determined.


[28] GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning cs.CVPDF

Haozhe Chi, Yang Jin, Yadong Mu

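TL;DR: This paper proposes GOPAgen, a new agentic framework for long-video understanding. By integrating the video codec, designing a motion-aware agent, building a structural memory, and adding a hierarchical reasoning algorithm, it aims to improve both the understanding of detailed motion in videos and processing efficiency.

Details

Motivation: Existing long-video understanding methods fall short in detailed motion comprehension and efficient memory architecture. A solution is needed that exploits the video codec structure, captures local motion details, and manages memory efficiently.

Result: The method achieves superior Video Question Answering (VQA) performance on several video understanding benchmarks, including MotionBench and Egoschema.

Insight: The novel elements are: integrating the video codec (GOPs) into the understanding framework, with a motion agent and a GOP tree reasoning algorithm that strengthen local motion understanding; a structural memory mechanism with a coarse-to-fine zoom-in algorithm that uses memory efficiently; and a motion vector database that supports retrieval at multiple granularities. Starting from the video coding structure, these designs balance motion awareness with memory efficiency.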

Abstract: Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model’s ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.


[29] WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark cs.CVPDF

Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai

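TL;DR: This paper introduces WorldBench, a challenging and visually diverse multimodal reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs) across varied real-world settings. The benchmark builds a taxonomy of thousands of visual concepts across multiple domains (e.g., living things) and uses it to curate images from search engines and existing datasets so as to represent the visual world comprehensively. Through structured trial-and-error, the authors manually design challenging questions that frontier MLLMs fail to answer.

Details

Motivation: Many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs, whereas real-world applications require models to perform reliably across diverse settings. A more visually diverse benchmark is therefore needed to evaluate MLLMs.

Result: On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level.

Insight: The contribution is a benchmark that systematically collects diverse images guided by a taxonomy of visual concepts and exposes model weaknesses through manually designed challenging questions. Viewed objectively, it underlines the importance of visual diversity in building multimodal benchmarks and offers a new direction for evaluating model robustness in open-world settings.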

Abstract: In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.


[30] From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video cs.CVPDF

Jessy Lauer

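TL;DR: This paper presents a physics-free, end-to-end pipeline that predicts instantaneous 3D hip and knee contact forces from monocular video alone. It needs no markers, force plates, electromyography, or subject-specific imaging. By recovering parametric body meshes, encoding kinematic features, and decoding with a transformer, it matches the accuracy of subject-specific musculoskeletal simulations, and it demonstrates zero-shot inference on raw video as well as the generation of biomechanically plausible motion variants.

Details

Motivation: Joint contact forces are central to implant longevity, cartilage health, and rehabilitation outcomes, yet they can currently be measured only invasively in a small number of patients. The paper aims to develop a non-invasive, convenient method that predicts joint forces from monocular video, supporting retrospective analysis of clinical recordings, primary-care screening, and at-home rehabilitation tracking.

Result: Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the OrthoLoad database, the method matches the accuracy of subject-specific musculoskeletal simulations for the hip (0.32 ± 0.08 BW RMSE) and the knee (0.23 ± 0.03 BW RMSE), and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms previously published methods.

Insight: The novel elements are: adaptively modulating the transformer's pose stream at every layer with body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), which unifies hip and knee prediction in a single model; preserving accuracy with video features alone, which enables end-to-end inference on raw video; and combining the predictor with a generative motion prior to produce biomechanically plausible motion variants with reduced peak loading. Together these open a new route to non-invasive joint force estimation.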

Abstract: Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.
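
The accuracy figures above are expressed as RMSE in units of body weight (BW). The sketch below shows one plausible way such a number is obtained: compute the RMSE in newtons and divide by the subject's body weight. The abstract does not specify the exact procedure (for example, whether the error is taken per force component or on the resultant), so this scalar version is an assumption, and the force values and body weight are made up for illustration.

```python
import math


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length force series."""
    if len(predicted) != len(measured):
        raise ValueError("series must have the same length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_in_body_weights(predicted_newtons, measured_newtons, body_weight_newtons):
    """ASSUMED normalization: RMSE in newtons divided by body weight in newtons."""
    return rmse(predicted_newtons, measured_newtons) / body_weight_newtons


# Hypothetical numbers, not data from the paper.
measured = [1000.0, 2000.0]
predicted = [1100.0, 1900.0]
error_bw = rmse_in_body_weights(predicted, measured, body_weight_newtons=800.0)
print(error_bw)  # 0.125
```

Normalizing by body weight makes errors comparable across subjects of different mass, which matters when results are pooled over patients.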


[31] Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers cs.CV | cs.AI | cs.LGPDF

Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

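TL;DR: The paper proposes ViSAE, a mechanistic interpretability toolbox for understanding and steering the inner workings of Vision Transformers (ViTs) through concept circuits. ViSAE comprises a probing suite with 64K images and a 16K visually grounded concept vocabulary, algorithms that automatically recover ViT concept circuits, and applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%.

Details

Motivation: Despite their high accuracy, ViT predictions can be driven by spurious cues, so their inner workings need to be understood before safe deployment. Sparse autoencoders (SAEs) offer a promising lens for decomposing model representations into human-interpretable concepts, but applying them to ViTs is hampered by limited control over concept coverage and by subjective, non-scalable feature interpretation.

Result: ViSAE's probing suite improves concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. On WaterBirds, concept editing with ViSAE improves worst-group accuracy by 48.2%, outperforming existing methods by 23.8%.

Insight: The novel elements are a neuroscience-motivated concept circuit approach, automated concept reading and circuit tracing algorithms, and a scalable, visually grounded concept vocabulary. Viewed objectively, the work applies mechanistic interpretability tools to ViTs in a systematic way and steers model behavior effectively through concept editing, providing a practical framework for interpretable AI.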

Abstract: Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.
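
The headline WaterBirds number is a worst-group accuracy, meaning accuracy is computed separately for each group and the minimum is reported. A minimal sketch follows; the group names and the tiny prediction lists are hypothetical and are not taken from the paper or the dataset.

```python
from collections import defaultdict


def worst_group_accuracy(predictions, labels, groups):
    """Return (worst_group, worst_accuracy, per_group_accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for prediction, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(prediction == label)
    per_group = {group: correct[group] / total[group] for group in total}
    worst_group = min(per_group, key=per_group.get)
    return worst_group, per_group[worst_group], per_group


# Hypothetical toy data: groups pair the bird class with the background.
predictions = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 0]
groups = [
    "waterbird/water", "waterbird/water", "landbird/land",
    "waterbird/land", "waterbird/land", "landbird/land",
]

worst_group, worst_accuracy, per_group = worst_group_accuracy(
    predictions, labels, groups
)
print(worst_group, worst_accuracy)  # waterbird/land 0.5
```

In this toy case overall accuracy is 5/6, yet the worst group sits at 0.5, which is the kind of gap that average accuracy hides when a model leans on a spurious background cue.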


[32] RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video cs.CV | cs.GRPDF

Shichong Peng, Yanshu Zhang, Ke Li

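TL;DR: RigPAPR animates static neural point clouds from a fixed-viewpoint driving video. It auto-rigs the point cloud and drives it with linear blend skinning (LBS), avoiding the artifacts that conventional methods produce at joint boundaries. It relies on proximity attention point rendering (PAPR), which recomposes each pixel at render time from the positions of the deformed primitives, so that the surface re-forms naturally at the joints.

Details

Motivation: Existing methods (mesh proxies, or direct linear blend skinning of Gaussian splats) give each primitive a shape that is fixed in the canonical pose. When they animate a static neural point cloud, this leads to gap and spike artifacts at joint boundaries.

Result: On synthetic data, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+ dB PSNR, with cleaner joint-boundary renderings on both synthetic and real subjects.

Insight: The novelty is the use of the PAPR representation, which needs no per-primitive shape parameters and adapts to articulation by recomposing pixels dynamically at render time, thereby avoiding joint artifacts. The method also requires no mesh proxy, pose-dependent correction, or category template, and achieves auto-rigging and driving from a single fixed-viewpoint video.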

Abstract: Static neural point reconstructions capture a subject at high fidelity from posed images. Given such a reconstruction, we aim to animate it to follow a monocular fixed-viewpoint driving video of the subject, whether captured or produced by image-to-video (I2V) generation, and to recover a rigged, re-posable 3D asset. Existing methods deform Gaussian splats through direct linear blend skinning (LBS) or mesh proxies, both of which are prone to joint-boundary artifacts under articulation, even with per-primitive corrections. We trace the artifact to the representation: each splat carries an individual shape calibrated in the canonical pose to tile with its neighbours. Under rigid LBS, each splat moves with its bone but cannot bend, so the canonical tiling breaks at joint boundaries into gaps and spikes. Proximity attention point rendering (PAPR) instead carries no per-primitive shape; each pixel is recomposed at render time from the deformed primitives’ positions, so the surface re-forms naturally with the articulation. We present RigPAPR, which auto-rigs a static PAPR cloud and drives it under direct LBS from a single fixed-viewpoint video, without mesh proxy, pose-dependent correction, or category template. On synthetic subjects, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+dB PSNR, with cleaner joint-boundary renderings of both synthetic and real subjects.
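
Linear blend skinning, which the abstract relies on, moves each point by a weighted sum of its bones' rigid transforms: v' = Σ_i w_i (R_i v + t_i), with the weights summing to 1. The 2D sketch below illustrates only that formula. The bones, weights, and helper names are hypothetical, and nothing here reproduces RigPAPR's rigging or rendering.

```python
import math


def rigid_transform(angle_rad, tx, ty):
    """Build a 2D rigid transform as two rows of (rotation, translation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s, tx), (s, c, ty))


def apply_transform(transform, point):
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def linear_blend_skinning(point, weights, bone_transforms):
    """Blend the per-bone transformed positions of a point by its skin weights."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    x = y = 0.0
    for weight, transform in zip(weights, bone_transforms):
        px, py = apply_transform(transform, point)
        x += weight * px
        y += weight * py
    return (x, y)


# Hypothetical two-bone rig: bone 0 stays fixed, bone 1 rotates by 90 degrees.
bones = [rigid_transform(0.0, 0.0, 0.0), rigid_transform(math.pi / 2, 0.0, 0.0)]
rest_point = (1.0, 0.0)

rigid_point = linear_blend_skinning(rest_point, (1.0, 0.0), bones)
blended_point = linear_blend_skinning(rest_point, (0.5, 0.5), bones)
print(rigid_point)
print(blended_point)
```

A point bound entirely to one bone moves rigidly with it, while a point near a joint with mixed weights lands between the two bone-transformed positions. Only positions are blended this way; a primitive that also carries its own fixed shape has nothing in the formula to bend it.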


[33] MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models cs.CV | cs.AIPDF

Ryan D’Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun

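TL;DR: This paper presents MMBU, a large-scale multimodal biomedical understanding benchmark designed to comprehensively evaluate the perception capabilities of vision-language models in the biomedical domain. The benchmark covers 35 submodalities with rich structured metadata and includes several task types, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities.

Details

Motivation: Biomedical vision-language models hold great promise, but real-world use requires robust, fine-grained visual perception. Existing benchmarks are limited and cannot fully evaluate model performance across diverse biomedical scenarios.

Result: Evaluating 15 open-weight models and 2 frontier models shows that although medical adaptation yields measurable gains for some models, the high accuracy reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

Insight: The contribution is the largest and most comprehensive biomedical vision-language benchmark to date. Its systematic task design (ungrounded classification, grounded classification, and object detection) provides a new tool for probing model perception and shows why performance should be evaluated in complex, realistic settings.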

Abstract: Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.


[34] Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception cs.CVPDF

Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin

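TL;DR: The paper studies whether vision-language models can perceive surface slant from texture. Across multiple VLM families and model scales, both zero-shot and in-context prompting produce distinctive failures: slant is predicted at only a small set of anchor angles (e.g., 0°, ±25°, ±45°), with little dependence on the stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the problem, but residual anchoring persists.

Details

Motivation: To examine whether vision-language models, like humans and the unsupervised CNNs studied previously, show systematic, graded biases when perceiving surface slant from texture, and thereby to assess how well VLMs handle low-level geometric cues.

Result: Across multiple model families and scales, VLMs fail to show human-like graded perception and are confined to anchor predictions. Supervised fine-tuning helps only partially and does not fully resolve the anchoring.

Insight: The key finding is a failure at the representation-to-output language interface: VLMs may not lack a geometric encoding, but they cannot express it in a graded form. This offers a new perspective on the limitations of VLMs in low-level vision tasks.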

Abstract: Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.


[35] MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models cs.CVPDF

Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout

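TL;DR: This paper proposes MedSIGHT, a framework that gives medical large vision-language models (Med-LVLMs) structured, pixel-level understanding so that clinical reasoning can connect visual findings with semantic interpretation. It introduces a Region Perceiver module and a medical region codebook, encodes spatial information into the representation space of the language model, and generates discrete region codes as symbolic representations of anatomical and pathological regions, thereby unifying visual comprehension and image segmentation.

Details

Motivation: Existing Med-LVLMs have made notable progress in vision-language comprehension and medical image segmentation, but they struggle to unify the two capabilities, whereas clinical reasoning requires connecting visual findings with semantic interpretation.

Result: Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance on medical comprehension and segmentation tasks across diverse imaging modalities.

Insight: The novel elements are a Region Perceiver module that encodes spatial information directly into the language model's representation space, a medical region codebook that serves as a symbolic representation of anatomical and pathological regions, and a progressive training strategy that aligns these modules stably. Together they deliver end-to-end spatial grounding and unified comprehension and segmentation.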

Abstract: Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further propose adding a medical region codebook to the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, the codebook, and the LLM using our proposed progressive training strategy, which gradually and stably aligns these modules. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.


[36] VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation cs.CVPDF

Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang

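TL;DR: This paper proposes VideoSEG-O3, the first multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS). The framework emulates the human coarse-to-fine cognitive process, using a multi-turn temporal-spatial chain-of-thought to iteratively pinpoint critical intervals and keyframes and capture fine-grained details. It also introduces SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into token-level logits, designs a hierarchically decoupled thinking trace that decomposes the reasoning process during the RL stage, and constructs the VTS-CoT cold-start dataset.

Details

Motivation: Existing RVOS methods can only reason over fixed initial inputs and lack the ability to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos.

Result: The paper evaluates on the RVOS task, but the abstract reports no specific quantitative results (such as benchmark names, performance metrics, or comparisons with the state of the art).

Insight: The main contributions are: 1) a multi-turn reinforcement learning framework that emulates human coarse-to-fine reasoning; 2) SEG-aware logit calibration, which feeds segmentation quality back into the RL policy; 3) a decoupled thinking trace that hierarchically decomposes temporal, spatial, and linguistic reasoning; 4) the VTS-CoT cold-start dataset, built to support training.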

Abstract: Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released at https://github.com/Dmmm1997/VideoSEG-O3.


[37] AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO cs.CV | cs.LGPDF

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang

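TL;DR: This paper addresses two key limitations of flow-based GRPO for aligning text-to-image models, prompt selection and advantage estimation, and proposes the AdaGRPO algorithm. AdaGRPO uses an online curriculum filtering strategy to dynamically select prompts that match the model’s current capability, and cross-level advantage fusion to combine fine-grained intra-group advantages with macro-level global advantages, yielding more stable and efficient reinforcement learning training.

Details

Motivation: Current flow-based GRPO methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning efficacy. Their advantage estimation relies solely on intra-group statistics and lacks a global perspective for accurately measuring policy improvement. Both issues decouple the learning loop from the model’s current capability and leave critical blind spots.

Result: Experiments show that AdaGRPO consistently improves performance and significantly stabilizes GRPO training for flow models. As a lightweight, plug-and-play module, it integrates seamlessly with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS.

Insight: The novelty lies in bringing capability awareness into reinforcement learning for flow models: online curriculum learning matches prompt difficulty to the model dynamically, and cross-level advantage fusion gives a more comprehensive policy evaluation. Together they offer a more stable, adaptive training paradigm for preference alignment of flow models.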

Abstract: Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner’s current capability, suffering from critical blind spots at both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model’s proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.
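
The abstract names the two components but does not give their formulas, so the following is only a minimal sketch under stated assumptions. The intra-group advantage is the usual GRPO-style z-score within a prompt's sample group; the global advantage (a z-score against running statistics over all prompts), the linear mixing weight `lam`, and the reward band used by `keep_prompt` are all hypothetical choices made for illustration, not the authors' definitions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: z-score of each reward within its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fused_advantages(rewards, global_mean, global_std, lam=0.5, eps=1e-6):
    """Hypothetical fusion: linear mix of intra-group and global z-scores.

    global_mean / global_std are assumed to be running statistics of rewards
    collected across all prompts; lam is an assumed mixing weight.
    """
    intra = group_advantages(rewards, eps)
    glob = [(r - global_mean) / (global_std + eps) for r in rewards]
    return [(1.0 - lam) * a + lam * g for a, g in zip(intra, glob)]


def keep_prompt(rewards, low=0.2, high=0.8):
    """Hypothetical curriculum filter: keep prompts that are neither
    already solved nor hopeless, judged by the group's mean reward."""
    return low <= mean(rewards) <= high
```

With rewards in [0, 1], a group that scores 1.0 on every sample has zero intra-group advantage and would also be dropped by `keep_prompt`, which is one way the two components could keep training focused on prompts near the model's current learning boundary.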


[38] CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs cs.CVPDF

Fuchen Li, Xinyang Wang, Yahui Zhang, Yuhan Chen, Jiahong Guo

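TL;DR: This paper proposes CFRNet, a lightweight model for real-time blind face restoration on consumer embedded NPUs. The core idea is Cycle-Consistent Fixed-Point Training (CCFP), which trains the network to act as a fixed-point operator so that applying it repeatedly does not change the restored result, preserving quality while enabling efficient deployment.

Details

Motivation: Blind face restoration on consumer devices must balance image quality against speed and memory. Existing strong methods rely on operators that are hard to compile and quantize on small NPUs, while small convolutional restorers over-smooth and leave artifacts.

Result: On a 300-image test set, CFRNet reaches its best perceptual score at three cycles (LPIPS 0.250, 31% lower than one cycle) and the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the baseline models cannot be compiled to that chip.

Insight: CCFP combines progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss to make the network a fixed-point operator, so the cycle count k acts as a quality knob that needs no retraining. The approach also works with a plain CNN that is easier to deploy, and the model runs in real time on an in-car driver-monitoring board.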

Abstract: Blind face restoration on consumer devices has to balance image quality against speed and memory. Strong methods such as GFPGAN and CodeFormer give good perceptual quality, but they rely on large pretrained generative priors and on operators such as attention, codebook lookup, and style modulation that are hard to compile and quantize on the small neural processing units (NPUs) used in consumer hardware. Small convolutional restorers run fast enough, but they tend to over-smooth and to leave artifacts around the eyes, nose, and mouth. We present CFRNet, a 2.0M-parameter ResNet-style restorer for on-device use at $256\times256$, the common face-crop size on consumer NPUs. The main idea is Cycle-Consistent Fixed-Point Training (CCFP). Instead of training the network for one pass and then running it several times by hand, we train it to act as a fixed-point operator, so that applying it again to a restored face does not change the face. CCFP uses three training losses, namely progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss, and it adds no cost at inference. To compare fairly under our deployment limits, we retrain all baselines from scratch at the same $256\times256$ resolution. On a 300-image test set, CFRNet reaches the best perceptual score (LPIPS 0.250 at three cycles, which is 31% lower than one cycle) and also the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the same baselines cannot be compiled to that chip. The cycle count $k$ acts as a simple quality knob that needs no retraining: PSNR is best at $k=2$ and LPIPS keeps improving up to $k=3$. We further show that the same idea works with a plain CNN that is even easier to deploy, and we run the model in real time on an in-car driver-monitoring board.
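
To make the fixed-point idea concrete, here is a small sketch of what an idempotence penalty and the cycle-count knob could look like. The abstract does not specify the form of the loss, so the mean absolute difference between one and two applications is an assumption, and `restore` stands in for any image-to-image function operating on a flat list of pixel values.

```python
def apply_cycles(restore, image, k):
    """Run the restorer k times; k plays the role of the quality knob."""
    out = image
    for _ in range(k):
        out = restore(out)
    return out


def idempotence_loss(restore, image):
    """Assumed form: mean |f(f(x)) - f(x)|, which is zero exactly when a
    second application leaves the first result unchanged."""
    once = restore(image)
    twice = restore(once)
    return sum(abs(a - b) for a, b in zip(once, twice)) / len(once)


# Toy operators: clamping is idempotent, halving is not.
clamp = lambda img: [min(1.0, max(0.0, p)) for p in img]
halve = lambda img: [0.5 * p for p in img]
```

A perfectly idempotent operator such as `clamp` gives a loss of zero, while `halve` keeps changing its own output and is penalized; training against such a term is what would let extra cycles at inference refine rather than drift.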


[39] MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models cs.CV | cs.AIPDF

Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang

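TL;DR: This paper proposes MotionEnhancer, which uses motion priors extracted from a video diffusion model as auxiliary supervision and applies attention alignment to strengthen a vision-language model’s understanding of fine-grained motion. The method consists of two parameter-free modules and improves performance on motion-level video understanding tasks without additional training parameters or changes to existing architectures.

Details

Motivation: Current vision-language models excel at event- or story-level video understanding but are limited in capturing fine-grained motion details, whereas video diffusion models are adept at modeling dynamic motion patterns. The paper aims to use motion priors from video diffusion models to strengthen motion understanding in vision-language models.

Result: On two motion-level video understanding benchmarks, MotionEnhancer yields consistent improvements over state-of-the-art vision-language models, especially on motion-related metrics.

Insight: The novelty is distilling motion priors from a video diffusion model through parameter-free attention-alignment modules (MHS and MTTI), offering a scalable solution for motion understanding with no extra parameters or architectural modifications.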

Abstract: The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.


[40] FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness cs.CVPDF

Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan

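TL;DR: This paper proposes a Frequency-Selective Dynamic Visual Sensing paradigm (FS-DVS), which places a learnable spatial filter before the event triggering process to mimic the aggregation mechanism of retinal ganglion cells. It targets the information incompleteness of conventional dynamic vision sensors, which lack a contrast sensitivity function and sensitivity to mid-spatial frequencies.

Details

Motivation: Conventional dynamic vision sensors (DVS) rely on a per-pixel independent triggering mechanism and ignore the spatial integration performed by biological retinal ganglion cells. They therefore lack the contrast sensitivity function and its inherent sensitivity to mid-spatial frequencies, which causes sub-threshold signal loss and information incompleteness.

Result: FS-DVS achieves substantial performance gains in object detection and action recognition. The learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistent with the human contrast sensitivity function, and they converge to human-like characteristics across different tasks.

Insight: The novelty is placing a learnable spatial filter strictly before event triggering and optimizing it end-to-end to mimic biological aggregation. The approach achieves selective information enhancement with high noise resilience and offers a biologically plausible blueprint for next-generation neuromorphic sensors, in contrast to simply increasing sensor sensitivity or relying on post-processing.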

Abstract: Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.
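
The ordering "filter first, then threshold" is the crux of the paradigm, and a one-dimensional toy can illustrate why it matters. This is a sketch under assumptions: the signal is a row of per-pixel log-intensity changes, the kernels are hand-picked (the paper learns its filter end-to-end and reports center-surround shapes, which this toy does not reproduce), and the threshold value is arbitrary.

```python
def filter_1d(signal, kernel):
    """Zero-padded 1D correlation with an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def trigger_events(log_change, kernel, threshold):
    """Filter BEFORE thresholding; emit +1 / -1 / 0 per pixel."""
    filtered = filter_1d(log_change, kernel)
    return [1 if v >= threshold else -1 if v <= -threshold else 0
            for v in filtered]


change = [0.0, 0.1, 0.1, 0.1, 0.0]   # every pixel is below the threshold
delta = [0.0, 1.0, 0.0]              # per-pixel triggering (conventional DVS)
pooled = [0.5, 1.0, 0.5]             # hypothetical spatially aggregating kernel
```

With a threshold of 0.18, the delta kernel produces no events at all, while the pooled kernel fires at the middle pixel because neighbouring sub-threshold changes add up. That is the kind of sub-threshold signal loss the abstract attributes to independent per-pixel triggering.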


[41] Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis cs.CVPDF

Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You

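TL;DR: This paper proposes Multi-FRuGaL, a Multimodal Flexible Redundancy-aware decomposed GAted Learning framework for cancer diagnosis and prognosis. Using a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective, it separates redundant from modality-specific complementary signals under missing data, selectively upweighting informative modalities and suppressing redundant or noisy inputs.

Details

Motivation: Real-world patient data (such as radiology, pathology, text reports, and structured clinical information) are frequently incomplete, with missing or sparsely acquired modalities, which limits the effectiveness of standard multimodal fusion approaches.

Result: Evaluated on two multimodal head and neck cancer datasets, HANCOCK (N = 763, five modalities) and HECKTOR (N = 588, three modalities), Multi-FRuGaL outperforms the evaluated baselines across multiple tasks (5-year survival, 2-year recurrence, HPV status classification). For example, it improves AUC from 0.601 to 0.8496 for survival and from 0.672 to 0.8102 for recurrence, and reaches 0.975 AUC for HPV prediction on HECKTOR.

Insight: The novelty is a decomposition-aware, adaptive gated intermediate-fusion framework that handles missing modalities by explicitly separating redundant and complementary signals. Its gating mechanism and information-aware objective strengthen robustness and discriminative power, so it learns effective multimodal representations even under severe missing-modality conditions.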

Abstract: Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.
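
One simple way a gated fusion can stay well-defined when modalities are absent is to normalize gate scores over the present modalities only. The sketch below illustrates that idea and nothing more: the masked softmax, the function names, and the use of `None` for a missing embedding are assumptions for illustration, and the paper's decomposition layer and information-aware objective are not modelled.

```python
import math


def masked_gate_weights(logits, present):
    """Softmax over gate logits, restricted to modalities that are present."""
    active = [l for l, p in zip(logits, present) if p]
    if not active:
        raise ValueError("at least one modality must be present")
    m = max(active)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if p else 0.0 for l, p in zip(logits, present)]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings, logits, present):
    """Weighted sum of the available per-modality embeddings."""
    weights = masked_gate_weights(logits, present)
    dim = len(next(e for e, p in zip(embeddings, present) if p))
    fused = [0.0] * dim
    for emb, w, p in zip(embeddings, weights, present):
        if p:
            for d in range(dim):
                fused[d] += w * emb[d]
    return fused
```

Because a missing modality receives exactly zero weight, a large gate logit for an absent input cannot leak into the fused representation, and the remaining weights still sum to one.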


[42] LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification cs.CV | cs.LGPDF

Yonghan Shin, Won-Ki Jeong

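TL;DR: This paper proposes LRMIL, an efficient low-resolution multiple instance learning framework for whole slide image classification. Through high-resolution knowledge distillation, it transfers high-resolution knowledge to low-resolution representations, so inference only processes low-resolution patches and computational cost drops substantially.

Details

Motivation: Existing MIL methods rely on extracting and encoding high-resolution patches, which has two key limitations: they struggle to capture global visual cues at lower magnifications, and they incur substantial computational overhead because each slide contains a massive number of high-resolution patches.

Result: Extensive experiments on multiple WSI benchmarks show that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference.

Insight: The novelty is a two-stage distillation strategy: patch-level cross-resolution distillation first aligns the embeddings, then slide-level knowledge distillation trains the low-resolution student model. This offers a practical and scalable solution for WSI analysis in clinical pathology.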

Abstract: Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.


[43] EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation cs.CV | cs.AIPDF

Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang

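TL;DR: This paper proposes EgoPressDiff, a multimodal video diffusion framework for egocentric UV-domain hand-pressure estimation. By combining hand pose, 3D mesh vertices, and depth information, it generates continuous pressure fields and addresses the quantization errors and temporal inconsistencies of existing methods. Evaluated on the EgoPressure dataset, it improves Volumetric IoU by over 34% relative to the baseline, a state-of-the-art result.

Details

Motivation: Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies.

Result: In the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline while reducing MAE and maintaining high temporal accuracy.

Insight: The contributions include a multimodal conditioning strategy (a PoseNet and a Vertex Encoder that extract hand-pose and 3D-mesh features) and a Distribution-Calibrated Spatial Layer (which aligns the statistical properties of heterogeneous features before fusion), keeping the pressure fields physically grounded and improving generation quality.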

Abstract: Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.


[44] Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows cs.CV | cs.CRPDF

Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You

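TL;DR: This paper proposes the Unified Visual Safety Regulator (UVR), a training-free safe generation framework that addresses harmful content generation in image generation with multimodal-attention diffusion transformers (DiTs), particularly in image editing tasks. By analyzing information flow dynamics in multimodal attention, it identifies a task-independent start-up stage in which unsafe semantics can be localized, then mitigates unsafe generation through unified attention modulation and restriction of harmful information flow.

Details

Motivation: Existing safety mechanisms are designed mainly for text-to-image generation or U-Net-based architectures. Preventing harmful content generation in unified DiT-based frameworks, especially in image-to-image editing tasks, remains a critical challenge.

Result: Experiments across various concepts show that UVR achieves erase rates of 91% and 77% in image synthesis and editing tasks, respectively, reaching state-of-the-art safety performance with minimal degradation in visual quality and fidelity.

Insight: The novelty is analyzing multimodal attention dynamics from an information-flow perspective, identifying a task-independent start-up stage that localizes unsafe semantics early, and proposing a unified, training-free attention modulation method that restricts harmful information flow. This provides a new, task-independent intervention perspective for safe generation in DiT architectures.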

Abstract: Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.


[45] Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors cs.CVPDF

Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang

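TL;DR: This paper proposes Stream3D-VLM, a vision-language model that supports online 3D spatial understanding. It uses autoregressive streaming control modeling to decide when to respond and a lightweight Visual-Spatial Feature Integration module to incrementally inject geometry priors. To reduce long-context decoding overhead, it proposes a plug-and-play Geometry-Adaptive Voxel Compression module. It also builds a dataset of over 1M online spatio-temporal 3D QA pairs and a benchmark.

Details

Motivation: Most existing 3D large models work offline and require complete scene observations or predefined video clips, so they cannot meet the needs of real-time streaming video processing. The paper aims to enable real-time 3D spatial understanding from streaming video.

Result: Extensive experiments show that the approach significantly outperforms both proprietary and open-source models on online and offline 3D spatial understanding, reasoning, and grounding tasks.

Insight: The novelty is formulating 3D understanding as an online streaming problem and proposing autoregressive response control, incremental geometry-prior injection, and an efficient visual token compression mechanism. The pipeline for building a large-scale streaming 3D-language dataset is also a useful reference.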

Abstract: Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts an autoregressive streaming control modeling based on the LLM’s next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/


[46] Diagnosing Visual Ignorance in Vision-Language Models cs.CV | cs.LGPDF

Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

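TL;DR: This paper studies the reliance of vision-language models (VLMs) on language priors from both mechanistic and behavioral perspectives. Internally, it finds a multi-stage bottleneck: intermediate layers fail to effectively retrieve visual information, and later layers further suppress surviving visual signals in favor of text-space biases. Externally, a progressive visual decay metric shows that many samples in current benchmarks remain answerable after visual information is severely destroyed, indicating that existing benchmarks inadvertently reward visual ignorance.

Details

Motivation: VLMs frequently rely on language priors, producing answers that appear confident but lack visual grounding. Although this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood.

Result: Experiments on 12 visual question-answering benchmarks and 3 representative VLMs show that a substantial fraction of samples remain answerable under severe or total visual obfuscation, revealing that current benchmarks can inadvertently reward visual ignorance.

Insight: The novelty is combining counterfactual layer replacement with supervised layer-wise MLP probing to trace how visual semantics and language priors compete in the language decoder, and introducing a progressive visual decay metric based on multi-step Gaussian blurring to quantify visual ignorance. The study stresses that future training distributions and evaluation protocols should be built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.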

Abstract: Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
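
The external diagnostic can be pictured as a loop over increasing blur strengths that checks whether the model's answer ever changes. The sketch below is a guess at the general shape of such a metric, not the paper's definition: `answer_fn` and `blur_fn` are hypothetical callables standing in for a VLM query and a Gaussian blur, and "survival" is defined here as the number of consecutive blur levels at which the answer still equals the clean-image answer.

```python
def decay_survival(answer_fn, blur_fn, image, question, sigmas):
    """Count consecutive blur levels at which the answer is unchanged."""
    reference = answer_fn(image, question)
    survived = 0
    for sigma in sigmas:
        if answer_fn(blur_fn(image, sigma), question) != reference:
            break
        survived += 1
    return survived


def invariant_fraction(survival_counts, num_levels):
    """Share of examples whose answer survived every blur level."""
    hits = sum(1 for c in survival_counts if c == num_levels)
    return hits / len(survival_counts)


# Toy stand-ins: the "image" is a sharpness number and blurring lowers it.
toy_blur = lambda img, sigma: img - sigma
prior_only = lambda img, q: "yes"                        # ignores the image
vision_based = lambda img, q: "cat" if img > 2 else "unknown"
```

An example whose answer survives every level is a candidate for being answerable from language priors alone; a high `invariant_fraction` on a benchmark would then suggest that the benchmark does not strongly require looking at the image.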


[47] Lighting-Aware Representation Learning under Controllable Lighting Variation cs.CV | cs.LGPDF

Lizhen Zhu, Charantej Reddy Pochimireddy, James Z Wang, Brad Wyble

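TL;DR: This paper proposes a lighting-aware representation learning framework that extends contrastive learning with an auxiliary objective capturing illumination-dependent variation in rendered scenes, so the model jointly learns representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. Evaluated on image classification and object detection across the ImageNet, ExDark, and PASCAL VOC benchmarks, the method consistently improves downstream performance and shows broad applicability.

Details

Motivation: Illumination variation is a major challenge for visual representation learning. Existing methods typically use data augmentation to make models invariant to lighting changes but do not explicitly model lighting information during training. Inspired by theories of human vision, the paper treats illumination variation as an explicit training signal rather than a nuisance factor to be suppressed.

Result: On the ImageNet, ExDark, and PASCAL VOC benchmarks, the proposed lighting-aware training consistently improves downstream image classification and object detection performance over standard contrastive learning baselines under the same architecture and training budget, and it shows promising performance in supervised learning frameworks and under simpler lighting variation.

Insight: The novelty is incorporating illumination variation into contrastive learning as an explicit training signal, with an auxiliary objective that models illumination-dependent variation. This improves robustness and adaptability in complex visual environments while preserving semantic consistency, and offers a new approach to handling lighting changes.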

Abstract: Variations in illumination remain a major challenge for visual representation learning, as they induce substantial appearance changes both across and within environments. While existing approaches typically address this issue through data augmentations that encourage models to become invariant to lighting changes, such strategies do not explicitly model lighting information during learning. Inspired by theories of human vision, we propose a lighting-aware representation learning framework that incorporates illumination variation as an explicit training signal rather than a nuisance factor to be suppressed. Our method extends contrastive learning by introducing an auxiliary objective that captures illumination-dependent variation in rendered scenes, enabling the model to jointly learn representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. We evaluate the proposed model on image classification and object detection tasks across the ImageNet, ExDark, and PASCAL VOC benchmarks. Results demonstrate that the proposed lighting-aware training consistently improves downstream performance over standard contrastive learning baselines, while maintaining the same architecture and training budget. Furthermore, our approach shows promising performance in supervised learning frameworks and under settings involving simpler lighting variation, suggesting broad applicability beyond complex illumination scenarios. These results indicate its potential to enhance model robustness and adaptability in complex visual environments as well as in more conventional image processing tasks.


[48] Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy cs.CV | cs.AIPDF

Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu

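TL;DR: This paper proposes DirectAnimator, a framework that learns animation generation directly from driving videos and avoids reliance on error-prone pose estimators. It introduces a Driving Cue Triplet (pose, face, location) that captures motion, expression, and alignment, and adopts the Same2X training strategy to align cross-identity features, enabling reliable control during denoising.

Details

Motivation: Existing methods based on pose estimators are prone to errors under occlusion or complex poses, so the paper bypasses intermediate representations and learns more robust animation generation directly from raw driving videos.

Result: In extensive experiments, DirectAnimator achieves state-of-the-art visual quality and identity preservation, is robust to occlusions and complex articulation, and requires fewer computational resources.

Insight: The contributions include the semantically rich Driving Cue Triplet representation, the fusion mechanism of the CueFusion DiT block, and the Same2X training strategy, which regularizes cross-identity feature alignment with same-identity data, accelerating convergence and improving generalization.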

Abstract: Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.


[49] SVHighlights: Towards Extremely Long Sport Video Highlight Detection cs.CV | cs.MMPDF

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

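TL;DR: This paper introduces SVHighlights, the first highlight detection benchmark for extremely long sports videos (each exceeding one hour), and proposes TF-SELECTOR, a training-free segment-based baseline. The method divides each video into semantically coherent segments and uses a large language model with multimodal inputs to predict segment-level saliency scores, outperforming existing methods on the SVHighlights benchmark.

Details

Motivation: Most existing highlight detection methods are limited to short videos, and there is no suitable benchmark for long videos (such as those exceeding one hour). Existing models also generalize poorly to long videos, and their clip-level scoring lacks the broader context needed to identify highlights.

Result: On the proposed SVHighlights benchmark (320 videos with an average duration of 2 hours), TF-SELECTOR outperforms Video Temporal Grounding (VTG)-tuned baselines on most metrics, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU.

Insight: The contributions are the first benchmark for highlight detection in extremely long sports videos and a training-free segment-level scoring strategy that merges shots into semantic segments and uses a large language model to fuse multimodal inputs (visual captions, transcripts, audio volume), addressing the challenge of context modeling in long videos.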

Abstract: While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.
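
For readers unfamiliar with the reported metrics, the sketch below shows one common way temporal IoU and a top-1 hit could be computed over (start, end) intervals in seconds. The abstract does not define HIT@1, HIT@K, or IoU for this benchmark, so the definitions here (interval overlap over union, and "the top-scored segment overlaps some ground-truth highlight") are assumptions, and the function names are hypothetical.

```python
def interval_iou(pred, truth):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_1(segments, scores, highlights):
    """Assumed HIT@1: does the highest-scored segment overlap a highlight?"""
    best = max(range(len(segments)), key=lambda i: scores[i])
    return any(interval_iou(segments[best], h) > 0 for h in highlights)
```

Because labels come from matching full-length videos against official highlight reels, interval-level measures like these fit naturally; the exact matching thresholds used by the benchmark may differ from this sketch.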


[50] When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness cs.CVPDF

Sunoh Kim, Daeho Um

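TL;DR: This paper proposes Multi-view guided Adaptive Counterattack (MAC), a method for improving the robustness of CLIP against adversarial perturbations at test time. It constructs multiple augmented views of the input image, adaptively scales the counterattack intensity for each view based on its estimated corruption degree, and aggregates the views to produce a robust final prediction.

Details

Motivation: Existing test-time counterattack (TTC) methods are fragile under strong attacks: TTC relies on a single corrupted original view and uses a noise-driven hard-gating scheme that cannot adapt to varying corruption severity.

Result: Extensive experiments across 20 datasets and diverse attack scenarios show that MAC substantially improves robustness while its tuning-free design preserves high inference speed and memory efficiency.

Insight: The novelty is multi-view augmentation combined with corruption-aware soft weighting, which adaptively scales counterattack intensity to resist strong adversarial perturbations more effectively and improves the test-time robustness of CLIP.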

Abstract: Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP’s robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks across multiple views with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine the corrupted embeddings of these views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at https://github.com/sunoh-kim/MAC.


[51] SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models cs.CV | cs.AIPDF

Sunoh Kim, Daeho Um

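TL;DR: This paper proposes SS-TPT, a test-time prompt tuning method that improves the robustness of vision-language models such as CLIP under adversarial perturbations. It scores each augmented view for stability and suitability and uses these scores to guide both adaptation and inference, substantially improving robustness while maintaining high throughput.

Details

Motivation: Existing test-time adaptation methods based on multi-view augmentation improve robustness but incur heavy computation and slowdowns, creating a robustness-throughput trade-off.

Result: Extensive experiments across multiple datasets and varying numbers of views show that SS-TPT significantly outperforms prior state-of-the-art methods and achieves a better robustness-throughput trade-off.

Insight: The novelty lies in two complementary scores, stability and suitability, for assessing the quality of augmented views, and in the SS-guided consistency loss and SS-weighted prediction built on them, which amplify trustworthy views and suppress corrupted ones.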

Abstract: Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.
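
The SS-weighted prediction mentioned in the abstract can be pictured as a weighted average of per-view class probabilities. The sketch below is only an illustration under stated assumptions: the function name `ss_weighted_prediction`, the normalised-product rule for combining the two scores, and the toy numbers are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def ss_weighted_prediction(view_probs, stability, suitability):
    """Average per-view class probabilities using weights built from the two scores.

    Combining the scores by a normalised product is an assumption of this
    sketch; it requires at least one view with a positive product.
    """
    weights = np.asarray(stability, dtype=float) * np.asarray(suitability, dtype=float)
    weights = weights / weights.sum()
    return weights @ np.asarray(view_probs, dtype=float)

# Three hypothetical augmented views of one image, two classes.
view_probs = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]
stability = [1.0, 0.0, 1.0]      # the second view is treated as unstable
suitability = [1.0, 1.0, 1.0]
fused = ss_weighted_prediction(view_probs, stability, suitability)
print(fused)
```

With a zero weight on the second view, the fused prediction is driven entirely by the two views regarded as trustworthy, which is the amplifying and suppressing behaviour the abstract describes.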


[52] DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection cs.CVPDF

Abhishek Ameta, Sayan Banerjee, Shreyas Pandith, Harshit, Ankita Chatterjee

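TL;DR: This paper proposes DRIFT, which recasts AI-generated image detection as learning a structured invariance manifold of real images. Built on a frozen vision foundation model, it introduces lightweight projection heads that decompose the representation space into complementary robust and fragile subspaces, and a structured ordering margin that enforces hierarchical separation between physical invariance and edit-induced variability, so that detection becomes a margin-violation test.

Details

Motivation: Existing training-free detectors based on robustness gaps rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task, which makes unseen generators in open-world settings hard to handle.

Result: Extensive experiments show strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.

Insight: The core idea is to formalize detection as learning a structured invariance manifold of real images. Decomposing the representation space and adding a structured ordering margin explicitly models and separates invariance to physical transformations from sensitivity to edits, improving interpretability and generalization.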

Abstract: The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task. We instead formulate AI-generated image detection as learning a structured invariance manifold of real images under one-class supervision. Building upon a frozen VFM, we introduce lightweight projection heads that decompose representation space into complementary robust and fragile subspaces. The robust subspace is explicitly trained to suppress variations induced by physically plausible imaging transformations, approximating tangent directions of a real-image manifold, while the fragile subspace retains sensitivity to edit-like perturbations. A structured ordering margin enforces hierarchical separation between physical invariance and edit-induced variability, enabling detection as a margin-violation test relative to the learned manifold. At inference, multi-scale patch-wise drift under both transformation families yields a dual-channel invariance signature and interpretable localization. Extensive experiments demonstrate strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.


[53] From Vision to Text: A Compact Multimodal Approach for Robust, Cross-Domain Presentation Attack Detection on ID Cards cs.CVPDF

Qingwen Zeng, Juan E. Tapia, Sneha Das, Christoph Busch

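TL;DR: This paper proposes a compact multimodal model, built on new generative and discriminative blocks, that combines visual and textual data for presentation attack detection (PAD) on genuine and synthetic ID card images. It finds that multimodal models generalize well after supervised fine-tuning but fail in zero-shot settings, and stresses that model capacity and real-world data are essential for reliable PAD, while existing synthetic datasets may not reflect real-world challenges.

Details

Motivation: PAD on ID cards suffers from limited data due to privacy restrictions and from cross-domain shift; the work aims to improve detection robustness through a multimodal approach.

Result: The model generalizes strongly after supervised fine-tuning but fails in zero-shot settings. No specific benchmarks or SOTA comparisons are mentioned, but the study stresses that existing synthetic datasets do not adequately reflect real-world challenges.

Insight: The contributions are a compact multimodal architecture combining visual and textual information and a critical assessment of the limits of synthetic data. The authors call for re-evaluating synthetic data as a benchmark and for more realistic, diverse datasets to advance PAD research.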

Abstract: Cross-domain shifts challenge Presentation Attack Detection (PAD) on ID Cards, given the restricted data available due to privacy concerns. This work proposes a compact multimodal model, based on new generative and discriminative blocks, which combines visual and textual data for PAD on genuine and synthetic ID images. While multimodal models exhibit strong generalisation after supervised fine-tuning, they fail in zero-shot settings. Our findings underscore that model capacity and real-world data are essential for reliable PAD, while existing synthetic datasets may not reflect real-world challenges. We argue for a re-evaluation of synthetic data as a benchmark and emphasise the need for more realistic, diverse datasets to advance PAD research.


[54] CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection cs.CVPDF

Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang

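TL;DR: This paper proposes CL-CLIP, a CLIP-based continual learning framework that addresses catastrophic forgetting in continual object detection (COD). Through cost-volume-guided category decoupling, it combines the zero-shot generalization of open-vocabulary detectors with continual learning ability, preserving detection performance on previously learned categories as new ones are introduced.

Details

Motivation: CLIP-based open-vocabulary detectors generalize well zero-shot, but when continually updated on new categories they suffer severe catastrophic forgetting and lose their previously calibrated detection ability. A method is needed to give these detectors better continual learning ability.

Result: Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and is competitive with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.

Insight: The novelty is using the CLIP image-text similarity cost volume as a zero-shot spatial prior that decomposes shared region features into class-specific pathways, which a Multi-Expert RoI head then processes; this category decoupling mitigates forgetting. It suggests a new way to combine vision-language pretraining with continual learning.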

Abstract: Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
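
To make the cost volume concrete, the sketch below computes cosine similarities between visual tokens and class text embeddings and reshapes them into one dense response map per category, following the definition given in the abstract. The function name `cost_volume`, the array shapes, the row-major token layout, and the random inputs are assumptions for illustration; this is not the authors' implementation and does not cover the Multi-Expert RoI head.

```python
import numpy as np

def cost_volume(visual_tokens, text_embeds, grid_hw):
    """Cosine-similarity cost volume between visual tokens and class text embeddings.

    visual_tokens: (H*W, D) patch features, assumed to be in row-major order.
    text_embeds:   (C, D) class text embeddings.
    grid_hw:       (H, W) spatial layout of the tokens.
    Returns an array of shape (C, H, W): one response map per category.
    """
    h, w = grid_hw
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = v @ t.T                      # (H*W, C) cosine similarities
    return sim.T.reshape(-1, h, w)     # (C, H, W)

# Random stand-ins for CLIP features: a 4x4 token grid, 3 classes, 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4 * 4, 8))
texts = rng.normal(size=(3, 8))
cv = cost_volume(tokens, texts, (4, 4))
print(cv.shape)
```

Each slice `cv[c]` can then be read as a spatial prior for category `c`, which is the role the abstract assigns to the cost volume when it routes shared region features into class-specific pathways.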


[55] When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT cs.CV | cs.AIPDF

Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox

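TL;DR: This paper studies how input dimensionality (2D, 2.5D, 3D) affects the performance and resource consumption of convolutional neural networks (CNNs) and Vision Transformers (ViTs) in lung CT analysis. Under a fixed training protocol, the 2.5D CNN offers the most favorable trade-off among performance, stability, and computational efficiency, whereas 3D CNNs show threshold instability and ViTs exhibit degenerate predictions.

Details

Motivation: 3D models are widely assumed to be preferable for volumetric medical imaging, but their practical value depends on whether performance gains justify the added computational cost and complexity. Rather than proposing a new architecture, the paper examines how input dimensionality affects model behavior.

Result: On a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, the 2.5D CNN shows the most favorable discrimination-stability trade-off in the comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. 3D CNNs show threshold instability, and transformers exhibit degenerate predictions such as all-positive predictions. Because the confidence intervals are wide and overlapping, the results are presented as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims.

Insight: For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off among performance, stability, and computational efficiency than full 3D representations. The study offers a systematic resource-performance frontier analysis and stresses that pursuing 3D modeling by default may not be optimal in medical imaging; computational cost and model stability must be weighed together.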

Abstract: Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.


[56] Don’t Pause: Streaming Video-Language Synchrony for Online Video Understanding cs.CV | cs.AIPDF

Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

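TL;DR: This paper proposes a new paradigm, Streaming Video-Language Synchrony (SVLS), to address the problem that online Video Large Language Models (Video-LLMs) pause video perception while generating responses in streaming scenarios, breaking real-time synchrony. The authors build LyraV, a live streaming assistant based on a hierarchical control framework. Its core innovations are the training-free Frame-Driven Transition Controller (FDTC) and the plug-and-play lightweight Streaming Token Pacer (SToP), which together interleave video frames with generated word tokens at a fine granularity.

Details

Motivation: Existing online Video-LLMs typically pause video perception while generating a response during streaming. This breaks real-time video-language synchrony, causes stutters, and hinders seamless human-AI interaction.

Result: Extensive experiments on five online and three offline benchmarks show that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, reaching 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS.

Insight: The main innovations are a training-free controller for high-level semantic decisions (FDTC) and a predictive module that dynamically adjusts the language generation rate (SToP), which enable per-frame incremental, sub-budget decoding and thus fine-grained interleaving of video input with language generation. An interesting observation is that the model shows dynamic reasoning over streaming tokens, continuously interpreting and "thinking" alongside the visual input.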

Abstract: Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone’s general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and “thinking” alongside visual input.
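
The per-frame, sub-budget decoding described in the abstract can be illustrated with a toy scheduler that pairs each incoming frame with only as many tokens as fit into one frame interval. Everything here is a simplifying assumption: the function name `pace_tokens`, the fixed per-token latency, and the 250 ms and 100 ms values are hypothetical, and a constant budget stands in for SToP, which the abstract describes as a predictive module. The FDTC decision logic is not modelled.

```python
def pace_tokens(frames, tokens, frame_interval_ms, token_latency_ms):
    """Pair each incoming frame with the chunk of tokens decodable before the next frame."""
    # Number of tokens that fit into one frame interval (0 if a token is slower than a frame).
    budget = frame_interval_ms // token_latency_ms
    schedule, cursor = [], 0
    for frame in frames:
        chunk = tokens[cursor:cursor + budget]   # never more than the budget allows
        cursor += len(chunk)
        schedule.append((frame, chunk))
    return schedule

frames = ["frame0", "frame1", "frame2"]
tokens = ["the", "player", "scores", "a", "goal"]
schedule = pace_tokens(frames, tokens, frame_interval_ms=250, token_latency_ms=100)
for frame, chunk in schedule:
    print(frame, chunk)
```

The sentence is spread over three frame intervals instead of blocking perception until it is complete, which is the interleaving behaviour the abstract attributes to LyraV.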


[57] Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets cs.CV | cs.AIPDF

Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

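TL;DR: This paper proposes ZeroSight, a new benchmark for Zero-Shot Composed Image Retrieval (ZS-CIR). It targets two problems in existing datasets: reference and target images that are completely unrelated because of noisy image sources, and the failure to achieve a true zero-shot scenario. The benchmark comprises a dataset of visually and semantically consistent reference-target pairs extracted from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. The paper also proposes SC4CIR, a training-free symmetric-consistency method driven by a multimodal large language model (MLLM) that effectively identifies hard negatives and improves various CIR methods in a plug-and-play manner.

Details

Motivation: Existing ZS-CIR datasets have two main problems. First, reference and target images are completely unrelated because of noisy sources. Second, they use public image datasets on which models such as CLIP have been trained, so the scenario is not truly zero-shot. Both bias the evaluation of CIR methods.

Result: Experiments on 27 methods show that current ZS-CIR datasets and evaluation metrics inflate retrieval performance. The proposed SC4CIR is a plug-and-play module that significantly improves various CIR methods.

Insight: The main contributions are: 1) a truly zero-shot ZS-CIR benchmark (ZeroSight), whose data comes from videos published after March 31, 2022 so that it is not in CLIP's pre-training data, with reference-target consistency ensured by video frames and LLM-assisted methods; 2) a training-free, MLLM-driven method based on symmetric consistency checks (SC4CIR) that effectively identifies hard negatives and applies broadly. Together these offer a new benchmark and technique for evaluating and improving CIR methods in genuine zero-shot settings.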

Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP’s pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.


[58] STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation cs.CV | cs.AI | cs.CE | cs.LGPDF

Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

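TL;DR: This paper proposes STREAM, a framework for digital histopathology image generation. It uses the patch-token features of a pretrained histopathology Vision Foundation Model (VFM) as the latent space and is the first to apply Riemannian flow matching in this domain. It consists of a stochastic perturbation stage that establishes rectifiability on the unit hypersphere and a novel anisotropic decoder, achieving state-of-the-art generation performance on breast and colorectal cancer datasets.

Details

Motivation: Synthetic histopathology image generation addresses key challenges such as patient privacy and the need for large-scale training data. Existing methods use pretrained VFMs as conditioning signals, which leads to "conditioning collapse": the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. The paper therefore uses the features of a pretrained histopathology VFM directly as the latent space.

Result: STREAM achieves state-of-the-art (SOTA) reconstruction and generation performance on breast and colorectal cancer datasets.

Insight: The main innovations are: 1) using the normalized patch-token features of a pretrained histopathology VFM, which lie on the unit hypersphere, as the latent space and introducing Riemannian flow matching; 2) a novel anisotropic decoder that allocates robustness and fidelity according to the energy directions of the velocity-field Jacobian. Viewed objectively, combining geometric (Riemannian manifold) considerations with domain-specific (histopathology VFM) feature representations is a promising direction.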

Abstract: Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to “conditioning collapse,” where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
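
For readers unfamiliar with flow matching on a sphere, the sketch below shows the textbook great-circle geodesic between two unit vectors and its tangent velocity, which is the kind of path a Riemannian formulation on the unit hypersphere commonly builds on. It is a generic illustration only: the function name is hypothetical, and STREAM's bridge-type stochastic perturbation, per-token rectification, and anisotropic decoder are not reproduced here.

```python
import numpy as np

def geodesic_point_and_velocity(x0, x1, t):
    """Point x_t on the great-circle arc from x0 to x1 and its tangent velocity.

    x0 and x1 are unit vectors that are neither identical nor antipodal
    (otherwise sin(theta) is zero and the formula is undefined).
    """
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))   # angle between endpoints
    s = np.sin(theta)
    xt = (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / s
    # Time derivative of xt; it is tangent to the sphere at xt and has norm theta.
    vt = theta * (-np.cos((1 - t) * theta) * x0 + np.cos(t * theta) * x1) / s
    return xt, vt

x0 = np.array([1.0, 0.0, 0.0])
x1 = np.array([0.0, 1.0, 0.0])
xt, vt = geodesic_point_and_velocity(x0, x1, 0.5)
print(np.linalg.norm(xt), np.dot(xt, vt), np.linalg.norm(vt))
```

The point stays on the sphere and the velocity is orthogonal to it, which is why features that are already normalized fit this formulation without leaving the manifold.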


[59] GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding cs.CVPDF

Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim

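TL;DR: This paper proposes GuideCAD, a lightweight multimodal framework for generating 3D CAD models via prefix embedding. A pretrained large language model integrates visual and textual information: a mapping network converts image embeddings into prefix embeddings, and a transformer-based decoder predicts the construction sequence that yields the model. Experiments show that, compared with fine-tuning approaches, GuideCAD generates 3D CAD models of comparable quality with roughly four times fewer parameters and twice the training efficiency.

Details

Motivation: Existing multimodal methods for 3D CAD generation demand substantial computational resources and train inefficiently; the goal is a lightweight, efficient framework.

Result: Evaluated on the newly constructed GuideCAD dataset, generation quality is comparable to existing methods, with roughly four times fewer parameters and twice the training efficiency.

Insight: The novelty is the mapping network that converts image embeddings into prefix embeddings, letting a pretrained large language model integrate multimodal information for lightweight, efficient 3D CAD generation. Viewed objectively, the parameter-efficient design reduces computational overhead and offers a new approach to multimodal 3D generation.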

Abstract: Multi-modal approaches used for 3D CAD generation require substantial computational resources, necessitating efficient training. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations while using only a small number of trainable parameters to generate 3D CAD models. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. A transformer-based decoder then predicts the construction sequence from the visual-textual embeddings to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates comparably high-quality 3D CAD models while using approximately four times fewer parameters and achieving twice the training efficiency compared to fine-tuning approaches. We have released the source code and dataset for our method at: https://github.com/mskimS2/GuideCAD


[60] TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation cs.CV | cs.LGPDF

Dian Gu, Zhengyi Yang

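TL;DR: This paper proposes TrioPose, a native pose-guided text-to-image generation framework built on the SD3.5M architecture. With a Triple-Stream Pose-Aware Diffusion Transformer, a Learnable Relational Bias Mask, and a Pose-Guided Spatial Loss Weighting strategy, it addresses limb distortions and feature crosstalk in complex multi-person scenes and achieves state-of-the-art performance on several benchmarks.

Details

Motivation: Existing UNet-based adapters struggle with long-range spatial dependencies, while naive signal concatenation in multimodal diffusion transformers severely disrupts the pre-trained latent distribution. As a result, pose-guided generation suffers from limb distortions and feature crosstalk in complex multi-person scenes.

Result: TrioPose achieves state-of-the-art performance on challenging benchmarks including Human-Art, CrowdPose, and OCHuman. On Human-Art it attains an AP of 64.33, a 30% improvement over prior methods, and it sets new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.

Insight: The key innovations for handling complex multi-person occlusion are the triple-stream diffusion transformer that treats pose as an independent modality and the Learnable Relational Bias Mask that models topological connectivity at a fine granularity to decouple inter-instance interference. The Pose-Guided Spatial Loss Weighting strategy concentrates anatomical supervision on distortion-prone regions, improving generation quality.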

Abstract: Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of 64.33 on Human-Art, representing a 30% improvement over prior art, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.


[61] Detecting Temporally Localized Manipulations in Authentic Video Streams cs.CVPDF

Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu

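TL;DR: This paper addresses a specific scenario: short, highly realistic manipulated segments inserted into otherwise authentic video streams. It analyzes the limitations of existing datasets and builds a dedicated dataset. It evaluates two complementary detection approaches, a linear probe on DINOv3 features and a consecutive-frame similarity method, establishing an initial benchmark for partially manipulated video detection.

Details

Motivation: Existing deepfake detection, object removal, and video inpainting datasets do not adequately model the realistic scenario in which a short manipulated segment is inserted into an authentic video and the original video continues afterward. A dedicated dataset and methods are therefore needed for detecting such temporally localized realistic manipulations.

Result: Two approaches are evaluated on a custom-curated test set, establishing an initial benchmark for this challenging scenario. The experiments indicate that content-adaptive thresholding mechanisms are needed to improve detection performance.

Insight: The core contribution is identifying and filling the gap in existing research on detecting temporally localized manipulations. By building a targeted dataset and evaluating baseline methods, the paper highlights the importance of content-adaptive thresholding and provides a basis for follow-up work.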

Abstract: The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at https://github.com/OkanUmur/temporally-localized-video-manipulation-detection.
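
The second approach in the abstract, consecutive-frame similarity, can be sketched as follows: compute the cosine similarity between the features of neighbouring frames and flag frames where it drops sharply. This is a minimal sketch under stated assumptions: the function name `boundary_candidates`, the 2-D toy vectors standing in for DINOv3 features, and the fixed threshold of 0.5 are hypothetical. The abstract itself points out that content-adaptive thresholds are needed, so a fixed value is used here only for clarity.

```python
import numpy as np

def boundary_candidates(features, threshold):
    """Frames whose cosine similarity to the previous frame falls below the threshold."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = np.sum(f[:-1] * f[1:], axis=1)          # similarity of frame i and frame i + 1
    boundaries = [int(i) + 1 for i in np.flatnonzero(sims < threshold)]
    return boundaries, sims

# Toy per-frame features: frames 0-1 and 4-5 look alike, frames 2-3 differ (inserted segment).
features = np.array([
    [1.0, 0.0], [1.0, 0.1],
    [0.1, 1.0], [0.0, 1.0],
    [1.0, 0.0], [1.0, 0.05],
])
boundaries, sims = boundary_candidates(features, threshold=0.5)
print(boundaries)
```

In this toy sequence the two flagged frames mark where the inserted segment starts and where the original content resumes.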


[62] An Adaptive Data cleaning Framework for Noisy Label Detection cs.CV | cs.LGPDF

Chen-Hsuan Fang, Wei-Hsinag Chen, Pin-Hsuan Yu, Jung-Hua Wang, Tsung-Wei Pan

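TL;DR: This paper proposes a self-adaptive data-cleaning framework for detecting noisy labels in training data. It maps samples into a unified low-dimensional feature space through modular feature concatenation and integrates local, global, and learning-dynamics cues for multi-metric clustering, adaptively partitioning samples into clean-dominant and noise-dominant components without manual thresholds or noise priors.

Details

Motivation: In real-world applications, labels are often corrupted by ambiguity, human error, or dynamic environments, and over-parameterized deep neural networks easily memorize these noisy labels during training, degrading accuracy and generalization. Existing data-cleaning and sample-selection methods typically rely on manual thresholds, prior knowledge of the noise ratio, or a single metric, which makes them unstable in complex data regimes.

Result: On CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise, the framework shows high recall in all settings, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent training improves accuracy in all evaluated settings, most notably under severe corruption on ImageNet-100.

Insight: The novelty is a multi-metric adaptive clustering framework that integrates local, global, and learning-dynamics cues, avoiding the reliance of conventional methods on a single scalar metric (for example, a 1D Gaussian Mixture Model) and on manual parameters. Viewed objectively, the modular feature concatenation and multi-metric integration offer a threshold-free, practical, low-tuning strategy for noisy-label detection that is more robust in complex noise scenarios.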

Abstract: Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.
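
The 2D instantiation described in the abstract, a local KNN disagreement score concatenated with a global centroid distance and then clustered into two groups, can be sketched with NumPy alone. Several simplifications are assumptions of this sketch and not the paper's method: the per-class mean replaces the k-means centroid, a fixed `k` replaces the class-adaptive KNN, a plain 2-means split replaces whatever multi-metric clustering the authors use, and the data are synthetic blobs with six hand-flipped labels.

```python
import numpy as np

def knn_disagreement(x, y, k):
    """Fraction of each sample's k nearest neighbours that carry a different label."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # exclude the sample itself
    nn = np.argsort(d, axis=1)[:, :k]
    return (y[nn] != y[:, None]).mean(axis=1)

def centroid_distance(x, y):
    """Distance of each sample to the mean feature of its (possibly noisy) class."""
    out = np.empty(len(x))
    for c in np.unique(y):
        m = y == c
        out[m] = np.linalg.norm(x[m] - x[m].mean(axis=0), axis=1)
    return out

def two_means(z, iters=20):
    """Plain 2-means, initialised at the rows with the lowest and highest score sums.

    Returns 0 for the low-score (clean-dominant) cluster and 1 for the
    high-score (noise-dominant) cluster. Assumes neither cluster becomes empty.
    """
    s = z.sum(axis=1)
    centers = np.stack([z[s.argmin()], z[s.argmax()]])
    for _ in range(iters):
        assign = np.linalg.norm(z[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.stack([z[assign == j].mean(axis=0) for j in (0, 1)])
    return assign

# Synthetic data: two well-separated blobs, six labels flipped by hand.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
y = np.repeat([0, 1], 50)
flipped = np.array([3, 17, 42, 60, 77, 91])
y[flipped] = 1 - y[flipped]

# Concatenate the two metrics into a 2D feature and standardise each column.
z = np.column_stack([knn_disagreement(x, y, k=5), centroid_distance(x, y)])
z = (z - z.mean(axis=0)) / z.std(axis=0)
flagged = np.flatnonzero(two_means(z) == 1)
print(flagged)
```

No threshold or noise ratio is supplied anywhere; the split comes only from the structure of the two-metric feature space, which is the property the abstract emphasizes.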


[63] LARA: Latent Action Representation Alignment for Vision-Language-Action Models cs.CV | cs.ROPDF

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

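TL;DR: This paper proposes LARA (Latent Action Representation Alignment), a plug-and-play framework addressing the performance limits that vision-language-action (VLA) models face because they depend on large-scale, high-quality data. LARA jointly optimizes a Latent Action Model (LAM) and a VLA model via representation alignment, so that the LAM learns with action trajectories and avoids spurious visual changes, while the VLA model is regularized by the forward dynamics learned within the LAM, reducing hallucinations of functionally ineffective trajectories.

Details

Motivation: VLA models predict actions directly from observations and language instructions, but their performance is limited by the scarcity of real-world robot action datasets. In existing approaches, LAM and VLA are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA model constrained by frozen LAM representations.

Result: LARA proves versatile and effective for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, with average improvements of about 10%, 5%, and 15%, respectively, across 3 simulation benchmarks and 1 meticulously designed real-world robotic manipulation benchmark.

Insight: The core innovation is a plug-and-play representation alignment framework that jointly optimizes LAM and VLA to mutual benefit: the LAM becomes more robust by learning with action trajectories, and the VLA reduces hallucinations through forward-dynamics regularization. This offers a new way to use abundant unlabeled human video data to strengthen VLA learning.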

Abstract: Vision-language-action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAM and VLA are typically trained separately, leaving LAM ungrounded during VLA training and VLA models constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits where LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving an average of ~10%, ~5%, and ~15% improvement across 3 simulation benchmarks and 1 meticulously designed real-world robotic manipulation benchmark.


[64] GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection cs.CV | cs.AIPDF

Taisei Saito, Koretaka Ogata, Takafumi Hiroi

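TL;DR: GP-Adapter is a training-free method that builds Gaussian Process (GP) uncertainty models on top of frozen CLIP embeddings for few-shot classification and out-of-distribution (OOD) detection. It fuses predictive statistics from the image and text modalities into a variance-aware confidence score, requires no fine-tuning of the CLIP backbone, and relies only on a small few-shot cache and lightweight hyperparameter selection.

Details

Motivation: CLIP excels at zero-shot recognition but yields deterministic similarity scores with little uncertainty information, which is critical under distribution shift and data scarcity. CLIP's uncertainty modeling therefore needs strengthening to improve reliability.

Result: On ImageNet and multiple OOD benchmarks, GP-Adapter is competitive in few-shot classification and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity of GP-based uncertainty modeling and prompt learning.

Insight: The novelty is combining probabilistic inference (Gaussian Processes) with a large pre-trained vision-language model (CLIP). Modality-specific, class-wise one-class GPs with tailored kernels (RBF for images, linear for text) give efficient uncertainty estimates, improving reliability in low-data and distribution-shifted settings.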

Abstract: We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at https://github.com/tms-byte/GP-Adapter
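
The image-side building block in the abstract, a one-class GP with an RBF kernel over a K-shot cache, can be illustrated with the standard GP posterior equations. The sketch below is an assumption-laden illustration and not the released GP-Adapter code: the function names, the choice of setting all training targets to 1, the noise level, the lengthscale, and the 2-D toy embeddings are hypothetical, and the fusion with the text-side linear-kernel GP is not shown.

```python
import numpy as np

def rbf(a, b, lengthscale):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * lengthscale**2))

def gp_predict(cache, queries, lengthscale=1.0, noise=1e-2):
    """Posterior mean and variance of a one-class GP (all training targets set to 1)."""
    # One (K, K) matrix per class: storing it for C classes costs O(C K^2) memory.
    k_xx = rbf(cache, cache, lengthscale) + noise * np.eye(len(cache))
    k_qx = rbf(queries, cache, lengthscale)            # (Q, K)
    solve = np.linalg.solve(k_xx, k_qx.T)              # (K, Q)
    mean = solve.sum(axis=0)                           # k_qx @ inv(k_xx) @ ones
    var = 1.0 - np.sum(k_qx.T * solve, axis=0)         # prior variance k(q, q) is 1 for RBF
    return mean, var

# K = 4 cached embeddings of one class, one query near the cache and one far from it.
cache = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, -0.05], [1.0, 0.1]])
queries = np.array([[1.0, 0.0], [-1.0, 0.0]])
mean, var = gp_predict(cache, queries, lengthscale=0.5)
print(mean, var)
```

The query far from the cache keeps a predictive variance close to the prior, while the nearby query has low variance; a variance-aware confidence score can exploit exactly this gap, which a deterministic similarity score does not expose.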


[65] TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance cs.CVPDF

Duc Tri Tran, Trung Thanh Nguyen, Vijay John, Phi Le Nguyen, Yasutomo Kawanishi

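TL;DR: This paper proposes TraRA for text detection and recognition in urban surveillance video. It aggregates recognition at the trajectory level, exploiting temporal and multimodal consistency to make video text recognition more robust.

Details

Motivation: Existing video text spotting methods usually recognize text independently on each frame, so they are vulnerable to dynamic factors such as motion blur, occlusion, and scale variation, producing inconsistent and inaccurate results across sequences.

Result: Extensive experiments on four public benchmarks (RoadText, BOVText, ArTVideo, and ICDAR15) show that TraRA consistently outperforms state-of-the-art VTS methods in tracking and recognition.

Insight: The novelty is a plug-and-play trajectory-level recognition aggregation framework with two key modules, Temporal Clustering and Vision-Language Aggregation; the latter uses a vision-language model enhanced with Low-Rank Adaptation to fuse visual cues with linguistic context across frames.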

Abstract: Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) the Temporal Clustering and (2) the Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation-enhanced Vision-Language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at https://github.com/trid2912/TraRA.


[66] Textual Supervision Enhances Geospatial Representations in Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LGPDF

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya

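TL;DR: This paper analyzes geospatial representation learning in three model families (vision-only architectures, vision-language models, and large-scale multimodal foundation models) and finds that textual supervision improves geospatial understanding, particularly for image geolocation and spatial reasoning.

Details

Motivation: Geospatial understanding is a critical yet underexplored dimension of machine learning systems for tasks such as image geolocation and spatial reasoning; the paper aims to assess the gaps in geospatial representations across different models.

Result: By evaluating across image clusters grouped by degree of localizability (e.g., people, landmarks, and everyday objects), the study reveals systematic gaps in spatial accuracy and shows that textual supervision enhances the learning of geospatial representations.

Insight: Language is an effective complementary modality for encoding spatial context, multimodal learning is a key direction for advancing geospatial AI, and textual supervision can compensate for the weaknesses of vision-only models.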

Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.


[67] Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs cs.CVPDF

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang

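TL;DR: To address the privacy challenges facing multimodal large language models (MLLMs), this paper proposes Anchored Privacy Drifting (APD), a training-free method. APD drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image, balancing privacy protection against context preservation. To evaluate this dual objective systematically, the authors also introduce the AdaptShield benchmark, which covers 22 privacy categories, and validate the method with extensive experiments.

Details

Motivation: User inputs to MLLMs often contain unpredictable sensitive information, and model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods rely on predefined sensitive categories and fixed obfuscation strategies, and struggle with these challenges.

Result: Extensive experiments across four MLLM series (Qwen2.5, Qwen3, InternVL3, and InternVL3.5) show balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation.

Insight: The novelty is the training-free APD method, which adapts to open-world privacy control through semantically equivalent drifting and contextual anchoring, avoiding the limits of predefined categories. The paper also introduces AdaptShield, a comprehensive benchmark that combines conventional privacy metrics with MLLM-based assessments of contextual utility, as a new tool for systematic evaluation.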

Abstract: Multimodal large language models (MLLMs) have raised new privacy challenges. On the data side, user-provided inputs often include unpredictable sensitive information; while on the downstream task side, model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods, however, rely on predefined sensitive categories and fixed obfuscation strategies, struggling to tackle such challenges in MLLMs. To address this dilemma, we propose Anchored Privacy Drifting (APD), a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image. To systematically evaluate this dual objective of privacy protection and contextual preservation, we introduce AdaptShield, a comprehensive benchmark covering 22 privacy categories, which combines conventional privacy metrics with MLLM-based assessments of contextual utility. Extensive experiments show that our method achieves balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation across four MLLM series, i.e., Qwen2.5, Qwen3, InternVL3, and InternVL3.5.


[68] When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing cs.CVPDF

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang

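TL;DR: This paper addresses privacy protection in MLLM-based image editing and proposes SPPE, the first recovery-oriented benchmark, covering 36 fine-grained privacy categories and 65 editing instructions. The benchmark defines two complementary tasks, editability assessment and surrogate-to-source edit recovery, and the paper proposes ERMA and C2E-S2SER to address them respectively; experiments show consistent improvements on both the SPPE and InstructPix2Pix benchmarks.

Details

Motivation: Existing privacy protection strategies replace sensitive regions with surrogate content before cloud editing but neglect local recovery, so the output is an edited surrogate rather than the desired edited source image. The paper targets this recovery blind spot in surrogate-based privacy protection.

Result: On editability assessment, ERMA improves over the best baselines by 13.9% in SRCC and 12.3% in PLCC; on surrogate-to-source edit recovery, C2E-S2SER outperforms SOER on all 8 source integrity and edit consistency metrics on the SPPE benchmark.

Insight: The novelty is SPPE, the first recovery-oriented benchmark for privacy-preserving editing, which treats editability assessment and surrogate-to-source edit recovery as core tasks. Instruction-aware multimodal relation modeling and cycle-consistent recovery address the two tasks, giving privacy-preserving image editing a new evaluation framework and methodology.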

Abstract: Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet, the resulting output is often an edited surrogate rather than the desired edited source image, neglecting the local recovery in both design and evaluation scope. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.


[69] OPTIMUS-Prime: Minimal and Sufficient Concept Explanations for Deep Vision Models cs.CV | cs.LGPDF

Arthur Hoarau, Chenrui Zhu, Vu Linh Nguyen

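TL;DR: This paper proposes OPTIMUS, a new framework for generating concept-based explanations for deep visual classification models. Its explanations take the form of visual heatmaps that remain interpretable to end users while being grounded in the well-established theory of prime implicants, which provides formal guarantees, in particular the two key properties of sufficiency and minimality.

Details

Motivation: Current explainability methods in computer vision often sacrifice formal guarantees for end-user accessibility, leaving a critical gap between practical utility and theoretical rigor. The paper aims to close this gap with a theoretically grounded framework for visual explanations.

Result: The method is validated on a visual classification benchmark, where OPTIMUS heatmaps naturally and faithfully surface the decision-relevant concepts behind model predictions.

Insight: The main novelty is applying the theory of prime implicants from logic to visual explanation, so that explanations satisfy both sufficiency (the highlighted concepts guarantee the prediction) and minimality (no strict subset of them guarantees the prediction). This gives formal guarantees while keeping visual coherence, which existing saliency-based methods lack.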

Abstract: The growing demand for transparency in automated decision-making has propelled eXplainable Artificial Intelligence (XAI) to the forefront of machine learning research. In computer vision, however, existing explanation methods often prioritize end-user accessibility at the expense of formal guarantees, leaving a critical gap between practical utility and theoretical rigor. In this paper, we address this gap by introducing OPTIMUS, a novel framework for generating concept-based visual explanations for deep classification models. OPTIMUS explanations take the form of visual heatmaps that not only remain interpretable to end users, but are grounded in the well-established theory of prime implicants, providing formal guarantees that have been largely absent from existing saliency-based methods. Specifically, OPTIMUS explanations satisfy two desirable properties: sufficiency, ensuring that the highlighted concepts provably guarantee the classifier’s prediction, and minimality, ensuring that no strict subset of those concepts retains this guarantee. Together, these properties yield explanations that are both logically tight and visually coherent. We validate our approach on a visual classification benchmark, demonstrating that OPTIMUS heatmaps naturally and faithfully surface the decision-relevant concepts underlying model predictions.
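The two properties named in the abstract, sufficiency and minimality, can be illustrated on a toy problem. The sketch below is not the OPTIMUS algorithm, which the abstract does not describe; it is a brute-force check over binary concepts, and the classifier `toy_predict`, the concept names, and both helper functions are hypothetical.

```python
from itertools import combinations, product

def is_sufficient(predict, instance, subset):
    """Return True if fixing the concepts in `subset` to their observed values
    forces the same prediction for every setting of the remaining concepts."""
    target = predict(instance)
    free = [i for i in range(len(instance)) if i not in subset]
    for values in product([0, 1], repeat=len(free)):
        candidate = list(instance)
        for i, v in zip(free, values):
            candidate[i] = v
        if predict(tuple(candidate)) != target:
            return False
    return True

def minimal_sufficient_subsets(predict, instance):
    """Enumerate concept subsets by increasing size and keep every sufficient
    subset that has no sufficient strict subset (brute force, exponential)."""
    n = len(instance)
    found = []
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            s = set(subset)
            if any(f < s for f in found):
                continue  # a smaller sufficient subset exists, so s is not minimal
            if is_sufficient(predict, instance, s):
                found.append(s)
    return found

# Hypothetical toy classifier over three binary concepts.
def toy_predict(x):
    stripes, four_legs, mane = x
    return int(stripes and four_legs)  # positive iff stripes and four legs are present

explanations = minimal_sufficient_subsets(toy_predict, (1, 1, 1))
print(explanations)
```

Here the only minimal sufficient explanation is the pair of concepts at indices 0 and 1: fixing both guarantees the prediction whatever the third concept is, and dropping either one breaks that guarantee.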


[70] Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking cs.CV | cs.LG | cs.ROPDF

Eduardo Borges, Luís Garrote, Urbano J. Nunes

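TL;DR: This paper systematically studies image-based ReID in online 3D multi-pedestrian tracking. It proposes a lightweight projection-based framework that decouples geometric and appearance modeling, and analyzes how different feature extraction architectures and multi-modal data association strategies balance computational latency against robust tracking.

Details

Motivation: LiDAR-based 3D MOT typically relies on geometric information alone, which makes targets hard to distinguish during prolonged occlusions or in crowded environments, while existing methods that integrate RGB ReID are computationally expensive and struggle to meet the real-time requirements of mobile robots.

Result: Experiments on the Pedestrian class of the KITTI dataset show that naive linear fusion degrades performance because of visual noise, whereas a cascaded matching strategy recovers occluded tracks without loss of overall precision and markedly reduces identity switches.

Insight: The novelty lies in systematically exploring lightweight CNNs and Vision Transformers for ReID in 3D MOT and in a cascaded matching strategy that balances latency and discriminative power, supporting safe navigation and social awareness on mobile robots.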

Abstract: LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
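To make the idea of cascaded matching concrete, here is a minimal sketch under stated assumptions: geometry is matched first, and appearance is used only for the tracks and detections left over. The stage order, the greedy matcher, the gate values, and the `pos`/`feat` fields are all illustrative assumptions; the abstract does not specify the paper's actual costs or thresholds.

```python
import math

def greedy_match(cost, rows, cols, max_cost):
    """Pair rows with cols in order of increasing cost, skipping pairs above max_cost."""
    pairs = sorted((cost[r][c], r, c) for r in rows for c in cols if cost[r][c] <= max_cost)
    used_r, used_c, matches = set(), set(), []
    for _, r, c in pairs:
        if r not in used_r and c not in used_c:
            matches.append((r, c))
            used_r.add(r)
            used_c.add(c)
    return matches

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def cascaded_match(tracks, detections, motion_gate=1.0, appearance_gate=0.3):
    """Stage 1 matches on geometric distance; stage 2 matches leftovers on appearance."""
    t_idx, d_idx = range(len(tracks)), range(len(detections))
    motion = [[math.dist(t["pos"], d["pos"]) for d in detections] for t in tracks]
    stage1 = greedy_match(motion, t_idx, d_idx, motion_gate)
    matched_t = {t for t, _ in stage1}
    matched_d = {d for _, d in stage1}
    left_t = [t for t in t_idx if t not in matched_t]
    left_d = [d for d in d_idx if d not in matched_d]
    appearance = [[cosine_distance(t["feat"], d["feat"]) for d in detections] for t in tracks]
    stage2 = greedy_match(appearance, left_t, left_d, appearance_gate)
    return stage1, stage2

# Hypothetical data: track 1 was occluded, so its last known position is stale.
tracks = [{"pos": (0.0, 0.0), "feat": (1.0, 0.0)},
          {"pos": (5.0, 0.0), "feat": (0.0, 1.0)}]
detections = [{"pos": (0.2, 0.0), "feat": (0.9, 0.1)},
              {"pos": (9.0, 0.0), "feat": (0.1, 0.9)}]
stage1, stage2 = cascaded_match(tracks, detections)
print(stage1, stage2)
```

In this example the visible pedestrian is matched on geometry alone, and the occluded track, which falls outside the motion gate, is recovered in the second stage from its appearance feature. Because appearance never enters the first-stage cost, noisy visual features cannot disturb matches that geometry already resolves, which is one plausible reading of why a cascade can behave better than a weighted sum of the two costs.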


[71] Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation cs.CVPDF

Zihao Zhang, Aming Wu, Yang Li, Yahong Han, Jialie Shen

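TL;DR: This paper proposes a hypergraph-based, geometry-aware reasoning framework for novel class discovery in point cloud segmentation. It models high-order associations among classes with a hypergraph and introduces Geometric-Aware Prototypes to strengthen the representation of class-level geometric information, improving both knowledge transfer from known to novel classes and segmentation accuracy.

Details

Motivation: Existing methods rely mainly on pairwise associations for class assignment and novel class reasoning, which makes it hard to capture the complex relationships between known and novel classes, and they underuse the geometric information in point clouds, leading to inaccurate semantic segmentation.

Result: Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of the method for novel class discovery in point cloud segmentation.

Insight: The novelty lies in using a hypergraph for high-order relational reasoning beyond traditional pairwise associations, combined with Geometric-Aware Prototypes that propagate geometric information through hyperedges, giving a better understanding of spatial distributions across classes and more accurate segmentation.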

Abstract: Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph-based framework that models high-order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric-Aware Prototypes to enhance the representation of class-level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.


[72] Reconstructing Multi-Decadal Forest Disturbances: A Spatio-Temporal Transformer Approach cs.CVPDF

Linus Scheibenreif, Anton Raichuk, Maxim Neumann

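TL;DR: This paper proposes a vision-transformer-based deep learning framework that reconstructs 38 years (1984-2022) of forest disturbance history across the contiguous United States. By modeling temporal trajectories and spatial neighborhoods simultaneously, it filters noise from weak supervision signals and produces spatially coherent disturbance maps.

Details

Motivation: Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time series and ignore spatial context.

Result: The method is evaluated exhaustively across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the most recent 6 years), and validated on a manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The spatio-temporal model reaches precision of up to 98.2% (±1 year detection) on MTBS and up to 71.3% on the CONUS validation dataset, with F1-scores of up to 75.8% and 47.3% respectively. It effectively reduces spatial artifacts but shows performance trade-offs across disturbance regimes relative to pixel-wise baselines.

Insight: The main novelty is applying a vision transformer architecture to forest disturbance monitoring, exploiting spatial context through joint spatio-temporal modeling to improve noise filtering under weak supervision and the spatial coherence of the maps. Viewed objectively, using a transformer for long-horizon, large-scale remote sensing time series and systematically evaluating it across disturbance regimes offers a promising foundation for consistent forest monitoring.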

Abstract: Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time-series, ignoring spatial context. We present a deep learning framework that maps 38 years (1984-2022) of forest disturbance across the contiguous United States by modeling temporal trajectories and spatial neighborhoods simultaneously. By leveraging a vision transformer architecture, our approach effectively filters noise from weak supervision signals to produce spatially coherent disturbance maps. We perform exhaustive evaluations across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the more recent 6 years), validating performance against a novel, manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The results highlight the complexity of the task: while our spatio-temporal model demonstrates high precision (up to 98.2% for ±1 year detection on MTBS and up to 71.3% on the CONUS validation datasets, with F1-scores up to 75.8% and 47.3%, respectively) and effectively reduces spatial artifacts, it exhibits performance trade-offs across different disturbance regimes compared to pixel-wise baselines. Our method offers a promising foundation for consistent forest monitoring.


[73] AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization cs.CVPDF

Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang

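TL;DR: AnchorWorld is a framework for embodied egocentric world simulation that advances egocentric simulation through enhanced interaction integrity and a flexible customization mechanism. It uses 3D human motion as the primary interaction modality and introduces auxiliary training supervision to compensate for body parts that are out of view or truncated in egocentric views. It also proposes a simple yet effective mechanism for customizing self-evolving worlds by defining anchor views in a unified world coordinate system together with textual descriptions.

Details

Motivation: Interactive world modeling remains underexplored with respect to the versatile controllability that practical scenarios require, and existing methods lack flexible world customization.

Result: Experiments show that AnchorWorld significantly outperforms state-of-the-art baselines, and ablation studies validate the effectiveness of its key designs. The customization scheme shows promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

Insight: The novelty includes auxiliary training with exogenous viewpoints to strengthen the spatial grounding of human-world interactions, and customizable, self-evolving world simulation via anchor views and textual descriptions, which offers a more controllable framework for embodied AI and interactive environment construction.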

Abstract: Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent’s first-person sensorium. It allows the model to observe the agent’s full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.


[74] CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models cs.CV | cs.AIPDF

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh

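TL;DR: This paper proposes CultureScore, a compositional evaluation framework for the cultural faithfulness of video generation models that decomposes cultural faithfulness into three dimensions: Identity, Context, and Behavior. An evaluation of 6,180 videos generated by three state-of-the-art models across 10 countries finds that current models perform poorly on cultural faithfulness, with a top score of only 56.8%, and that human preferences are inverted relative to visual quality scores.

Details

Motivation: As video generation models advance, existing metrics such as VideoScore measure only visual quality and offer no mechanism for assessing cultural faithfulness, so a model that misrepresents a cultural element (for example, replacing a Namaste with a handshake) receives the same score as one that renders it correctly.

Result: On an evaluation suite spanning 10 countries, the best overall CultureScore among three state-of-the-art models is only 56.8%. Behavior is the most challenging dimension, staying below 52% for all models. Human preference rankings align directionally with CultureScore but are inverted relative to VideoScore: the model with the highest visual quality ranks last on cultural faithfulness.

Insight: The novelty is the first compositional evaluation framework for cultural faithfulness in video generation models, which decomposes culture into three quantifiable dimensions. Its large-scale cross-cultural evaluation exposes serious shortcomings in how current models represent culture and underscores that cultural faithfulness is necessary for equitable video generation.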

Abstract: As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.


[75] VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning cs.CVPDF

Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon

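TL;DR: VeriDrive is a framework for constructing planning-oriented, verifiable counterfactual supervision, aimed at the high cost of generating free-form reasoning supervision for existing vision-language driving models. It converts driving reasoning into a structured Perception-Evaluation-Revision chain and scales data construction by combining local generation with validator-guided selective correction. The VeriDrive dataset is built on nuScenes and trained under the Omni-Q protocol; experiments show that it improves planning performance while reducing generation cost.

Details

Motivation: The reasoning supervision used by existing vision-language driving models is usually free-form and depends on frontier models to generate, which is expensive. VeriDrive aims to provide verifiable, planning-oriented counterfactual supervision that improves supervision quality under realistic annotation budgets.

Result: Controlled open-loop experiments on nuScenes show that VeriDrive improves on OmniDrive in L2, Collision, and Intersection metrics while substantially reducing logged token usage, generation time, and actual paid LLM/VLM cost.

Insight: The novelty lies in structuring driving reasoning as an auditable Perception-Evaluation-Revision chain and introducing validator-guided selective correction to scale data construction efficiently. This offers a reusable structured approach to building high-quality planning supervision on a limited budget.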

Abstract: Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.


[76] Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition cs.CVPDF

Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang

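TL;DR: This paper proposes a Spatial-Temporal Decoupled Adapter for micro-gesture online recognition, which uses lightweight depthwise convolutions to decompose video adaptation into independent temporal and spatial branches that capture the fine-grained patterns of micro-gestures. To handle the long-tail distribution in the benchmark dataset, it also introduces Adaptive Soft Balanced Augmentation, which allocates augmentation intensity dynamically according to class rarity and learning difficulty. The method achieves an F1 score of 0.43808 and ranks 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

Details

Motivation: Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos, but their extremely short duration, low motion amplitude, and ambiguous visual cues make discriminative spatiotemporal representations very hard to capture. Existing parameter-efficient adapters usually model spatial and temporal cues jointly in a single branch, which may miss the fine-grained patterns of micro-gestures.

Result: In Track 2 of the 4th EI-MiGA-IJCAI Challenge, the method achieves an F1 score of 0.43808 and ranks 1st.

Insight: The novelty includes: 1) the Spatial-Temporal Decoupled Adapter, which models spatial and temporal information in independent branches to better capture the fine-grained features of micro-gestures; and 2) Adaptive Soft Balanced Augmentation, which handles the long-tail distribution dynamically without manual thresholds and improves performance on rare classes.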

Abstract: Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.


[77] Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos cs.CV | cs.LG | cs.ROPDF

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla

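TL;DR: Dash2Sim is a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators. It verifies each log against an independent map without annotations, and was applied to a large video corpus to create the ROADWork4D benchmark dataset, with 4,244 scenes and 2.7M 3D objects. The study also finds that rule-based and hybrid planners outperform learning-based planners in work zone scenarios, but that all planners still fall short on temporary lane changes.

Details

Motivation: The paper addresses the limited data sources available for self-driving simulation, specifically how to use dashcam videos, which cover far more locations and long-tailed scenarios such as work zones but are rarely used because accurate 4D scenes are hard to recover from monocular video.

Result: Evaluating privileged closed-loop planners on the ROADWork4D-CL subset (2,201 scenes) shows that work zone scenarios are challenging: rule-based and hybrid planners generalize better than learning-based ones, but all planners fail to make the lane changes that temporary work zones require. In addition, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics.

Insight: The novelty includes generating simulator-compatible 4D logs from monocular video without annotations and verifying them against a map for accuracy; the method provides rich conditioning for closed-loop sensor simulation and helps with long-tailed driving scenarios.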

Abstract: Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.


[78] Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation cs.CVPDF

Danial Hamdi, Fardin Ayar, Mahdi Javanmardi

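TL;DR: This paper proposes a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates the error sources in Video Instance Segmentation (VIS). Applied to seven VIS methods, the framework shows that tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion that grow sharply with video length and instance density.

Details

Motivation: In VIS, classification, segmentation, and tracking objectives are evaluated jointly, but their individual contributions to performance loss are opaque. A diagnostic tool is needed to identify the bottlenecks explicitly and guide model improvement.

Result: Analysis on YouTube-VIS 2019/2021 and a diagnostic subset of OVIS shows that tracking instability is the main bottleneck for online methods (AP gaps above 20 under occlusion), while semantic classification matters on standard benchmarks but has negligible impact where tracking fails. Stronger backbones raise default scores without closing the tracking gap, indicating that temporal fragility is algorithmic.

Insight: The novelty includes an ILP-based, model-agnostic diagnostic framework that hierarchically isolates error sources, and TrackLens, a visual tool that translates performance gaps into observable, query-level failure modes. Viewed objectively, this gives a systematic basis for targeting the core challenge of VIS, robust long-term temporal association, and stresses algorithmic improvement over representation learning alone.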

Abstract: In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS’s core challenge: robust long-term temporal association.


[79] OpenGlass: Open-Source Smart Glasses for On-Device Event-Based Gesture Recognition cs.CVPDF

Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer, Michele Magno

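TL;DR: This paper introduces OpenGlass, an open-source smart glasses platform for rapid prototyping. Its modular design supports both event-based and frame-based cameras, and a hardware-software co-designed power management system keeps power consumption low. The prototype runs continuously for up to 11.8 hours on a 200 mAh battery and performs real-time event-based gesture recognition on the GAP9 RISC-V SoC.

Details

Motivation: The work addresses the severe power, memory, and compute constraints of compact smart glasses and fills the gap in small open-source hardware platforms that support event-based vision and embedded machine learning.

Result: For event-camera gesture recognition on the LynX dataset, the R(2+1)D model reaches 83.94% accuracy (macro F1 = 0.781) under leave-two-subjects-out validation, with 33.9 ms end-to-end latency; temporal augmentation and removal of ambiguous classes give the largest gains (+8.9 percentage points).

Insight: The novelty includes the modular FPC interposer design, a hardware-software co-designed power management architecture with event-driven wake-up, and the open-source release of the complete hardware designs, firmware, and models, giving wearable research a reproducible platform.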

Abstract: Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.8 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 33.9 ms end-to-end latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.
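The battery figures in the abstract imply an average current draw, which a quick calculation makes explicit. This is a back-of-the-envelope sketch that assumes the full 200 mAh is usable and ignores conversion losses and voltage sag, none of which the abstract quantifies; the variable names are illustrative.

```python
# Figures quoted in the abstract.
battery_capacity_mah = 200.0   # battery capacity in milliamp-hours
runtime_hours = 11.8           # best-case continuous on-device ML runtime

# Average current = capacity / runtime (assumes the whole capacity is usable).
average_current_ma = battery_capacity_mah / runtime_hours
print(f"average current draw: {average_current_ma:.1f} mA")
```

Under these assumptions the whole system averages roughly 17 mA in the best case, which shows how much of the time the SoC must spend powered down between inferences.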


[80] RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents cs.CVPDF

Ameya Joshi, Joon Kim, Gus Eggert, Joseph Bajor, Cindy Hao

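TL;DR: This paper proposes RealDocBench, a benchmark for evaluating document parsing systems on field-level QA and layout understanding over real-world regulated documents. It has two tracks. The QA track contains 1,356 field-level questions over 581 documents from four domains and scores parsers on per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy and scored with a Hungarian matcher.

Details

Motivation: Most public benchmarks evaluate parsers on clean academic layouts or synthetic prose and report a single OCR or markdown-level similarity score, which correlates poorly with what downstream agents need: specific field values extracted from messy real-world documents.

Result: Eighteen systems (commercial parsing APIs, general-purpose VLMs, and open-source OCR models) were evaluated under a uniform extraction-and-scoring protocol, revealing a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs.

Insight: The novelty lies in a two-track benchmark built from real regulated documents that emphasizes fine-grained evaluation of field-level accuracy and layout understanding. Viewed objectively, its typed answer dictionaries and adjacency-aware split/merge recovery bring the evaluation closer to real application scenarios.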

Abstract: Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
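The difference between per-field and strict per-question accuracy is easiest to see in code. The sketch below is an assumption-laden illustration, not the RealDocBench harness: it uses exact equality on values and micro-averages over fields, whereas the benchmark's typed comparison and normalization rules are not given in the abstract. The function `score_answers` and the example field names are hypothetical.

```python
def score_answers(questions):
    """Score a list of (gold_dict, predicted_dict) pairs.

    Returns (per_field_accuracy, strict_per_question_accuracy).
    A field counts as correct on exact equality (an assumption);
    a question counts as strictly correct only if every gold field is correct.
    """
    field_hits = 0
    field_total = 0
    strict_hits = 0
    for gold, predicted in questions:
        hits = sum(1 for key, value in gold.items() if predicted.get(key) == value)
        field_hits += hits
        field_total += len(gold)
        strict_hits += int(hits == len(gold))
    return field_hits / field_total, strict_hits / len(questions)

# Hypothetical example: the second question has one wrong field.
questions = [
    ({"borrower_name": "A. Smith", "loan_amount": 250000},
     {"borrower_name": "A. Smith", "loan_amount": 250000}),
    ({"policy_number": "P-17", "effective_date": "2024-01-01"},
     {"policy_number": "P-17", "effective_date": "2024-01-10"}),
]
per_field, strict = score_answers(questions)
print(per_field, strict)
```

In this example three of four fields are right (0.75 per-field) but only one of two questions is fully right (0.5 strict), which shows how a single wrong field can be nearly invisible in one metric and decisive in the other.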


[81] TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment cs.CV | cs.AI | cs.CL | cs.LGPDF

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

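TL;DR: The paper proposes TEVI, a framework that disentangles image embeddings with sparse autoencoders and trains a masking module to selectively reconstruct the embedding from a given caption, improving the alignment of image and text embeddings in vision-language models such as CLIP. Experiments show that TEVI preserves caption-described attributes and discards unrelated information in a controlled setup with synthetic captions, and improves retrieval performance on natural image datasets.

Details

Motivation: In the shared embedding space of vision-language models such as CLIP, image and text embeddings are often poorly aligned, which hurts downstream performance. This is attributed to an information imbalance: images contain more information than their captions describe.

Result: In controlled experiments with synthetic captions, TEVI preserves caption-described attributes while discarding others. On natural image datasets, TEVI improves retrieval performance on coarse-grained short-caption benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with stronger gains on richer captions, and improves robustness on the RoCOCO benchmark.

Insight: The novelty lies in using captions as a signal, disentangling image embeddings with sparse autoencoders and selectively reconstructing them to address the information imbalance in vision-language models. Viewed objectively, the method offers an interpretable embedding-editing mechanism that may help cross-modal alignment generalize.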

Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.


[82] Watch, Remember, Reason: Human-View Video Understanding with MLLMs cs.CV | cs.AI | cs.MMPDF

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu

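TL;DR: This paper presents a new perspective on video understanding with multimodal large language models (MLLMs), "watch, remember, reason", intended as a unified way to analyze how models acquire evidence, preserve context, and produce reliable outputs. It characterizes video MLLMs by their perceptual representations, memory states, reasoning traces, and final predictions, and organizes representative methods, application domains, datasets, and evaluation benchmarks around these three core abilities.

Details

Motivation: Video understanding research is moving from short clips to long, multimodal, and knowledge-intensive scenarios, which require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. The paper aims to provide a unified analytical framework for these challenges.

Result: This is a survey paper and reports no specific quantitative experimental results; instead it systematically organizes and analyzes existing video MLLM methods and benchmarks across tasks, modalities, and capability dimensions.

Insight: The paper introduces the human-view "watch, remember, reason" framework, which unifies video understanding tasks under three core functions: evidence acquisition, context preservation, and reliable reasoning. This gives a structured methodology for evaluating and designing video MLLMs, and identifies key challenges and future directions in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning.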

Abstract: Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.


[83] Streaming Video Generation with Streaming Force Control cs.CVPDF

Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv, Shenlong Wang

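TL;DR: This paper presents StreamForce, a streaming video generation framework that enables physically grounded video control through continuous force inputs. It uses a unified force representation as the control signal and a distillation pipeline that combines autoregressive efficiency with force responsiveness, responding coherently and in real time to local and global time-varying forces at up to 16.6 FPS on a single GPU.

Details

Motivation: Existing video generation models typically train separate models for different force types, assume fixed forces, or rely on non-causal processing, so they cannot respond coherently and in real time to time-varying forces. The paper aims for a causal, unified model that supports physically grounded streaming video control.

Result: StreamForce achieves state-of-the-art (SOTA) performance in both force adherence and motion realism, and runs at up to 16.6 FPS on a single GPU.

Insight: The novelty lies in a unified force representation used as the control signal and a distillation pipeline for force-controllable video generation, which together yield causal, efficient streaming generation while maintaining photometric and dynamic realism.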

Abstract: We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, or rely on non-causal processing, StreamForce is a causal and unified model that responds instantly and coherently to both local and global, time-varying forces. To achieve this, we design a unified force representation as a control signal and develop a distillation pipeline for force-controllable video generation. Our model combines autoregressive efficiency with force responsiveness, sustaining stable photometric and dynamic realism. StreamForce runs at up to 16.6 FPS on a single GPU, achieving state-of-the-art performance in both force adherence and motion realism. Project website: https://neu-vi.github.io/StreamForce/


[84] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism cs.CV | cs.AI | cs.CLPDF

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang

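TL;DR: MemDreamer is a plug-and-play framework for long-video understanding that decouples perception from reasoning and turns long-video processing into an agentic exploration process. It builds a hierarchical graph memory by incrementally streaming the video and uses agentic tool-augmented retrieval at inference time, achieving state-of-the-art results on several mainstream benchmarks.

Details

Motivation: Current vision-language models face token explosion and attention dilution on hours-long videos, which degrades performance. The paper aims to address this by decoupling perception and reasoning so that models can process long videos efficiently.

Result: MemDreamer achieves state-of-the-art results on four mainstream long-video understanding benchmarks, narrowing the gap with human experts to only 3.7 points. It also constrains the reasoning context window to 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain.

Insight: The novelty lies in the hierarchical graph memory and the agentic tool-augmented retrieval mechanism, which turn long-video understanding into a hierarchical exploration process. The paper also finds a strong positive linear correlation between a vision-language model's logical-reasoning performance and its long-video understanding performance, offering a new scalable paradigm for multimodal understanding.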

Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM’s performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.


[85] Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning cs.CVPDF

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

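TL;DR: This paper proposes Skill-3D, a framework that addresses tool misuse and biased tool preferences of multimodal large language model (MLLM) agents in 3D spatial reasoning. It learns self-evolving scene-aware skills and selects tools dynamically according to the specific scene and task, substantially improving the agent's tool utilization and reasoning performance.

Details

Motivation: Existing methods often misuse tools or show biased tool selection in 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. The core problem is that 3D spatial reasoning tasks are heterogeneous across scenes, while existing agents apply a uniform tool-use strategy to all scenes instead of adapting to the specific scene and task.

Result: On VSI-Bench, tool utilization rises from 39% to 78%; on MMSI-Bench, Skill-3D improves Gemini-3-Flash by 67%. In addition, agentic post-training on skill-guided trajectories boosts Qwen3-VL-8B by 43% on VSI-Bench.

Insight: The novelty lies in a loop in which the scene memory and the skill library co-evolve: successful trajectories from similar scenes are aggregated and distilled into reusable scene-aware skills, and failed trajectories are attached to the skills as lessons. This shifts the agent from a static, uniform tool-use strategy to dynamic, scene-adaptive skill learning and application.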

Abstract: This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent’s tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.


cs.GR [Back]

[86] Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers cs.GR | cs.CV | cs.HCPDF

Adam Cole, Mick Grierson

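TL;DR: This paper presents Real-Time AttentionBender, a real-time interactive network-bending tool for video diffusion transformers (DiTs). Built as a plugin, it lets artists manipulate internal mechanisms such as self-attention, cross-attention, and the feed-forward network at fine granularity while the video is being generated, enabling deep intervention in, and creative exploration of, the generation process.

Details

Motivation: Existing generative video models are operated through text prompts only, which limits artists' creative agency and leaves the model's internal generation process a black box. The paper aims to provide a tool with which artists can explore and manipulate the internals of a video diffusion transformer interactively and in real time, strengthening creative control and understanding of model behavior.

Result: The abstract gives no quantitative evaluation or benchmark data. The contribution is the real-time interactive tool itself, which is integrated as a plugin in the DayDream Scope ecosystem and wraps open-source real-time Wan pipelines.

Insight: The novelty lies in extending the practice of network bending across the full depth of the video diffusion transformer and making it interactive in real time. The tool offers fine-grained targeting (down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons), fosters "material intimacy" between artist and model, and is positioned both as an explainable AI (XAI) research probe and as an instrument for artistic expression.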

Abstract: Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model’s material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call “material intimacy” with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model’s default representational space.


[87] Semantic-Structural Alignment for Generative Pictorial Charts cs.GR | cs.CVPDF

Zhida Sun, Yulin Zhang, Zheng Gu, Min Lu, Bongshin Lee

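TL;DR: This paper proposes a generative framework for automatically synthesizing pictorial charts that bridges the gap between semantic expression and structural faithfulness. It frames the problem as a dual-conditioned generation task in which a text prompt controls the semantic content and a context image controls the chart structure, and it introduces structural alignment and semantic alignment mechanisms to reinforce these controls. The method produces pictorial charts that are artistically appealing and structurally consistent across multiple visual channels and semantic domains.

Details

Motivation: Traditional statistical graphics are precise but lack visual appeal, memorability, and engagement, while existing methods usually treat charts merely as images to be stylized and fail to combine semantic expression with structural faithfulness. The paper aims to synthesize pictorial charts that are both artistically expressive and accurate to the data.

Result: Extensive quantitative evaluations and perceptual user studies show that the framework outperforms traditional controllable generation and image editing baselines at producing high-fidelity, data-driven pictorial charts, providing a foundation for expressive visual storytelling.

Insight: The novelty lies in framing pictorial chart generation as a dual-conditioned control task and introducing feature-level structural alignment and semantic alignment mechanisms. These anchor the spatial layout of the output to the input chart structure while transferring expressive textures from reference images, aligning semantics and structure.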

Abstract: Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart’s global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling. Project page: https://ssalign.github.io/.


cs.LG [Back]

[88] RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning cs.LG | cs.CLPDF

Yongliang Miao, Fengyuan Liu, Wei Shi, Yanguang Liu, Fei Sun

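TL;DR: This paper proposes RASFT (Rollout-Adaptive Supervised Fine-Tuning), a policy-aware supervised fine-tuning framework for improving the fine-tuning of large language models on reasoning tasks. It estimates problem solvability from verified on-policy rollouts, adjusts the strength of imitation of expert trajectories accordingly, and incorporates the model's own correct trajectories when its reasoning is already reliable, mitigating the overfitting of conventional SFT.

Details

Motivation: When adapting large language models to reasoning tasks, conventional supervised fine-tuning (SFT) usually treats a single expert trajectory as the imitation target. Because reasoning is not simple path imitation, this can cause the model to overfit to surface forms and suppress its own reasoning distribution.

Result: Experiments on six mathematical reasoning benchmarks and two code reasoning benchmarks show that, across multiple models, RASFT achieves better overall performance than standard SFT, SFT variants, and representative reinforcement learning (RL) methods.

Insight: The core contribution is a policy-aware fine-tuning framework that adjusts the strength of expert supervision according to problem solvability, together with a clipped inverse ratio between the frozen reference model and the current policy that constrains excessive policy drift. This exploits expert knowledge while better eliciting and preserving the model's own reasoning ability.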

Abstract: Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model’s own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.


[89] DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios cs.LG | cs.AI | cs.CLPDF

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang

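TL;DR: This paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. The framework operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented with a pre-trained autoregressive language model and a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their expert scores are collected from real debate competitions, a constrained data augmentation strategy addresses the elite bias in the original data, and a mixed-granularity training strategy enables robust learning from limited supervision.

Details

Motivation: Human creativity has become a critical competency in the era of large language models, but assessing creativity in complex, open-ended environments is a major challenge in data mining, currently limited by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. Debate is an ecologically valid assessment context that reflects multiple dimensions of creativity, yet mainstream automated scoring methods are poorly suited to complex settings such as debate and still rely on costly human evaluation.

Result: Across the evaluation protocol, the scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

Insight: The contributions include: 1) a hierarchical eight-dimensional metric system for debate creativity that enables fine-grained operationalization; 2) an architecture that combines a pre-trained language model with a hierarchical scoring head and supports mixed-granularity evaluation; 3) constrained data augmentation and mixed-granularity training, which handle limited expert annotations in a data-efficient way; 4) an empirical study with debate-naive participants that validates the framework's ecological validity beyond synthetic benchmarks.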

Abstract: Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.


[90] Closed-Form Spectral Regularization for Multi-Task Model Merging cs.LG | cs.CVPDF

Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen, Chun Yuan

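TL;DR: This paper proposes a closed-form multi-task model merging method based on spectral regularization. It reformulates model merging as a noisy linear inverse problem and designs a spectral filtering estimator that suppresses noise amplification with a soft exponential filter and hard truncation. The method requires no training data or optimizer state and substantially improves merging efficiency and performance.

Details

Motivation: Existing model merging methods treat merging as a layer-wise quadratic interference minimization problem. Its closed-form pseudoinverse solution underperforms iterative gradient descent in practice, but the iterative process is costly and its effectiveness has not been explained. The paper aims to reveal how the iterative solver works and to propose a more efficient closed-form merging method.

Result: On four general benchmarks and a multimodal merging benchmark spanning VQA, geometry, chart, OCR, grounding, and modality merging, the proposed spectral solvers (SWUDI and its adaptive variant SWUDI-A) match or outperform state-of-the-art merging methods while reducing wall-clock time by 28-72x and peak GPU memory by up to 50%.

Insight: The novelty lies in formalizing model merging as a noisy linear inverse problem and introducing a spectral filtering estimator that makes explicit, in closed form, the implicit regularization that iterative descent applies to the ill-posed normal equation. SWUDI combines a soft exponential filter (matching the gradient-flow trajectory) with hard truncation (suppressing noise amplification in small-eigenvalue directions), and its adaptive variant further improves robustness across architectures.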

Abstract: Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
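The spectral-filtering idea described in this abstract can be illustrated with a minimal sketch. It assumes the per-layer operator has already been eigendecomposed, so the inputs are its eigenvalues and the right-hand side projected onto the matching eigenvectors. The filter `(1 - exp(-t * lam)) / lam` is the standard gradient-flow filter for a quadratic objective started from zero, which is not necessarily the paper's exact parameterization; the function name and the parameters `t` and `k` are hypothetical.

```python
import math

def filtered_coefficients(eigvals, coeffs, t, k):
    """Return solution coefficients in the eigenbasis after spectral filtering.

    eigvals: eigenvalues of a symmetric positive semi-definite operator.
    coeffs:  right-hand side projected onto the matching eigenvectors.
    t:       gradient-flow time (assumed hyperparameter).
    k:       number of leading eigen-directions to keep (assumed hyperparameter).
    """
    order = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)
    keep = set(order[:k])  # hard top-K truncation
    out = []
    for i, (lam, c) in enumerate(zip(eigvals, coeffs)):
        if i not in keep or lam <= 0.0:
            out.append(0.0)  # direction is dropped entirely
        else:
            # soft exponential filter: approaches 1/lam for large t, and t for tiny lam
            out.append(-math.expm1(-t * lam) / lam * c)
    return out

eigvals = [4.0, 1.0, 1e-6]
coeffs = [4.0, 1.0, 1e-3]  # the last entry plays the role of proxy noise
pinv = [c / lam for lam, c in zip(eigvals, coeffs)]  # plain pseudoinverse
filtered = filtered_coefficients(eigvals, coeffs, t=50.0, k=2)
```

In this toy setting the pseudoinverse multiplies the noisy third coefficient by 1/1e-6, while the filtered solution keeps the two well-conditioned directions essentially unchanged and zeroes the third.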


cs.RO [Back]

[91] What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos? cs.RO | cs.AI | cs.CV | cs.LGPDF

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du

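TL;DR: This paper studies which factors matter when cotraining robot manipulation policies on everyday human videos. It finds that hand pose quality affects transfer, but that even with accurate hand poses the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment.

Details

Motivation: The paper addresses how to train robot manipulation policies from plentiful everyday Internet video rather than specially recorded demonstrations, and investigates the key factors that enable transfer from such video to robots.

Result: The proposed cotraining recipe yields consistent improvements across six manipulation tasks, with an absolute success rate gain of 29.7% in the low-robot-data regime.

Insight: The novelty lies in a systematic analysis of the key factors for transfer from everyday video to robots (hand pose quality and the motion gap), and in a cotraining recipe that bridges the motion gap by specializing the vision and policy networks to each embodiment.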

Abstract: Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of 29.7% in the low-robot-data regime across six manipulation tasks.


[92] Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation cs.RO | cs.AI | cs.CVPDF

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao

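TL;DR: This paper introduces FLIGHT, a benchmark for fine-grained long-horizon UAV navigation, and FLIGHT VLA, an asynchronous architecture that decouples low-frequency task-state reasoning from high-frequency continuous control in order to execute multi-stage semantic instructions and produce smooth, physically feasible flight trajectories.

Details

Motivation: Existing vision-language navigation benchmarks typically use discrete or coarse actions, while UAV vision-language-action tasks focus on short, atomic maneuvers. Neither setting covers the execution of long-horizon, fine-grained semantic instructions.

Result: In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on the FLIGHT benchmarks, with stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM also improves UAV video reasoning.

Insight: The novelty lies in a fine-grained long-horizon navigation benchmark that combines multi-stage instructions with dense 6-DoF trajectory annotations, and in an asynchronous architecture, supervised by explicit "Pilot Reasoning" texts, that decouples task reasoning from continuous control so that in-flight task-state reasoning and high-frequency precise control can work together in real time.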

Abstract: Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.


[93] ActionMap: Robot Policy Learning via Voxel Action Heatmap cs.RO | cs.CVPDF

Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai

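TL;DR: This paper proposes ActionMap, a voxel action heatmap decoder for robot policy learning that replaces the unstructured single-point action predictors used in existing vision-language-action (VLA) models. It models the action space as a voxel grid and predicts a probability heatmap over the voxels, exploiting the geometric proximity of neighboring actions. Experiments in LIBERO simulation and on real-world Franka manipulation tasks show improved performance, faster convergence, and better data efficiency across multiple backbones.

Details

Motivation: The action decoders of most current VLA models (autoregressive token bins, L1 regression, or flow-matching denoising) treat the action space as unstructured and ignore the geometric proximity of neighboring actions, which limits training efficiency and performance. The paper addresses this by introducing a structured action representation.

Result: On the LIBERO four-suite average, ActionMap improves over OpenVLA-OFT's L1 regression head by 8.2%. On two architecturally distinct backbones it converges faster and is more data-efficient, with markedly better results at low training data.

Insight: The core idea is to recast action prediction from single-point regression into probability heatmap prediction over a structured action space (a voxel grid), explicitly modeling the geometry of actions. The results indicate that action representation is an independent and effective lever for VLA performance, distinct from further scaling of the backbone or training recipe.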

Abstract: Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone’s hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT’s L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://github.com/showlab/ActionMap.
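To make the heatmap formulation concrete, the following is a minimal sketch of a voxel heatmap over a small continuous action space. The Gaussian-smoothed target, the argmax decoding, the grid bounds, and all function names are assumptions for illustration; the abstract does not specify how ActionMap builds its targets or decodes actions.

```python
import itertools
import math

def voxel_centers(low, high, bins):
    """Centers of `bins` equal-width cells on the interval [low, high]."""
    width = (high - low) / bins
    return [low + (i + 0.5) * width for i in range(bins)]

def target_heatmap(action, low, high, bins, sigma):
    """Gaussian-smoothed probability over a bins^D voxel grid (assumed target).

    Voxels near the ground-truth action receive more mass, which is one way
    to expose the geometric proximity of neighboring actions to the loss.
    """
    centers = voxel_centers(low, high, bins)
    grid = list(itertools.product(centers, repeat=len(action)))
    weights = [
        math.exp(-sum((c - a) ** 2 for c, a in zip(voxel, action)) / (2 * sigma ** 2))
        for voxel in grid
    ]
    total = sum(weights)
    return grid, [w / total for w in weights]

def decode(grid, probs):
    """Return the center of the most probable voxel (assumed decoding rule)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return grid[best]

grid, probs = target_heatmap((0.1, -0.3, 0.5), low=-1.0, high=1.0, bins=5, sigma=0.3)
decoded = decode(grid, probs)
```

With 5 bins per axis the decoded action snaps to the nearest voxel center, so the grid resolution bounds the precision of this simple decoding rule.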


[94] Robotic Policy Adaptation via Weight-Space Meta-Learning cs.RO | cs.CV | cs.LGPDF

Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio

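TL;DR: This paper proposes WIZARD, a weight-space meta-learning framework for robot policy adaptation. It generates task-specific LoRA parameters for a frozen vision-language-action model from only a language instruction and a short demonstration video, adapting in a single forward pass without target-task action labels or test-time optimization.

Details

Motivation: Adapting existing vision-language-action models to new tasks typically requires task-specific demonstrations, action annotations, and additional fine-tuning, which makes deployment costly and hard to scale.

Result: On the LIBERO benchmark, WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda robot, it consistently improves over a real-domain adapted baseline.

Insight: The novelty lies in applying meta-learning in weight space, directly learning a mapping from task evidence to expert LoRA updates. This enables efficient policy adaptation without fine-tuning and offers a scalable route to robot deployment.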

Abstract: Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.
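As a reminder of what "generating LoRA parameters for a frozen policy" means at the weight level, here is a minimal sketch of the standard LoRA update `W + (alpha / rank) * B @ A`. In WIZARD the factors would be predicted from the instruction and demonstration video; the hand-written `lora_b` and `lora_a` below are placeholders, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * B @ A without modifying `weight`."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

frozen = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (2, 2)
lora_b = [[1.0], [2.0]]            # placeholder factor, shape (2, rank)
lora_a = [[0.5, -0.5]]             # placeholder factor, shape (rank, 2)
adapted = apply_lora(frozen, lora_b, lora_a, alpha=2.0, rank=1)
```

Because the base weight is never modified, a different pair of factors can be swapped in per task, which is what makes single-forward-pass adaptation possible in principle.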


[95] Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation cs.RO | cs.AI | cs.CVPDF

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang

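TL;DR: This paper proposes a new paradigm called Trajectory Waypoint for vision-language navigation in continuous environments. It associates each candidate waypoint with an executable trajectory, generates obstacle-avoiding trajectories with a TSDF-guided diffusion policy, and uses a trajectory-enhanced navigator to keep high-level semantic decisions consistent with low-level execution.

Details

Motivation: Existing VLN-CE methods usually adopt a decoupled three-stage framework of waypoint prediction, navigation selection, and low-level control, which often leads to unreachable waypoints or inconsistencies between planning and control.

Result: Extensive experiments on the VLN-CE benchmark show that the proposed Trajectory Waypoint paradigm outperforms the baselines.

Insight: The novelty lies in extending waypoint prediction from isolated points to executable trajectories, using a diffusion policy to ensure obstacle avoidance and waypoint reachability, and injecting trajectory information into the navigator to improve planning consistency.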

Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best waypoint, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.


[96] Planning-aligned Token Compression for Long-Context Autonomous Driving cs.RO | cs.AI | cs.CVPDF

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding

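TL;DR: This paper proposes COMPACT-VA, a planning-aligned token compression framework for long-context autonomous driving. Built on a conditional VQ-VAE, it compresses extended temporal context into a bounded representation conditioned on the historical trajectory and a learned planning intent, retaining decision-critical information and allowing end-to-end optimization.

Details

Motivation: Monolithic vision-action models produce token sequences that exceed real-time computational budgets when encoding long temporal context, while existing compression methods (for example, rule-based heuristics such as temporal decay) are decoupled from planning and risk losing decision-critical information.

Result: In behavior-correctness evaluation on high-signal dynamic scenarios (for example, stop, yield, or proceed), COMPACT-VA achieves a >6% improvement in success rate (reaching 68.3%) under comparable token budgets. In closed-loop evaluation it maintains general driving performance with a 3.3x speedup and 2.7x memory reduction.

Insight: The novelty lies in aligning compression with planning intent: the posterior encoder of the conditional VQ-VAE distills the planning intent from future trajectories, and the prior encoder learns to predict that intent from compressed observations. This ensures the compressed representation retains decision-relevant information without any change to the backbone architecture.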

Abstract: Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve >6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with 3.3x speedup and 2.7x memory reduction over uncompressed processing.
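The quantization step at the heart of any VQ-VAE can be sketched in a few lines: each continuous vector is replaced by the index of its nearest codebook entry. This sketch shows only that generic step; the conditioning on historical trajectory and planning intent described in the abstract is not modeled, and the codebook, the context vectors, and the function name are made-up illustrations.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    def sq_dist(code):
        # squared Euclidean distance between the input and one code
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # toy codebook with 3 codes
context = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.2]]   # toy context features
indices = [quantize(v, codebook)[0] for v in context]
```

Storing one small integer per vector instead of the vector itself is what makes the memory bounded; how many codes are kept per time span is a design choice the abstract does not detail.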


eess.IV [Back]

[97] Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery eess.IV | cs.CV | cs.LGPDF

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

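TL;DR: This paper proposes a physics-constrained deep learning framework for flood prediction. It fuses multi-modal remote sensing data (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) and uses the shallow water equations (SWE) as a physical constraint. The architecture combines a UNet, which captures fine-scale spatial detail, with a Fourier Neural Operator (FNO), which models basin-scale hydraulic interactions, and physics-informed residual losses enforce mass and momentum consistency.

Details

Motivation: Accurate and scalable flood mapping is difficult because ground observations are limited, terrain conditions are heterogeneous, and data-driven models struggle to guarantee hydrodynamic consistency.

Result: Evaluated across diverse floodplain settings, the hybrid model achieves an IoU of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, it achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity, and it maintains physical consistency with mass imbalance below 2.1%.

Insight: The main contribution is embedding hydrodynamic principles (the shallow water equations) as physical constraints in a deep learning framework, with a hybrid UNet and FNO architecture that handles local detail and global interactions together. Viewed objectively, this physics-guided approach improves reliability, stability, and generalization, and shows potential for operational monitoring and large-scale deployment.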

Abstract: Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.
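The two flood-extent metrics quoted above are related, which a short sketch makes explicit. It computes IoU and F1 for binary masks given as flat lists; the function name and the toy masks are illustrative and are not taken from the paper's evaluation code.

```python
def iou_and_f1(pred, target):
    """Compute IoU and F1 for two binary masks given as flat sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)       # flooded in both
    fp = sum(1 for p, t in zip(pred, target) if p and not t)   # false alarm
    fn = sum(1 for p, t in zip(pred, target) if not p and t)   # missed flood
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return iou, f1

pred = [1, 1, 1, 0, 0]
target = [1, 1, 0, 1, 0]
iou, f1 = iou_and_f1(pred, target)
```

When both metrics are computed from the same pooled pixel counts, F1 = 2 * IoU / (1 + IoU), so an IoU of 0.82 corresponds to an F1 of about 0.90, in line with the reported pair. The identity need not hold exactly if scores are averaged per image.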


[98] Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images eess.IV | cs.CV

Yifei Yin, Xiaogang Yu, Hao Shi, Liang Chen, Wei Li

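TL;DR: This paper proposes a new paradigm for interpreting aircraft targets in synthetic aperture radar (SAR) images, Semantic Scattering Structure Understanding, to address the incomplete and unstable structure reconstruction caused by existing methods' reliance on unordered, component-agnostic local scattering center representations. To this end, the paper defines semantic scattering keypoints, associates them with physical aircraft components, and introduces visibility-aware attributes to retain components with weak scattering responses, yielding a stable semantic scattering structure. On this basis, the authors propose S3U-SAR, a physics-driven framework that localizes the keypoints and constructs a complete representation constrained by multi-dimensional physical priors, and they build KP-SAR-Aircraft-1.0, the first fine-grained benchmark dataset.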

Details

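Motivation: Existing SAR target interpretation methods rely mainly on local scattering center representations. Such unordered, component-agnostic representations are highly unstable for aircraft targets and often miss physical components with weak scattering responses, leaving the reconstructed topology incomplete.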

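Result: Extensive experiments show that the S3U-SAR framework achieves the best performance on the newly constructed KP-SAR-Aircraft-1.0 benchmark. Cross-category and cross-dataset evaluations further verify its robustness and transferability.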

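Insight: The core innovation is bringing semantic understanding into the analysis of SAR target scattering structure: by defining semantic scattering keypoints and introducing visibility-aware attributes, local electromagnetic responses are associated with physically meaningful aircraft components, yielding a more complete and stable structural representation. In addition, the physics-driven framework (S3U-SAR) integrates multi-dimensional physical priors, including scattering heterogeneity, rigid-body topology, and speckle uncertainty, together with a confidence-gated joint supervision strategy, which alleviates optimization conflicts and improves generalization.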

Abstract: Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically existing components with weak scattering responses are often missed, resulting in an incomplete reconstructed topology. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically existing components. The keypoints are further organized into a stable semantic scattering structure. Building upon this, we propose S3U-SAR, a physics-driven framework to localize semantic scattering keypoints and construct the complete representation constrained by multi-dimensional physical priors, including scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.


[99] DaX: Learning General Pathology Representations Across Scales eess.IV | cs.AI | cs.CV

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao

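TL;DR: DaX is a vision foundation model for computational pathology. It adapts DINOv3-style self-supervised learning with continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency, aiming to learn general pathology representations across scales. The model performs strongly on a large-scale WSI-level benchmark of 161 clinical tasks, achieving the highest mean performance across multiple clinical domains and task categories.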

Details

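Motivation: Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. Existing methods fall short in connecting local cellular morphology with global tissue architecture and in stabilizing dense token-level representations across input scales.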

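Result: On a benchmark of 161 clinical tasks drawn from 44 public datasets, 28,182 patients, and 34,394 slides, and under a fixed patient-level cross-validation protocol, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores across task categories including diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis.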

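Insight: Innovations include: initializing from natural-image DINOv3 weights and adapting them to whole-slide histopathology; introducing continuous magnification training and cross-scale tissue views to connect local and global information; using orientation-agnostic and acquisition-robust augmentation together with multi-input-size training to improve robustness; and proposing Gram-anchored dense consistency to stabilize dense token representations across scales. The paper also builds a standardized, reproducible WSI-level benchmarking framework for evaluating future pathology foundation models.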

Abstract: Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.


[100] Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition eess.IV | cs.CV

Sonalika Singh, Jyotirindra Dandapat, Avishi Razdan, Kshipra V. Moghe, Puneet Gupta

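TL;DR: This paper addresses cultural bias in dynamic facial expression recognition (DFER) by introducing GCC-FER, a new hybrid multicultural video dataset covering four cultural groups and seven basic expressions. Building on this dataset, it also proposes a Culture-Aware FER (CA-FER) system that mitigates cultural bias by adaptively recalibrating latent facial representations. Experiments show that the system consistently improves FER performance in multicultural settings on both the GCC-FER and DFEW benchmarks.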

Details

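Motivation: Existing FER systems typically assume that emotional expressions are universally consistent across populations, yet cultural differences significantly affect FER performance, and the scarcity of culturally diverse benchmark datasets is a major obstacle to advancing cross-cultural FER.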

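Result: Extensive experiments on the proposed GCC-FER dataset and the DFEW benchmark show that the proposed CA-FER system consistently improves FER performance in multicultural settings.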

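Insight: The main innovations are the construction of the first large-scale global cross-cultural DFER dataset (GCC-FER) and the design of a culture-aware FER system (CA-FER) that mitigates cultural bias through cultural priors and adaptive representation recalibration, offering a new direction for building more inclusive and robust affective computing systems.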

Abstract: Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This cultural variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group, along with a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.


stat.AP [Back]

[101] An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections stat.AP | cs.CV

Parvez Anowar

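TL;DR: This paper proposes an integrated roadside sensing and communication framework for protecting vulnerable road users (VRUs) at signalized intersections. The framework combines LiDAR, radar, RGB camera, and thermal camera for multi-modal perception, performs prediction and surrogate-safety analytics at the edge, uses V2X and P2X communication, and applies adaptive signal control. An empirical case study based on R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, characterizes the distribution of VRUs and makes the case for multi-modal sensing and edge analytics.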

Details

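Motivation: Vulnerable road users (such as pedestrians) account for nearly half of urban traffic deaths, and crashes are concentrated at intersections. Existing protection systems are mostly single- or dual-sensor deployments; none integrates multi-modal sensing, edge-side near-miss analytics, bidirectional V2X/P2X communication, and adaptive signal control in a single intersection cabinet.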

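Result: Analysis of the R-LiViT dataset (200 multi-modal sequences and 2,400 annotated RGB-T frames) shows that VRUs make up about 49% of all road-user observations; that from day to night, pedestrian and vehicle densities drop by 38% and 45% respectively, while the share of close-proximity events is higher at night; that per-frame close-proximity event counts differ by up to about 10-fold across locations; and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These results support multi-modal sensing and edge-side analytics.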

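Insight: The innovation lies in proposing a complete intersection VRU protection framework that integrates sensing, computation, communication, and actuation, and in grounding it in an empirical study on R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset. Viewed objectively, its systematic design, which combines multi-sensor fusion, edge analytics, V2X/P2X communication, and adaptive control, offers a reference integrated solution for VRU safety in intelligent transportation systems, and it emphasizes adaptive, context-sensitive deployment (for example by time of day and location) over uniform single-sensor solutions.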

Abstract: Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.


cs.SD [Back]

[102] Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations cs.SD | cs.AI | cs.CL

Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

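TL;DR: This paper systematically analyzes a BigVGAN-based multilingual multi-speaker unit vocoder, studying the entanglement of phonetic, speaker, and language information in discrete speech units, with a focus on how cluster size and conditioning strategy affect intelligibility, speaker similarity, and unit-level metrics.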

Details

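Motivation: Discrete speech units obtained by k-means clustering of self-supervised embeddings mix phonetic, speaker, and language information, which causes speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. The paper sets out to address this problem.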

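Result: Experiments on four Indian languages show that cluster size governs intelligibility by improving phonetic discriminability, that explicit speaker conditioning is essential for preventing identity collapse, and that language supervision yields additional gains mainly at smaller cluster sizes.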

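Insight: The key finding is that similar phonemes across languages merge into the same cluster IDs at smaller cluster inventories, while larger inventories progressively separate them, which offers useful guidance for designing more effective multilingual speech generation systems.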

Abstract: Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger clusters progressively separating them.
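
The collapse of similar sounds into one unit at small inventories can be illustrated with a toy k-means run. This is a minimal sketch under stated assumptions: the 2-D points stand in for self-supervised frame embeddings, the deterministic initialization from the first k points is a simplification, and none of the names come from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point (this index is the discrete unit ID)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (toy choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return [nearest(p, centroids) for p in points]


# Hypothetical frames: indices 0 and 3 are a phoneme in language A, indices 2 and 5
# are a similar phoneme in language B, and indices 1 and 4 are a distant sound.
frames = [(0.0, 0.0), (5.0, 5.0), (0.5, 0.0), (0.1, 0.1), (5.1, 4.9), (0.6, 0.1)]

units_k2 = kmeans(frames, 2)  # small inventory
units_k3 = kmeans(frames, 3)  # larger inventory
print(units_k2, units_k3)
```

With two clusters the two similar phonemes share a unit ID, and with three they separate, which mirrors the qualitative trend the abstract describes.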


[103] HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec cs.SD | cs.AI | cs.CL

Arjun Gangwar, S Umesh

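TL;DR: HybridCodec is a fast dual-stream neural audio codec that unifies two paradigms for semantic-acoustic disentanglement. It uses separate semantic and acoustic branches and distills self-supervised learning (SSL) representations into the semantic stream, achieving strong disentanglement without an SSL model at inference time. It shows superior semantic specialization and competitive reconstruction quality, and is 3x faster than existing dual-stream models.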

Details

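Motivation: With the rise of multimodal large language models, neural audio codecs have become increasingly popular as speech tokenizers. Existing methods introduce semantic information through two main paradigms: one distills SSL representations into the first residual vector quantization (RVQ) layer, while the other maintains separate semantic and acoustic streams. This paper aims to unify the two paradigms to improve disentanglement efficiency and performance.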

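Result: On the in-domain test set, HybridCodec shows superior semantic specialization (RVQ-1) and competitive reconstruction quality (RVQ-all). It is robust in out-of-domain and zero-shot cross-lingual settings and is 3x faster than existing dual-stream models.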

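Insight: The innovation is combining a separate-branch architecture with SSL distillation, achieving strong semantic-acoustic disentanglement without an SSL model at inference time. This provides a unified framework for efficient, robust audio coding and could speed up speech processing in multimodal language models.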

Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introducing semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.
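
For readers unfamiliar with residual vector quantization (RVQ), the generic mechanism can be sketched as follows. This is a toy illustration with hand-picked codebooks, not HybridCodec's implementation. Reading "RVQ-1" as decoding from the first layer only and "RVQ-all" as decoding from all layers is an assumption here, since the abstract does not define the terms.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual left by the previous ones."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = min(range(len(codebook)), key=lambda i: sq_dist(residual, codebook[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks, num_layers=None):
    """Sum the selected codewords from the first num_layers layers (all layers by default)."""
    used = len(codes) if num_layers is None else num_layers
    out = [0.0] * len(codebooks[0][0])
    for layer in range(used):
        word = codebooks[layer][codes[layer]]
        out = [o + w for o, w in zip(out, word)]
    return out


# Hypothetical codebooks: a coarse first layer and a finer second layer.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
frame = [1.2, 0.8]

codes = rvq_encode(frame, codebooks)
first_layer_only = rvq_decode(codes, codebooks, num_layers=1)
all_layers = rvq_decode(codes, codebooks)
print(codes, first_layer_only, all_layers)
```

The first layer alone gives a coarse reconstruction, and each extra layer reduces the remaining error, which is why codecs that distill semantic content into the first layer evaluate it separately from full reconstruction.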


[104] MMAE: A Massive Multitask Audio Editing Benchmark cs.SD | cs.CL | cs.MM

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu

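TL;DR: This paper presents MMAE (Massive Multitask Audio Editing benchmark), the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. MMAE covers 7 audio modalities, 6 levels of task complexity, 2 levels of granularity, and 8 operation types, and comprises 2,000 high-quality samples with a rubric-based evaluation framework. Evaluation of leading models shows that current systems have very low Exact Match Rates on complex tasks and remain far from reliable editing.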

Details

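Motivation: With the shift toward intelligent creation, interactive editing has expanded from visual domains into audio, but the existing evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations.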

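Result: Evaluating leading models on MMAE shows that the Exact Match Rate (EMR) stays below 5% and falls to 0% on complex mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness.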

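Insight: The innovation is the first large-scale, multitask audio editing benchmark covering a broad range of real-world scenarios, together with a rubric-based evaluation framework that decomposes free-form tasks into verifiable criteria, enabling precise, multi-dimensional assessment of instruction following and context consistency.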

Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
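
One plausible reading of rubric-based scoring is sketched below. The abstract does not define the Exact Match Rate precisely, so this sketch assumes a sample counts as an exact match only when every one of its rubric criteria is satisfied; the function names and toy verdicts are hypothetical.

```python
def exact_match_rate(samples):
    """Fraction of samples whose rubric criteria are ALL satisfied (assumed EMR definition)."""
    return sum(1 for criteria in samples if all(criteria)) / len(samples)


def criterion_pass_rate(samples):
    """Fraction of individual criteria satisfied, pooled over all samples."""
    total = sum(len(criteria) for criteria in samples)
    passed = sum(sum(1 for ok in criteria if ok) for criteria in samples)
    return passed / total


# Hypothetical per-sample rubric verdicts (True = criterion satisfied).
verdicts = [
    [True, True, True],
    [True, False, True],
    [True, True],
    [False, False, True, True],
]

emr = exact_match_rate(verdicts)
pass_rate = criterion_pass_rate(verdicts)
print(emr, pass_rate)
```

Under this reading, a system can satisfy most individual criteria and still score a low EMR, because a single failed criterion disqualifies the whole sample.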


cs.MA [Back]

[105] MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring cs.MA | cs.CL

Ali Keramati, Shiyuan Zhou, Sharad Mehrotra, Mark Warschauer

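TL;DR: MADRAG is a training-free framework for analytic essay scoring that combines multi-agent debate with retrieval-augmented generation. It decomposes evaluation into interacting roles (an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score) and calibrates the score with rubric-aligned exemplar retrieval, improving scoring reliability and stability.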

Details

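Motivation: To address the bias and instability of standard LLM-as-judge approaches to essay scoring, and to make LLM-based evaluation more reliable through structured interaction and external memory.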

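Result: MADRAG significantly outperforms prompt-based baselines and approaches the performance of supervised systems without task-specific training. Ablation studies show that retrieval drives the calibration gains, while debate improves reasoning on higher-level traits.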

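Insight: The innovation is combining multi-agent debate with retrieval-augmented generation, achieving calibration through role specialization and exemplar retrieval, which highlights the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.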

Abstract: We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.
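
The Advocate, Skeptic, and Judge flow described above can be sketched as a small orchestration function. Everything here is a structural illustration under stated assumptions: the three agents are stand-in callables rather than LLM calls, the word-overlap retrieval is a toy substitute for whatever retriever the paper uses, and the stub Judge simply averages exemplar scores.

```python
def retrieve_exemplars(essay, bank, k=2):
    """Toy retrieval: rank scored exemplars by word overlap (Jaccard) with the essay."""
    words = set(essay.lower().split())

    def overlap(item):
        other = set(item[0].lower().split())
        return len(words & other) / len(words | other)

    return sorted(bank, key=overlap, reverse=True)[:k]


def score_essay(essay, trait, bank, advocate, skeptic, judge, k=2):
    """Run one debate round, then let the Judge score with retrieved exemplars in view."""
    strengths = advocate(essay, trait)
    weaknesses = skeptic(essay, trait)
    exemplars = retrieve_exemplars(essay, bank, k)
    return judge(essay, trait, strengths, weaknesses, exemplars)


# Hypothetical stand-ins for the three LLM roles.
def stub_advocate(essay, trait):
    return "The thesis is stated clearly."


def stub_skeptic(essay, trait):
    return "The supporting evidence is thin."


def stub_judge(essay, trait, strengths, weaknesses, exemplars):
    # A real Judge would weigh both arguments; this stub only calibrates on exemplar scores.
    return round(sum(score for _, score in exemplars) / len(exemplars))


# Hypothetical exemplar bank: (essay text, human score for the trait).
bank = [
    ("the thesis is clear and the evidence is strong", 5),
    ("no thesis and no evidence at all", 1),
    ("the thesis is clear but evidence is thin", 3),
]
essay = "the thesis is clear but the evidence is thin"

top = retrieve_exemplars(essay, bank)
final_score = score_essay(essay, "argumentation", bank, stub_advocate, stub_skeptic, stub_judge)
print(top, final_score)
```

Passing the roles in as callables keeps the control flow testable without a model, and a real deployment would replace each stub with a prompted LLM call.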


cs.AI [Back]

[106] OpenSkill: Open-World Self-Evolution for LLM Agents cs.AI | cs.CL | cs.LG

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang

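TL;DR: This paper proposes OpenSkill, a framework for open-world self-evolution of LLM agents. Without any target-task supervision and using only open resources such as documentation, code repositories, and the web, the framework acquires grounded knowledge and verification anchors, synthesizes transferable skills, and refines those skills on virtual tasks built from the anchors, so that the agent builds both its skills and its verification signals from scratch.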

Details

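Motivation: Existing self-evolution methods rely on a usable learning loop, such as curated skills, successful trajectories, or verifier signals, whereas a real open-world deployment may provide only a task prompt and none of these resources. This paper studies how an agent can build its skills and its own verification signals from scratch, using only open-world resources and no target-task supervision.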

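Result: Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows that its skills transfer across models without model-specific adaptation and that its self-built verifier aligns with ground-truth outcomes despite never accessing them.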

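Insight: The innovation is a framework that bootstraps the learning loop entirely from open-world resources, treating the open world as both the source of knowledge and a supervision-independent practice environment. The core idea is to build verification anchors and virtual tasks from resources such as documentation, so that skills can be acquired and refined without target-task supervision, offering a new approach to unsupervised agent self-evolution.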

Abstract: Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.


[107] Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization cs.AI | cs.CV

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man

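TL;DR: This paper proposes a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for open-vocabulary audio-visual event localization (OV-AVEL). The framework builds a heterogeneous graph in Euclidean space containing audio and visual segment nodes and their corresponding video-level nodes, applies a dual-threshold gated fusion strategy and cross-level semantic constraints, and finally maps the multi-level representations into hyperbolic space to model hierarchical relationships, addressing the challenges of cross-modal consistency and cross-level semantic consistency.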

Details

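Motivation: Existing OV-AVEL methods learn joint representations in Euclidean space and face two major challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment-level and video-level semantics prevents the model from establishing semantic consistency across levels.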

Result: Extensive experiments show that the method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate its effectiveness.

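Insight: Innovations include: 1) building a heterogeneous hierarchical graph with multi-directional temporal edges to capture complete temporal information; 2) a dual-threshold gated fusion strategy that introduces cross-modal information only when confidence is high; 3) bidirectional semantic constraints between segment-level and video-level representations; and 4) mapping the multi-level representations and text prototypes uniformly into hyperbolic space, with a hierarchical entailment regularization loss characterizing the hierarchical relationships. This offers a new graph modeling and geometric representation learning approach to open-vocabulary cross-modal temporal localization.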

Abstract: Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
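
Mapping Euclidean features into hyperbolic space can be sketched with the exponential map at the origin of the Poincaré ball. The abstract does not say which hyperbolic model or curvature the authors use, so the unit Poincaré ball (curvature -1) and all names below are assumptions for illustration.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v):
    """Exponential map at the origin of the unit Poincare ball: tanh(|v|) * v / |v|."""
    n = norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(n) / n
    return [scale * x for x in v]


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit ball."""
    diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1 - sum(a * a for a in x)) * (1 - sum(b * b for b in y))
    return math.acosh(1 + 2 * diff / denom)


# Hypothetical Euclidean features with small and large norms.
short_vec = [0.3, 0.4]   # Euclidean norm 0.5
long_vec = [1.2, 1.6]    # Euclidean norm 2.0
origin = [0.0, 0.0]

short_pt = exp_map_origin(short_vec)
long_pt = exp_map_origin(long_vec)
print(norm(short_pt), norm(long_pt))
print(poincare_distance(origin, short_pt), poincare_distance(origin, long_pt))
```

Both mapped points stay inside the unit ball, and their hyperbolic distance from the origin equals twice the Euclidean norm of the input. Distances grow rapidly near the boundary, which is the property that makes hyperbolic space convenient for tree-like hierarchies.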