Table of Contents
- cs.CL [Total: 38]
- cs.CV [Total: 99]
- cs.RO [Total: 3]
- cs.AI [Total: 4]
- stat.ML [Total: 2]
- physics.optics [Total: 1]
- eess.IV [Total: 1]
- cs.MM [Total: 1]
- eess.SP [Total: 1]
- cs.SE [Total: 1]
- cs.IR [Total: 1]
- cs.SD [Total: 1]
- cs.LG [Total: 10]
- cs.GR [Total: 1]
- cond-mat.mtrl-sci [Total: 1]
cs.CL
[1] Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas cs.CL | cs.AI | cs.LG
Jon-Paul Cacioli
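TL;DR: This study analyzes the metacognitive monitoring of 33 frontier LLMs across six knowledge domains of the MMLU benchmark. Every model with above-chance aggregate monitoring shows non-trivial domain-level variation: Applied/Professional knowledge is the easiest domain to monitor, Formal Reasoning and Natural Science are the hardest, and monitoring profiles cluster significantly within several model families.
Details
Motivation: Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. The study examines how LLMs' metacognitive monitoring differs across fine-grained knowledge domains.
Result: Across the six MMLU domains, Applied/Professional knowledge is monitored best (mean AUROC = 0.742), while Formal Reasoning and Natural Science are monitored worst. Monitoring profiles cluster significantly within the Anthropic, Google-Gemini, and Qwen families. Gemma 4 31B improves AUROC by 0.202 over Gemma 3 27B.
Insight: Aggregate metrics hide stable performance variation in specific knowledge domains, which supports benchmark-stage domain screening before deployment. The study also confirms probe-format specificity: verbalized confidence ratings and binary probes behave differently when used to assess metacognition.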
Abstract: Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall’s W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
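The Type-2 AUROC used above measures how well a model's stated confidence separates its own correct answers from its incorrect ones. A minimal sketch of that computation follows; the function name and the toy data are illustrative assumptions, not taken from the paper, and the paper's exact implementation may differ (for example in tie handling).

```python
def type2_auroc(confidences, correct):
    """Probability that a randomly chosen correct answer received higher
    confidence than a randomly chosen incorrect one (ties count as 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): verbalized confidence 0-100 and correctness.
conf = [90, 80, 70, 60, 50, 40]
ok = [True, True, False, True, False, False]
auroc = type2_auroc(conf, ok)
print(round(auroc, 3))
```

A value of 0.5 means confidence carries no information about correctness; 1.0 means every correct answer was rated above every incorrect one.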
[2] MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes cs.CL | cs.AI | cs.HC | cs.MM | cs.SD | eess.AS
Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis
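TL;DR: This paper introduces MIST, a multimodal interactive speech-based tool-calling dataset for smart-home settings. It targets the challenges of voice-based IoT device control: complex spatiotemporal constraints, dynamic state tracking, and mixed-initiative interaction.
Details
Motivation: As IoT devices proliferate, voice interfaces must handle complex user experiences. Modeling real-world IoT devices with existing LLMs remains an understudied problem that combines spatiotemporal constraints, speech inputs, dynamic state tracking, and mixed-initiative interaction patterns.
Result: On MIST there is a significant performance gap between open- and closed-weight multimodal LLMs, and even frontier closed-weight models have substantial headroom.
Insight: The contribution is MIST, a synthetic multi-turn, voice-driven code generation dataset, together with an extensible data generation framework, intended to facilitate research on mixed-initiative voice assistants that reason about physical-world constraints.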
Abstract: The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.
[3] MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text cs.CL | cs.AI
Chenjun Li, Cheng Wan, Johannes C. Paetzold
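TL;DR: This paper proposes MELD (Multi-Task Equilibrated Learning Detector), a deployable model for detecting AI-generated text. MELD augments binary detection with auxiliary supervision on generator family, attack type, and source domain, and uses EMA teacher-student distillation and a hard-negative ranking loss to improve robustness. On public benchmarks, MELD outperforms open-source detectors and is competitive with commercial models, particularly at low false-positive rates and under attack.
Details
Motivation: Existing AI-generated text detectors usually optimize only the binary classification task. Once that task saturates, the model has little incentive to learn generator, attack, or domain structure, which leaves it insufficiently robust to adversarial rewrites, unseen generators, domain shift, and low false-positive-rate requirements.
Result: On the public RAID leaderboard, MELD is the strongest open-source detector and is competitive with leading commercial models, especially under attack and at low false-positive rate (FPR). On standard held-out benchmarks, MELD matches or outperforms supervised baselines. On MELD-eval (an evaluation set built from recent chat models from four major LLM providers), MELD reaches 99.9% true-positive rate (TPR) at 1% FPR without additional fine-tuning, while many baselines degrade sharply.
Insight: The contributions are: 1) enriching representation learning for binary detection through multi-task learning (generator-family, attack-type, and source-domain classification); 2) balancing the multi-task losses with learned homoscedastic uncertainty weights; 3) improving adversarial robustness with EMA teacher-student distillation; 4) adding a hard-negative pairwise ranking loss that widens the score margin between AI text and the most confusable human text. Together these improve robustness under adversarial attack, domain shift, and low FPR, while keeping the same inference-time interface and cost as a standard detector.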
Abstract: Large language models are now embedded in everyday writing workflows, making reliable AI-generated text detection important for academic integrity, content moderation, and provenance tracking. In practice, however, a detector must do more than achieve high aggregate AUROC on clean, in-distribution human and AI text: it should remain robust to attacks and adversarial rewrites, transfer to unseen generators and domains, and operate at low false-positive rates (FPR). Most existing detectors optimize a single AI/Human objective, giving the representation little incentive to learn generator, attack, or domain structure once the binary task saturates. We introduce MELD (Multi-Task Equilibrated Learning Detector), a deployable detector for AI-generated text that enriches binary detection with auxiliary supervision. MELD attaches generator-family, attack-type, and source-domain heads to a shared encoder, and balances the four losses with learned homoscedastic uncertainty weights. To improve robustness, an EMA teacher predicts on clean inputs while an attack-augmented student is distilled toward the teacher. MELD further uses a hard-negative pairwise ranking loss to enlarge the score margin between AI-generated texts and the most confusable human texts. At inference, all auxiliary heads are discarded, giving MELD the same interface and cost as a standard detector. On the public RAID leaderboard, MELD is the strongest open-source detector and is competitive with leading commercial models, especially under attack and at low FPR. Across standard held-out benchmarks, MELD matches or outperforms supervised baselines. We further introduce MELD-eval, a held-out evaluation pool built from recent chat models released by four major LLM providers. Without additional finetuning, MELD achieves 99.9% TPR at 1% FPR on MELD-eval, while many baselines degrade sharply.
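The "TPR at 1% FPR" figure above is an operating-point metric: a score threshold is fixed so that at most 1% of human texts are misflagged, and the detection rate on AI texts is read off at that threshold. A minimal sketch under stated assumptions: the function name, the strict-inequality flagging rule, and the toy scores are all illustrative, not taken from MELD or RAID.

```python
def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """Return (tpr, threshold) for the lowest threshold whose false-positive
    rate on human text is at most max_fpr. A text is flagged as AI when its
    score is strictly greater than the threshold."""
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to misflag (epsilon guards
    # against float rounding such as 0.29 * 100 == 28.999...).
    allowed = int(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        threshold = float("-inf")  # every text may be flagged
    else:
        threshold = ranked[allowed]
    tpr = sum(1 for s in ai_scores if s > threshold) / len(ai_scores)
    return tpr, threshold


# Toy scores (illustrative only); a 10% budget is used because the toy
# human set has just ten items.
human = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
ai = [0.30, 0.50, 0.95, 0.99]
tpr, threshold = tpr_at_fpr(human, ai, max_fpr=0.10)
print(tpr, threshold)
```

With real data the human set has to be large enough for the budget to be meaningful: at 1% FPR, fewer than 100 human texts leave no room for any false positive.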
[4] NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models cs.CL
George Boateng, Naafi Ibrahim, Samuel John, Philemon Badu, Patrick Agyeman-Budu
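TL;DR: This paper presents NSMQ Riddles, a new benchmark of science and mathematics riddles drawn from Ghana's National Science and Maths Quiz (NSMQ), for evaluating large language models (LLMs). The benchmark contains about 1,800 multi-clue riddles. An evaluation of several advanced LLMs, including GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, finds that even state-of-the-art models perform worse than the best student contestants.
Details
Motivation: LLM evaluation in science and mathematics education relies mainly on Western datasets, with little representation from the Global South, and existing benchmarks are mostly multiple-choice questions that are easy to evaluate. This paper fills that gap with a more challenging open-ended QA benchmark based on a real competition.
Result: The evaluation covers closed models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models such as Kimi-K2.5 and DeepSeek-V3.1. Even the most advanced LLMs struggle on the dataset and perform worse than the best student contestants in the competition.
Insight: The novelty is an open-ended QA benchmark from the Global South (Ghana), based on a real educational competition and built around multi-clue progressive hints. It supports a more complete and fair assessment of LLMs' scientific and mathematical reasoning worldwide and exposes the limits of current models on complex, open-ended reasoning tasks.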
Abstract: Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana’s National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs’ capabilities for science and mathematics education.
[5] GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations cs.CL | cs.AI
Jyotika Singh, Fang Tu, Aziza Mirzadova, Amit Agarwal, Hitesh Laxmichand Patel
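TL;DR: This paper proposes GSM-SEM, a reusable, stochastic framework for generating semantically diverse benchmark variants. It targets the memorization problem caused by static test sets in math reasoning benchmarks such as GSM8K. The framework perturbs semantics by modifying entities, attributes, and/or relationships in the problem statement while preserving the original calculations/answer and approximate difficulty, producing new variants with higher semantic variance.
Details
Motivation: Leaderboard gains on benchmarks such as GSM8K can overstate true capability because models memorize fixed test sets. Most robustness variants apply only surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and statically released benchmarks can themselves become memorization targets.
Result: Applying GSM-SEM to GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus) yields GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Across 14 SOTA LLMs, performance drops consistently, with larger declines when semantic perturbations are combined with symbolic/plus variations (average drop rate of 28% in the strictest GSM-SEM configuration).
Insight: The novelty is a reusable stochastic generation framework that introduces substantive semantic changes by modifying entities, attributes, and relationships, so that models must recompute solutions under new conditions. This evaluates genuine reasoning rather than memorization more effectively. The framework is also shown to extend beyond GSM-style math problems to benchmarks such as BigBenchHard, LogicBench, and NLR-BIRD.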
Abstract: Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
[6] Self-Consolidating Language Models: Continual Knowledge Incorporation from Context cs.CL | cs.LG
Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan
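TL;DR: This paper proposes Self-Consolidating Language Models (SCoL), a post-training framework that helps LLMs incorporate and retain knowledge from continuous information streams such as long-context workflows. Using meta-reinforcement learning, the model learns to generate textual update instructions from the current context that specify which of its own Transformer layers to update, enabling continual consolidation while limiting interference with already consolidated information.
Details
Motivation: LLMs increasingly process streams of information (conversations, long contexts), but longer context windows do not guarantee that useful information is preserved and reused. The key question is how to keep writing the current context into model weights while limiting interference with previously consolidated knowledge.
Result: On SQuAD knowledge incorporation and LongBench v2 long-context consolidation, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential fine-tuning baselines. The learned update locations are sparse and align with layers of high Fisher information, suggesting the model learns to route plasticity toward loss-sensitive regions to reduce interference. SCoL also transfers from shorter meta-training streams to longer LongBench v2 evaluation streams, suggesting scalable streaming consolidation.
Insight: The novelty is a meta-reinforcement-learning self-consolidation framework in which the model itself decides which internal layers to update, giving continual, sparse knowledge incorporation that helps reduce catastrophic forgetting and makes long-context processing more efficient. The observed alignment between the learned update decisions and internal sensitivity (such as Fisher information) points to an interpretable and scalable mechanism for adapting to streaming data.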
Abstract: Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
[7] Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning cs.CL
Jin Cui, Xinyue Long, Xunyong Zhang, Yadong Zhang, Chuanchang Su
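TL;DR: This paper proposes RIS, a framework that uses retrieve, integrate, and synthesize mechanisms to build latent visual reasoning as a compatible extension of pretrained multimodal large language models (MLLMs). It addresses the insufficient manifold compatibility of existing latent reasoning methods and improves the faithfulness and effectiveness of fine-grained vision-language reasoning.
Details
Motivation: Most existing MLLMs compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods that reason in continuous hidden states suffer from insufficient manifold compatibility: latent trajectories drift, collapse into instance-agnostic patterns, and are bypassed during answer generation.
Result: On V*, HRBench4K, HRBench8K, MMVP, and BLINK, RIS improves consistently over closed/open-source baselines and latent reasoning baselines.
Insight: The contributions include: constructing a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions; anchoring latent tokens to spatial and semantic evidence and enforcing their causal role through a progressive attention bottleneck; and introducing short language transition tokens that bridge synthesized latent states back to vocabulary-aligned decoding. The method learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.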
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.
[8] Structural Rationale Distillation via Reasoning Space Compression cs.CL | cs.AI | cs.LG
Jialin Yang, Jiankun Wang, Jiajun Wu, Henry Leung, Jiayu Zhou
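TL;DR: This paper proposes D-RPC, a distillation method for the case where, in distilling reasoning from large language models into smaller ones, the teacher's rationales for similar problems are inconsistent in structure and strategy. D-RPC maintains a compact, reusable bank of high-level reasoning paths and constrains the teacher to follow the most relevant retrieved path, producing rationales that are consistent yet diverse and therefore a cleaner supervision signal.
Details
Motivation: In existing reasoning distillation, teacher rationales for similar problems vary widely in structure and strategy. This noisy supervision burdens the student model and is hard to internalize.
Result: On five math and commonsense reasoning benchmarks (such as GSM8K and MATH) with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
Insight: The novelty is compressing reasoning paths into a dynamically maintained bank that constrains the consistency of the teacher's output, a new structured-supervision paradigm for knowledge distillation. The PAC-Bayes analysis formalizes the trade-off between bank size and coverage and identifies an optimal intermediate size that the ablations confirm, giving theoretical guidance for balancing supervision entropy against coverage gaps.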
Abstract: When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
[9] Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs cs.CL
Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su
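TL;DR: This paper examines whether reinforcement learning (RL) helps large language models (LLMs) directly recall parametric knowledge, as opposed to reason. In a zero-shot, one-hop, closed-book QA setting, training only on binary correctness rewards with fact-level deduplication, RL yields about 27% average relative gains. The main mechanism is a redistribution of probability mass over existing knowledge, moving correct answers from the low-probability tail into reliable generations.
Details
Motivation: To test whether RL can improve LLMs' direct recall of parametric knowledge, not only their performance on reasoning tasks, and so broaden the scope of RL for LLMs.
Result: Across three model families and multiple factual QA benchmarks, RL achieves about 27% average relative gains, surpassing both training-time and inference-time baselines. The data-attribution study shows that the hardest examples (those whose answers never appear in 128 pre-RL samples) drive about 83% of the gain.
Insight: The paper repositions RL as a tool for unlocking latent parametric knowledge rather than acquiring new facts. It shows that RL improves recall mainly by redistributing probability mass, and that the hardest examples contribute most to the training gain.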
Abstract: Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.
[10] Rethinking Experience Utilization in Self-Evolving Language Model Agents cs.CL
Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yanyan Zhao, Yu Zhang
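TL;DR: This paper proposes ExpWeaver, a lightweight method that rethinks how self-evolving language model agents use experience. It treats experience as an optional resource during reasoning, invoked only when additional guidance is needed, instead of injecting it at initialization or at every step. Experiments show that it achieves the best performance across multiple frameworks, model backbones, and environments, and that the behavior can be further strengthened with reinforcement learning.
Details
Motivation: Existing work on self-evolving agents focuses on how experience is constructed, represented, and updated, and pays less attention to how experience should be used in runtime decision-making. As a result, agents rely on rigid usage strategies (injection at initialization or at every step) and do not invoke experience according to the needs of the current decision.
Result: Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among the utilization strategies compared. Reinforcement learning experiments further show that the behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses show that ExpWeaver lets agents invoke experience selectively, at beneficial decision points and under higher reasoning uncertainty.
Insight: The core idea is to treat experience utilization as a critical design dimension and to propose a lightweight method for on-demand, selective invocation of experience. This shifts the research focus from what experience to store toward how and when experience should enter decision-making, and suggests a route to more efficient and flexible self-evolving agents.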
Abstract: Self-evolving agents improve by accumulating and reusing experience from past interactions. Existing work has largely focused on how experience is constructed, represented, and updated, while paying less attention to how experience should be used during runtime decision-making. As a result, most agents rely on rigid usage strategies, either injecting experience once at initialization or at every step, without considering whether it is needed for the current decision. This paper studies experience utilization as a critical design dimension of self-evolving agents. We ask whether agents benefit from interweaving experience use with decision-making, so that experience is invoked only when additional guidance is needed. To examine this question, we introduce ExpWeaver, a lightweight instantiation that leaves experience construction unchanged and modifies only runtime utilization by exposing experience as an optional resource during reasoning. Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among different utilization strategies. Reinforcement learning experiments further show that this behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses reveal that ExpWeaver enables agents to invoke experience selectively, at beneficial decision points, and under higher reasoning uncertainty. Overall, our findings call for a shift from merely studying what experience to store toward understanding how and when experience should enter decision-making.
[11] Learning Agent Routing From Early Experience cs.CL
Yimin Wang, Jiahao Qiu, Xuan Qi, Xinzhe Juan, Jingzhe Shi
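TL;DR: This paper proposes BoundaryRouter, a training-free routing framework that routes queries between lightweight LLM inference and full agent execution. It uses early behavioral experience and rubric-guided reasoning to decide whether to answer directly with the LLM or escalate to an agent, reducing latency and compute cost in cold-start settings.
Details
Motivation: LLM agents perform well on complex reasoning tasks but incur high latency and compute cost. Many queries fall within the capability of cutting-edge LLMs and do not need full agent execution, so routing effectively between LLMs and agents is a key challenge.
Result: On the RouteBench benchmark (covering in-domain, paraphrased, and out-of-domain routing settings), BoundaryRouter reduces inference time by 60.6% compared with the agent while improving performance by 28.6% over direct LLM inference, and outperforms prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
Insight: The novelty is a routing framework that needs no training: it builds a compact experience memory by executing both systems on a shared seed set, then retrieves similar cases at inference time to guide routing decisions. This gives efficient query allocation in cold-start scenarios and balances performance against efficiency.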
Abstract: LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
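The retrieval half of the routing idea above can be sketched as a nearest-neighbour vote over an experience memory. This is a simplified illustration, not the BoundaryRouter implementation: the memory schema (`query`, `llm_solved`), the Jaccard word-overlap similarity, and the majority-vote rule are all assumptions made for the example, and the rubric-guided reasoning step described in the abstract is omitted.

```python
def tokens(text):
    """Lowercased bag of words; a stand-in for a real text embedding."""
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def route(query, memory, k=3):
    """Return "llm" if most of the k most similar seed cases were solved by
    direct LLM inference, otherwise "agent"."""
    q = tokens(query)
    ranked = sorted(memory, key=lambda case: jaccard(q, tokens(case["query"])),
                    reverse=True)
    nearest = ranked[:k]
    solved = sum(1 for case in nearest if case["llm_solved"])
    return "llm" if solved * 2 > len(nearest) else "agent"


# Hypothetical seed set: each query was run through both systems once and
# the outcome of the direct LLM attempt was recorded.
memory = [
    {"query": "what is the capital of france", "llm_solved": True},
    {"query": "what is the capital of japan", "llm_solved": True},
    {"query": "what is the boiling point of water", "llm_solved": True},
    {"query": "book a flight and compare hotel prices", "llm_solved": False},
    {"query": "scrape the site and summarize price changes", "llm_solved": False},
    {"query": "compare hotel prices across three sites", "llm_solved": False},
]

print(route("what is the capital of italy", memory))
print(route("compare hotel prices and book a flight", memory))
```

Because the memory is filled by running both systems on a small seed set, nothing is trained; routing quality depends on how well the seed set covers the queries seen later.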
[12] The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval cs.CL | cs.AI
Zekai Tong, Ruiyao Xu, Aryan Shrivastava, Chenhao Tan, Ari Holtzman
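TL;DR: This paper studies how robust large language models are to imperfect text in information retrieval tasks. When words are split into fragments by inserted whitespace characters, detection accuracy follows a U-shaped curve as the insertion rate increases, a phenomenon the authors call the "Text Uncanny Valley".
Details
Motivation: Existing LLM benchmarks focus on syntactically correct inputs and rarely evaluate imperfect text. This paper examines how word-boundary corruption affects an LLM's ability to detect targeted information.
Result: In the information retrieval task, LLM F1 scores follow a U-shaped curve as the whitespace insertion rate increases. In a math reasoning task, Gemini 3.0 Flash shows a similar U-shape but stronger models do not, suggesting the effect is more pronounced in tasks that depend on exact lexical alignment.
Insight: The paper proposes a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, and the bottom of the valley is the disordered transition where neither mode is effective. The finding exposes a failure mode that clean-text benchmarks cannot see and that is directly relevant to deployments involving noisy or uncurated text.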
Abstract: Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects LLMs’ ability to detect targeted information. When whitespace characters are inserted within words to break them into fragments, LLMs’ detection accuracy follows a U-shaped curve as the insertion rate increases. We refer to this curve as the Text Uncanny Valley. To explain this observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
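The word-boundary corruption described above can be illustrated with a small perturbation function. This is a sketch under stated assumptions: inserting a single space at each interior character boundary with independent probability `rate` is one plausible reading of "insertion rate", and the function name and seeding are illustrative; the paper's actual perturbation procedure may differ.

```python
import random


def insert_whitespace(text, rate, seed=0):
    """Insert a space at each interior character boundary of every word
    with probability `rate` (0.0 leaves the text unchanged, 1.0 separates
    every character)."""
    rng = random.Random(seed)
    out_words = []
    for word in text.split(" "):
        pieces = [word[:1]]
        for ch in word[1:]:
            if rng.random() < rate:
                pieces.append(" ")
            pieces.append(ch)
        out_words.append("".join(pieces))
    return " ".join(out_words)


# Sweep a few insertion rates on a toy sentence (illustrative only).
for rate in (0.0, 0.3, 1.0):
    print(rate, repr(insert_whitespace("retrieve the target phrase", rate)))
```

At a rate of 1.0 the original word boundaries can no longer be told apart from the inserted spaces, which matches the intuition that heavily fragmented text has to be read character by character.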
[13] Teaching Language Models to Think in Code cs.CL
Hyeon Hwang, Jiwoo Lee, Jaewoo Kang
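TL;DR: This paper proposes ThinC, a framework in which the language model uses code as the reasoner itself and not as a tool, with reasoning carried forward by the execution outputs passed between code blocks. It addresses problems in conventional tool-integrated reasoning, where code serves only as a post-hoc verifier and intermediate natural-language computations are error-prone.
Details
Motivation: In the current tool-integrated reasoning paradigm, code acts only as a post-hoc verifier, intermediate natural-language computations are error-prone, and natural language and code play overlapping roles. The paper proposes a framework in which code itself carries the reasoning.
Result: On five competition-level math benchmarks, ThinC-4B consistently outperforms every TIR baseline and even surpasses the much larger Qwen3-235B-A22B-Thinking.
Insight: The paper promotes code from tool to reasoner, with reliable reasoning obtained by connecting code blocks only through their execution outputs. 99.2% of final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate natural-language reasoning.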
Abstract: Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
[14] PaT: Planning-after-Trial for Efficient Test-Time Code Generation cs.CL | cs.LG
Youngsik Yoon, Sungjae Lee, Seockbean Song, Siwei Wang, Wei Chen
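TL;DR: This paper proposes Planning-after-Trial (PaT), an adaptive policy that improves the test-time compute efficiency of large language models on code generation. It invokes a planner only when verification fails, avoiding the unnecessary planning overhead that the conventional Planning-before-Trial (PbT) policy incurs on simple problems. It also allows a cost-efficient model to handle generation attempts while a powerful model is reserved for targeted planning interventions.
Details
Motivation: Most existing methods adopt a rigid Planning-before-Trial (PbT) policy that incurs planning overhead even on directly solvable problems, so test-time compute is allocated inefficiently. The paper addresses this with an adaptive policy that allocates compute more efficiently.
Result: Across multiple benchmarks and model families, the method significantly advances the cost-performance Pareto frontier. In particular, the heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.
Insight: The core idea is to move planning after the trial and trigger it only on verification failure, which allocates compute adaptively. The "try first, plan later" pattern and the heterogeneous configuration (a lightweight model generates, a powerful model plans) are an effective way to use model capability and reduce inference cost, and may carry over to other tasks that need test-time reasoning.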
Abstract: Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.
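The control flow of the policy described above (attempt first, plan only after a failed verification) can be sketched as follows. The function names, the `guidance` argument, and the stub models are illustrative assumptions; the paper's actual interfaces, verifier, and retry budget are not specified in the abstract.

```python
def planning_after_trial(problem, generate, verify, plan, max_rounds=3):
    """Try a cheap generation first; call the planner only after a failed
    verification. Returns (solution or None, number of planner calls)."""
    planner_calls = 0
    guidance = None
    for attempt in range(max_rounds):
        candidate = generate(problem, guidance)
        if verify(candidate):
            return candidate, planner_calls
        if attempt + 1 < max_rounds:
            # Only pay for planning when another attempt will use the plan.
            guidance = plan(problem, candidate)
            planner_calls += 1
    return None, planner_calls


# Hypothetical stand-ins: a small model that solves the hard problem only
# when given a plan, a trivial verifier, and a strong planner.
def cheap_generate(problem, guidance):
    if problem == "easy" or guidance is not None:
        return "correct"
    return "wrong"


def verify(candidate):
    return candidate == "correct"


def strong_plan(problem, failed_candidate):
    return f"plan for {problem} after {failed_candidate}"


print(planning_after_trial("easy", cheap_generate, verify, strong_plan))
print(planning_after_trial("hard", cheap_generate, verify, strong_plan))
```

On the easy problem the planner is never called, which is where the saving over a plan-first policy comes from; the heterogeneous setup corresponds to passing a small model as `generate` and a large one as `plan`.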
[15] From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs cs.CL
Hanmeng Liu, Shichao Weng, Xiulai Liu, Zhicai Zhang, Anli Yan
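TL;DR: This paper proposes LogiHard, a framework that systematically raises reasoning difficulty by transforming 0-order multiple-choice selection into 2-order logical judgment, and uses Item Response Theory for adaptive testing. The resulting LogiHard-2k dataset causes accuracy drops of 31% to 56% across 12 frontier LLMs, exposing systematic weaknesses in compositional reasoning.
Details
Motivation: Multiple-choice reasoning benchmarks suffer from rapid saturation and data contamination, and ad-hoc hardening methods tend to sacrifice logical validity, so they fail to challenge advanced reasoning models.
Result: Across 12 SOTA models, combinatorially hardened questions reduce accuracy by 31% to 56%. In zero-shot transfer to MMLU, accuracy falls from 89.84% to 42.86%.
Insight: The paper builds a difficulty-controlled reasoning benchmark through analysis of model thinking traces and combinatorial transformation. It identifies multi-select failure and early-exit bias as LLM-specific phenomena, indicating that the deficit comes from a combinatorial reasoning gap and not from missing knowledge.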
Abstract: Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, failing to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from multi-select failure and early-exit bias, which are not shared by human test takers. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degradation is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.
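A short worked calculation on the MMLU transfer figures quoted above. The reported 47% matches the absolute drop in percentage points, not the relative drop; this reading is an inference from the two accuracies in the abstract, not a statement by the authors.

```latex
\begin{align*}
\Delta_{\text{abs}} &= 89.84 - 42.86 = 46.98 \text{ percentage points} \approx 47 \\
\Delta_{\text{rel}} &= \frac{46.98}{89.84} \approx 0.523 = 52.3\%
\end{align*}
```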
[16] MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs cs.CL | cs.AI
Hsin-Ling Hsu, Zizheng Wang, Donghua Zhang, Nai-Chia Chen, Jerry Wang
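TL;DR: This paper proposes MedAction, a tree-structured distillation pipeline for training active, multi-turn clinical diagnostic LLMs, targeting the weakness of existing LLMs in dynamic real-world diagnosis with incomplete information. It synthesizes high-quality multi-turn diagnostic trajectories through LLM-environment interaction and builds the MedAction-32K dataset of 32,681 trajectories. An 8B model fine-tuned on this dataset reaches state-of-the-art performance among open-source models on MedR-Bench and MedAction-300-Hard.
Details
Motivation: LLM diagnosis is mostly evaluated in static, single-turn settings where complete patient information is given at once. Real clinical practice is instead an active multi-turn process of initial observation, ordering tests, interpreting results, and updating the differential diagnosis, so models remain weak at diagnosing from dynamic, partial evidence.
Result: On MedR-Bench and the authors' MedAction-300-Hard benchmark, an 8B model fine-tuned on MedAction-32K achieves state-of-the-art performance among open-source models.
Insight: The novelty is a tree-structured distillation pipeline for active diagnosis that synthesizes multi-turn trajectories through LLM-environment interaction, together with two knowledge-graph-grounded trajectory quality filters (Disease Trajectory Consistency and Reasoning-Action Consistency) that safeguard data quality and improve diagnosis under dynamic, incomplete information.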
Abstract: Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model’s hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.
[17] Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CLPDF
Yi-Chang Chen, Feng-Ting Liao, Da-shan Shiu, Hung-yi Lee
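TL;DR: This paper applies systematic interventions (removal, masking, shuffling, noise injection) to the dense, sequential chains of thought produced by reasoning language models, and finds that answer extraction actually relies on a sparse, order-insensitive, structurally robust subset of the information rather than on a dense ordered chain as conventionally assumed.
Details
Motivation: Challenges the assumption implicit in modern reasoning language models that every chain-of-thought token matters and that steps must be consumed in order, and measures how much answer extraction really depends on order and density.
Result: Across three models and three benchmarks, line-level shuffling lowers accuracy by less than 0.5 percentage points and word-level shuffling retains 62%-89% accuracy. Masking digits drops accuracy to 0%, whereas masking alphabetic prose raises it by 4.7 percentage points. The most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still reaches 83% accuracy.
Insight: Answer extraction operates on a sparse, order-insensitive subset of the information, a property that originates in pretraining rather than reasoning-specific fine-tuning. This points toward parallelized and token-efficient reasoning generation and questions whether conventional chain-of-thought design is necessary.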
Abstract: Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline–removal, masking, shuffling, and noise injection–applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No–line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%-89% accuracy; only token-level shuffling collapses to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating order-independence originates from pretraining rather than reasoning-specific fine-tuning. Dense: Is all the information in a reasoning chain important for answer extraction? No–masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffling and non-dense still robust? Yes–the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x true-answer frequency leaves accuracy unchanged (83.3%->83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.
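The shuffling and masking interventions described in the abstract above operate on plain text, so they can be illustrated in a few lines. The following is a minimal sketch, not the authors' pipeline: the function names, the whitespace-based notion of a "word", and the mask characters are all assumptions made for illustration.

```python
import random
import re


def shuffle_lines(chain: str, seed: int = 0) -> str:
    """Line-level shuffle: permute whole lines, leaving each line intact."""
    lines = chain.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)


def shuffle_words(chain: str, seed: int = 0) -> str:
    """Word-level shuffle: permute whitespace-separated words across the chain."""
    words = chain.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def mask_digits(chain: str, mask: str = "#") -> str:
    """Replace every decimal digit with a mask character (hypothetical choice)."""
    return re.sub(r"\d", mask, chain)


def mask_letters(chain: str, mask: str = "_") -> str:
    """Replace every ASCII letter with a mask character, keeping numbers and symbols."""
    return re.sub(r"[A-Za-z]", mask, chain)


# A made-up reasoning chain, used only to exercise the functions.
chain = "Step 1: 12 apples plus 30 apples.\nStep 2: 12 + 30 = 42.\nAnswer: 42"

for name, variant in [
    ("line shuffle", shuffle_lines(chain)),
    ("word shuffle", shuffle_words(chain)),
    ("digits masked", mask_digits(chain)),
    ("letters masked", mask_letters(chain)),
]:
    print(f"--- {name} ---\n{variant}")
```

Each perturbed chain would then be fed back to the model in place of the original trace and the extracted answer compared with the reference. Note that the letter-masked variant keeps every number while the digit-masked variant keeps none, which is the contrast the reported 0% versus +4.7 pp result turns on.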
[18] LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification cs.CLPDF
Xuan Li, Yining Wang, Yuchen Liu, Guanjun Liu, Delai Qiu
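TL;DR: LaTER proposes a two-stage reasoning paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit chain-of-thought for verification and answer generation, substantially reducing the tokens needed for reasoning while matching or improving accuracy.
Details
Motivation: Addresses the high inference cost of chain-of-thought reasoning while avoiding the performance loss that purely latent reasoning shows on tasks requiring symbolic checking.
Result: On Qwen3-14B, training-free LaTER cuts token usage by 16%-32% on several benchmarks while matching or improving accuracy (for example, AIME 2025 accuracy rises from 70.0% to 73.3% while tokens fall from 15,730 to 10,661). Fine-tuned LaTER reaches 80.0% on AIME 2025, 10 points above the standard CoT baseline, with 33% fewer tokens.
Insight: The novelty is decoupling reasoning into latent exploration and explicit verification, and using the model's own hidden states together with entropy and native stop-token probes to make the switching decision without training. The method suggests that strong reasoning models already have structured trajectories in latent space, which offers a new route to efficient reasoning.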
Abstract: Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.
[19] Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study cs.CLPDF
Yash Ganpat Sawant
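TL;DR: This empirical study finds that gradient-based adaptive LoRA rank allocation, which works under supervised fine-tuning (SFT), fails under reinforcement learning, specifically GRPO. With Qwen 2.5 1.5B on GSM8K, allocating rank in proportion to gradient importance lowers accuracy by 4.5 points. The study shows that the gradient landscape under GRPO is flatter and that non-uniform allocation triggers a positive feedback loop of gradient amplification.
Details
Motivation: Tests whether adaptive LoRA rank allocation that succeeds under SFT (assigning different ranks to layers according to gradient importance) transfers to RL alignment training, specifically GRPO, to improve efficiency.
Result: With Qwen 2.5 1.5B on GSM8K, proportional rank allocation lowers accuracy from 74.5% (uniform allocation) to 70.0%. The max-to-min layer importance ratio under GRPO is only 2.17x, far below the >10x reported in the SFT literature.
Insight: The main contribution is a systematic study of whether SFT-era adaptive rank allocation applies to RL alignment training, identifying two mechanisms behind its failure: the gradient landscape under GRPO is flatter (every layer carries useful signal), and non-uniform allocation creates a positive feedback loop of gradient amplification that widens the imbalance between layers. Gradient importance under RL therefore does not predict capacity requirements, and SFT strategies should not be transferred naively.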
Abstract: Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in the SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
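To make "proportional rank allocation under an identical parameter budget" concrete, here is a minimal sketch of one way to split a fixed total rank across layers in proportion to an importance score. It is an illustration under stated assumptions, not the paper's procedure: the helper names `allocate_ranks` and `spread`, the per-layer minimum rank, the largest-remainder rounding, and the importance values are all hypothetical (the values are chosen only so that their max-to-min ratio equals the 2.17x figure quoted above).

```python
def allocate_ranks(importance, total_rank, min_rank=1):
    """Split total_rank across layers in proportion to importance.

    Every layer first receives min_rank; the remaining budget is divided
    proportionally, using largest-remainder rounding so the ranks sum to
    exactly total_rank.
    """
    n = len(importance)
    if total_rank < n * min_rank:
        raise ValueError("budget too small for the requested minimum rank")
    spare = total_rank - n * min_rank
    total_importance = sum(importance)
    shares = [spare * w / total_importance for w in importance]
    ranks = [min_rank + int(s) for s in shares]
    leftover = total_rank - sum(ranks)
    # Hand the leftover units to the layers with the largest fractional parts.
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks


def spread(values):
    """Max-to-min ratio, the flatness measure quoted in the abstract."""
    return max(values) / min(values)


# Hypothetical per-layer gradient-magnitude scores for a 4-layer toy model.
importance = [1.0, 1.5, 2.17, 1.2]
budget = 32  # same total as a uniform rank of 8 per layer

print("importance spread:", spread(importance))
print("proportional ranks:", allocate_ranks(importance, budget))
print("uniform ranks:     ", allocate_ranks([1.0] * len(importance), budget))
```

With a spread this small, the proportional scheme still moves rank away from layers that carry useful gradient signal, which is consistent with the abstract's argument that uniform allocation is the safer default under GRPO.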
[20] Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance cs.CLPDF
Jiachen Yu, Zhihao Xu, Junjie Wang, Yujiu Yang
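TL;DR: This paper proposes Think-with-Rubrics, a new paradigm for instruction-following tasks. It integrates rubric generation into the reasoning context, turning the rubric from a standalone external evaluation tool into internal guidance for LLM generation. During training the model generates a rubric and then a response, and a trained rubric verifier provides joint supervision by checking the answer's consistency with the self-generated and golden rubrics.
Details
Motivation: Existing frameworks treat rubrics only as an external evaluator separate from the policy's main reasoning trace. This confines rubrics to post-hoc measurement and prevents them from actively guiding generation.
Result: Across multiple benchmarks, Think-with-Rubrics outperforms the Rubric-as-Reward baseline supervised by golden rubrics by 3.87 points on average.
Insight: The core idea is to internalize rubric generation as part of reasoning and to improve performance through joint supervision: golden-rubric supervision raises the quality of self-generated rubrics, while self-generated-rubric supervision increases the internal consistency of responses.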
Abstract: Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as an external evaluator disjoint from the policy’s primary reasoning trace. Such a design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model’s generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction-following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into internal guidance for the LLM’s generation. During training, the LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We also discuss the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and from self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and increasing the internal consistency of responses, respectively.
[21] SEIF: Self-Evolving Reinforcement Learning for Instruction Following cs.CLPDF
Qingyu Ren, Qianyu He, Jiajie Zhu, Xingzhou Chen, Jingwen Chang
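TL;DR: This paper proposes SEIF, a self-evolving reinforcement learning framework for continuously improving the instruction-following ability of LLMs. It builds a closed loop of four roles (instruction generation, filtering, following, and judging) in which instruction difficulty and model capability co-evolve, without relying on costly external supervision or static-difficulty instructions.
Details
Motivation: Existing methods for improving instruction following either depend on costly supervision from humans or strong teacher models, or use self-play with static-difficulty instructions that cannot evolve as the model improves. SEIF aims to overcome both limits and let model capability evolve autonomously and continuously.
Result: Experiments on models of several scales and architectures show that SEIF consistently improves instruction-following performance, indicating strong generality. Analysis also identifies an effective self-evolution training strategy: sufficient early-stage training to build a foundation, then moderate late-stage training to limit overfitting and reach better final performance.
Insight: The core contribution is a closed self-evolving system in which instruction difficulty and model capability reinforce each other. The four-role co-evolution framework (Instructor, Filter, Follower, Judger) and the "sufficient first, moderate later" training strategy offer a workable recipe for self-evolving learning on open-ended tasks.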
Abstract: Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model’s capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model’s instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.
[22] WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation cs.CLPDF
Zinan Zheng, Yang Liu, Nuo Chen, Juepeng Zheng, Hong Cheng
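TL;DR: This paper introduces the weather forecasting report generation task and builds WeatherSyn, the first instruction-tuning dataset for it, covering 31 U.S. cities and 8 weather aspects. Based on this dataset it develops WeatherSyn, the first model specialized in generating weather forecast reports. The model outperforms leading closed-source multimodal LLMs on multiple metrics, especially on structurally complex weather aspects, and generalizes strongly across regions.
Details
Motivation: Weather forecast reports are currently produced mainly by manual analysis of multi-source data, which causes information overload and low efficiency, while the use of data-driven multimodal LLMs for analysis and report generation in weather forecasting remains underexplored.
Result: On the authors' dataset, WeatherSyn consistently outperforms leading closed-source MLLMs across multiple evaluation metrics, particularly on structurally complex weather aspects, and shows strong zero-shot generalization across geographic regions.
Insight: The contributions are the first definition of the weather forecasting report generation task with a matching instruction-tuning dataset, and the first model specialized for it. The work underlines the value of domain-specific MLLMs, and the cross-region generalization suggests practical applicability.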
Abstract: Accurate weather forecast reporting enables individuals and communities to better plan daily activities and agricultural operations. However, the current reporting process primarily relies on manual analysis of multi-source data, which leads to information overload and reduced efficiency. With the development of multimodal large language models (MLLMs), leveraging data-driven models to analyze and generate reports in the weather forecasting domain remains largely underexplored. In this work, we propose the Weather Forecasting Report (WFR) task and construct the first instruction-tuning dataset for this task, named WeatherSyn, which covers 31 cities in America and 8 weather aspects. Based on this corpus, we develop the first model, WeatherSyn, specialized in generating weather forecast reports. Evaluation across multiple metrics on our dataset shows that WeatherSyn consistently outperforms leading closed-source MLLMs, particularly on structurally complex weather aspects. We further analyze its performance across diverse geographic regions and weather aspects. WeatherSyn demonstrates strong transferability across different regions, highlighting its zero-shot generalization capability. WeatherSyn offers valuable insight for developing MLLMs specialized in weather report generation.
[23] Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation cs.CLPDF
Shenbin Qian, Yves Scherrer
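TL;DR: This paper systematically analyzes the failure modes of large language models in machine translation. Evaluating 15 models on 22 language pairs, it finds that non-English-centric pairs yield lower translation quality. The authors propose Token Activation Rate (TAR), a metric that quantifies how effectively a model uses language-specific vocabulary, show that it correlates strongly with translation performance, and find that reasoning LLMs generate more tokens for low-TAR languages as a compensatory mechanism.
Details
Motivation: Prior work focuses on improving or benchmarking LLM translation quality but offers little analysis of when and why LLM translation fails, particularly for low-resource languages.
Result: In COMET scores over 22 language pairs, non-English-centric pairs score consistently lower than English-centric pairs. TAR is strongly positively associated with translation performance, and reasoning LLMs generate more tokens when translating into low-TAR languages.
Insight: The novelty is Token Activation Rate as a proxy for how well an LLM represents a language. It exposes the link between vocabulary utilization and translation performance and offers a new view of token-level dynamics in LLMs.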
Abstract: Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.
[24] Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation cs.CLPDF
Hongyang Su, Beibei Kong, Lei Cheng, Chengxiang Zhuo, Zang Li
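TL;DR: This paper proposes intent-driven Semantic ID generation to address the retrieval bottleneck in conversational news recommendation, where user intent is implicit and the news corpus changes quickly. Under a Generate-then-Match paradigm, two-stage training teaches an LLM to map diverse intents to hierarchical Semantic ID prefixes, which are then fuzzy-matched against the news pool so that recommendations are fully grounded.
Details
Motivation: In conversational news recommendation, user intents are mostly implicit and lack explicit keywords, so standard RAG pipelines hit a retrieve-first bottleneck and struggle to recommend accurately over a rapidly updating news corpus.
Result: On a mainstream Chinese news platform, the 7B model achieves a 0% hallucination rate and a 12.4% L1 match rate in the 152K open-generation Semantic ID space (4x the random baseline). It matches GPT-4+Hybrid RAG on L1 and surpasses it on finer-grained metrics (2x on L2, +1.2 percentage points on Category) at roughly 100x lower cost. Cold-start users reach an 18.0% L1 match rate (6x the random baseline), the highest of all user groups.
Insight: The novelty is replacing the retrieve-first pipeline with a Generate-then-Match paradigm of intent-driven hierarchical Semantic ID generation and fuzzy matching. Two-stage training (multi-task Semantic ID alignment, then GPT-4 chain-of-thought distillation) lets a smaller model encode implicit intent, and the proposed PADR mechanism produces valid recommendations for cold-start users from profiles alone, a setting in which existing baselines score 0%.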
Abstract: Conversational news recommendation requires grounding each suggestion in a rapidly evolving article corpus while addressing implicit user intents that lack explicit retrievable keywords. To characterize this scenario, we identify 6 intent types from production dialogues: five are implicit and pose fundamental challenges to standard RAG pipelines, forming a critical retrieve-first bottleneck. To address these issues, we introduce intent-driven Semantic ID (SID) generation under a Generate-then-Match paradigm. With two-stage training that consists of multi-task SID alignment and GPT-4 Chain-of-Thought distillation, an LLM maps diverse intents to hierarchical SID prefixes, which are then fuzzy-matched to the current news pool to guarantee fully grounded recommendations. Profile-Aware Dual-Signal Reasoning (PADR) further enables cold-start users to obtain valid recommendations using only profiles. On a mainstream Chinese news platform, our 7B model achieves 0% hallucination and 12.4% L1 match in the 152K open-generation SID space (4x random baseline). It matches GPT-4+Hybrid RAG on L1 while surpassing it on finer-grained metrics (L2 2x, Category +1.2pp) at ~100x lower cost. Cold-start users, where existing baselines score 0%, achieve 18.0% L1 (6x random), the highest among all user groups.
[25] MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing cs.CL | cs.AI | cs.LGPDF
Yinsheng Yao, Jiehao Tang, Zhaozhen Yang, Dawei Cheng
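TL;DR: This paper proposes MAVEN, a blackboard-inspired Multi-Agent Verification-Elaboration Network that turns LLMs into deliberate reasoners with explicit reasoning trajectories through role decoupling. It runs an adversarial Skeptic-Researcher-Judge loop that performs epistemic auditing within each reasoning step, producing modular, verifiable deliberation trajectories.
Details
Motivation: Existing explicit reasoning-trajectory methods usually rely on a single monolithic chain with no intermediate verification, so early errors cascade unchecked. The lack of modularity impedes fine-grained auditing and undermines the epistemic trust that high-stakes applications require.
Result: On the OpenBookQA, TruthfulQA, HALUEVAL, and StrategyQA benchmarks, MAVEN delivers superior reasoning quality on all four fine-grained metrics. It consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines such as ReConcile.
Insight: The core idea is explicit role decoupling through functional separation (Skeptic, Researcher, Judge), which simulates expert deliberation and introduces epistemic auditing inside the reasoning steps. Reasoning becomes structured, modular, and verifiable instead of relying on implicit internal states or post-hoc consensus. The framework is also fully model-agnostic and serves as a transferable reasoning booster across backbone models.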
Abstract: While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.
[26] Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning cs.CLPDF
Gengyang Li, Zheng-Fan Wu, Siqi Bao, Yunfang Wu
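TL;DR: This paper uses attention entropy to analyze how learning signals differ across tokens in reinforcement-learning post-training. Low-attention-entropy "anchor" tokens provide stable gradients but tend to plateau on hard tasks, while high-attention-entropy "explorer" tokens carry useful reasoning signal but train unstably. A dynamic entropy-aware reweighting method is proposed to improve model performance.
Details
Motivation: Aims to understand the heterogeneity of token-level learning signals in RL-based post-training and to show that uniform token averaging can hide meaningful differences in reasoning training.
Result: In the strongest setting, the dynamic entropy-aware soft-reweighting intervention raises the held-out average of Qwen3-8B-Base from 34.39 to 37.40.
Insight: Attention entropy can serve as an optimization-relevant structural indicator for distinguishing token types. The anchor-explorer spectrum offers a new basis for finer-grained token-level training strategies, and dynamic reweighting can exploit the heterogeneous signals to improve reasoning.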
Abstract: Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random 20 percent token subsets preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention improves Qwen3-8B-Base from 34.39 to 37.40 held-out average in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training.
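As a rough illustration of the quantity the abstract above is built on, the sketch below computes the Shannon entropy of each token's attention distribution and splits tokens into a low-entropy and a high-entropy group. It is a sketch under assumptions rather than the paper's implementation: how attention is aggregated over heads and layers is not specified in the abstract, so the input is taken to be a single already-aggregated matrix, and the `split_by_entropy` helper and its `frac` cutoff are hypothetical.

```python
import numpy as np


def attention_entropy(attn: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of an attention matrix.

    attn has shape (num_tokens, context_len); each row is assumed to be a
    probability distribution over context positions.
    """
    safe = np.clip(attn, 1e-12, 1.0)  # avoid log(0); zero entries contribute 0
    return -(attn * np.log(safe)).sum(axis=-1)


def split_by_entropy(entropy: np.ndarray, frac: float = 0.2):
    """Return (anchor_idx, explorer_idx): lowest- and highest-entropy tokens.

    frac is a hypothetical cutoff, not a value taken from the paper.
    """
    k = max(1, int(round(frac * len(entropy))))
    order = np.argsort(entropy)
    return order[:k], order[-k:]


# Toy attention rows: fully concentrated, fully diffuse, and in between.
attn = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
])
entropy = attention_entropy(attn)
anchors, explorers = split_by_entropy(entropy, frac=0.34)
print("entropy per token:", entropy)
print("anchor tokens:", anchors, "explorer tokens:", explorers)
```

A concentrated row has entropy near zero and a uniform row over n positions has entropy log(n), so the two groups correspond to the "anchor" and "explorer" ends of the spectrum the paper describes.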
[27] DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain cs.CL | cs.AIPDF
Hsuvas Borkakoty, Sebastian Pohl, Cheng Wang, Bei Chen, Yufang Hou
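TL;DR: This paper introduces DRIP-R, a benchmark for evaluating the decision-making and reasoning of LLM agents under real-world policy ambiguity in the retail domain. It constructs ambiguous return-policy scenarios with no single correct answer and combines them with realistic customer personas, a full-duplex conversational simulation with tool calling, and a multi-judge evaluation framework, showing that frontier models fundamentally disagree on identical ambiguous scenarios.
Details
Motivation: Existing agent benchmarks generally assume unambiguous, well-specified policies, whereas policies in real domains such as retail are inherently ambiguous and admit multiple valid interpretations. A systematic way to evaluate decisions under such ambiguity has been missing.
Result: Experiments show that frontier models reach fundamentally different decisions on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
Insight: DRIP-R is presented as the first retail benchmark built specifically around real-world policy ambiguity. Through carefully designed ambiguous scenarios, full-duplex conversational simulation, and multi-dimensional evaluation, it exposes inconsistent LLM decisions under ambiguous policies and offers a new way to evaluate and improve agent robustness in complex real-world settings.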
Abstract: LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
[28] Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models cs.CLPDF
Fan Zhou, Tim Van de Cruys
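TL;DR: This paper proposes dynamic control of the classifier-free guidance (CFG) scale in diffusion language models. It recasts CFG scale selection as a sequential decision problem and learns adaptive guidance trajectories with reinforcement learning (PPO) to improve the tradeoff between controllability and generation quality.
Details
Motivation: The CFG guidance scale is usually a fixed hyperparameter, which gives a suboptimal controllability-quality tradeoff across stages of the diffusion process and across tasks, especially in NLP. The scale therefore needs to be adjusted dynamically.
Result: On three controlled NLP generation tasks, the adaptive guidance policy balances controllability and generation quality better than fixed-scale methods, and the learned policies show interpretable trajectories that differ across tasks.
Insight: The novelty is treating the CFG scale as a dynamic control action conditioned on the diffusion state, not as a static hyperparameter, and optimizing the policy with reinforcement learning. This gives a more flexible, adaptive framework for controllable generation with diffusion models.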
Abstract: Classifier-Free Guidance (CFG) is a widely used mechanism for controlling diffusion-based generative models, yet its guidance scale is typically treated as a fixed hyperparameter throughout generation. This static design yields a suboptimal tradeoff between controllability and quality, as the optimal degree of guidance varies across tasks and across different stages of the diffusion process, especially in the NLP domain. We recast CFG scale selection as a sequential decision-making problem and propose to learn dynamic guidance trajectories via reinforcement learning. Specifically, we model the guidance scale as a discrete control action selected at each generation step based on the evolving diffusion state, and optimize a policy using Proximal Policy Optimization (PPO) under task-level rewards. Experiments on three controlled NLP generation tasks using discrete diffusion language models demonstrate that adaptive guidance consistently achieves a better balance between controllability and generation quality than fixed-scale strategies. Further analysis of the learned policies reveals distinct and interpretable guidance trajectories across tasks, underscoring the importance of treating guidance as a dynamic control process rather than a static design choice.
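The difference between a fixed guidance scale and a per-step one can be sketched as follows. This is only an illustration under assumptions: it uses the common CFG combination rule `uncond + scale * (cond - uncond)` applied to logits, a hypothetical discrete action set `SCALES`, and a hand-written `toy_policy` standing in for the PPO-trained policy, whose actual inputs and architecture the abstract does not describe.

```python
import numpy as np

# Hypothetical discrete action set of guidance scales.
SCALES = (0.0, 1.0, 2.0, 4.0)


def cfg_combine(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Common CFG rule: scale 0 gives the unconditional prediction,
    scale 1 the conditional one, and scale > 1 extrapolates past it."""
    return uncond + scale * (cond - uncond)


def toy_policy(step: int, total_steps: int) -> int:
    """Stand-in for a learned policy: returns an index into SCALES.

    The schedule here (stronger guidance early, weaker late) is arbitrary
    and is not a finding of the paper.
    """
    progress = step / max(1, total_steps - 1)
    if progress < 0.3:
        return 3
    if progress < 0.7:
        return 2
    return 1


def guided_trajectory(cond_seq, uncond_seq):
    """Apply a per-step guidance scale chosen by the policy at each step."""
    total = len(cond_seq)
    outputs, chosen = [], []
    for step, (cond, uncond) in enumerate(zip(cond_seq, uncond_seq)):
        scale = SCALES[toy_policy(step, total)]
        chosen.append(scale)
        outputs.append(cfg_combine(cond, uncond, scale))
    return outputs, chosen


# Random stand-ins for conditional / unconditional model outputs over 10 steps.
rng = np.random.default_rng(0)
cond_seq = [rng.normal(size=5) for _ in range(10)]
uncond_seq = [rng.normal(size=5) for _ in range(10)]
outputs, chosen = guided_trajectory(cond_seq, uncond_seq)
print("scale per step:", chosen)
```

In a learned version, `toy_policy` would be replaced by a network that observes the evolving diffusion state and is trained with PPO against a task-level reward; a fixed-scale baseline corresponds to a policy that always returns the same index.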
[29] SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CLPDF
Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng
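TL;DR: This paper proposes SimCT to address the loss of supervision in on-policy distillation when teacher and student use different tokenizers. SimCT compares teacher and student predictions over short multi-token continuations, recovering the supervision that is discarded because of tokenization mismatch and thereby improving distillation.
Details
Motivation: Conventional on-policy distillation assumes that teacher and student share a tokenizer so predictions can be compared token by token. When tokenizers differ, this assumption fails and a large share of the teacher signal is discarded at positions where the vocabularies disagree.
Result: On mathematical reasoning and code-generation benchmarks with three heterogeneous teacher-student pairs, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines. Ablations confirm that the improvement comes from recovering supervision discarded by exact shared-token matching.
Insight: The novelty is enlarging the supervision space from shared tokens to short multi-token continuations, which form the finest jointly tokenizable supervision interface. Coarser alternatives would remove teacher-student distinctions that are useful for on-policy learning, so this choice recovers the lost supervision effectively.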
Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-.
[30] Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL | cs.AI | cs.LGPDF
Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi
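TL;DR: This paper proposes MELT (Memory-Efficient Looped Transformer), a looped Transformer architecture that addresses the linear growth of memory with reasoning depth in looped language models. By keeping a single KV cache per layer that is shared across loops and updated through a learnable gate, MELT decouples memory consumption from reasoning depth.
Details
Motivation: In existing looped LLM architectures such as Ouro, the KV cache grows linearly with the number of reasoning iterations, so memory consumption becomes excessive and limits practical scalability.
Result: MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size while keeping a memory footprint comparable to those models and far smaller than Ouro's.
Insight: The core idea is to replace per-layer, per-loop KV caches with one per-layer cache shared across loops, and to train stably and efficiently with a two-phase chunk-wise procedure (interpolated transition, then attention-aligned distillation). This yields constant-memory iterative reasoning without sacrificing looped-model performance.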
Abstract: Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two-phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro’s. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
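The memory argument in the abstract above can be illustrated with a toy gated cache. This is a sketch under assumptions, not MELT's actual update rule: the abstract only says the shared cache is "updated over time via a learnable gating mechanism", so the sigmoid-gated interpolation below, the `gated_update` name, and the random stand-ins for the candidate entries and gate logits are all hypothetical.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def gated_update(cache: np.ndarray, candidate: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend this loop's candidate entries into the shared cache.

    A gate near 1 overwrites the cache with the candidate; a gate near 0
    keeps the previous contents. The output has the same shape as the cache.
    """
    gate = sigmoid(gate_logits)
    return gate * candidate + (1.0 - gate) * cache


rng = np.random.default_rng(0)
seq_len, head_dim, num_loops = 4, 8, 6

# One shared cache for a single layer (keys or values), reused by every loop.
cache = np.zeros((seq_len, head_dim))
for loop in range(num_loops):
    candidate = rng.normal(size=(seq_len, head_dim))  # stand-in for this loop's K or V
    gate_logits = rng.normal(size=(seq_len, 1))       # stand-in for a learned gate
    cache = gated_update(cache, candidate, gate_logits)

# A per-loop cache would hold num_loops * seq_len * head_dim numbers;
# the shared cache stays at seq_len * head_dim regardless of num_loops.
print("shared cache shape:", cache.shape)
print("per-loop cache entries:", num_loops * seq_len * head_dim, "vs shared:", cache.size)
```

Whatever form the real gate takes, the point the sketch makes is structural: because each loop writes into the same buffer, memory is constant in the number of reasoning iterations.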
[31] SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL | cs.AIPDF
Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun
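TL;DR: This paper proposes SOD (Step-wise On-policy Distillation), a framework for transferring tool-integrated reasoning (TIR) to small language models, where conventional on-policy distillation (OPD) breaks down because cascading errors make the teacher's supervision unreliable. SOD adaptively adjusts the distillation strength at each reasoning step according to step-level divergence, attenuating potentially misleading teacher signals in high-divergence regions while keeping dense guidance in well-aligned states.
Details
Motivation: Tool-integrated reasoning is hard to scale to small language models because long-horizon tool interactions are unstable and model capacity is limited. Reinforcement learning methods such as GRPO provide only sparse outcome-level rewards, and the popular OPD approach shows a critical failure mode on TIR: erroneous tool calls cascade through later reasoning steps, progressively widening the student-teacher divergence and making the teacher's token-level supervision unreliable.
Result: On challenging math, science, and code benchmarks, SOD improves on the second-best baseline by up to 20.86%. A 0.6B-parameter student reaches 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models.
Insight: The paper's claimed novelty is the SOD framework, centered on step-level adaptive reweighting of distillation strength to manage the reliability of teacher supervision dynamically. Viewed objectively, this is a meaningful stability improvement for distillation in sequential decision tasks, especially those with external tool calls: local divergence awareness limits error propagation and supports deploying complex reasoning agents on resource-constrained models.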
Abstract: Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher’s token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. SOD can thus attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
[32] TextLDM: Language Modeling with Continuous Latent Diffusion cs.CLPDF
Jiaxiu Jiang, Jingjing Ren, Wenbo Li, Bo Wang, Haoze Sun
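TL;DR: This paper proposes TextLDM, which transfers the latent diffusion recipe used with Diffusion Transformers (DiT) in vision to text generation. A Transformer-based VAE maps discrete text tokens into a continuous latent space, Representation Alignment (REPA) improves representation quality, and a standard DiT then generates via flow matching. Trained from scratch on OpenWebText2, it outperforms prior diffusion language models and matches GPT-2.
Details
Motivation: The goal is a unified architecture for generation and understanding. Visual DiTs have succeeded at image and video generation, and this paper extends them to text generation as a step toward a single diffusion architecture across modalities (visual synthesis and text generation).
Result: Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings.
Insight: The novelty lies in transferring the visual DiT framework to text generation. The key step is aligning latent features with a pretrained language model via REPA, which resolves the quality problem of continuous text representations and provides a concrete step toward unified multimodal diffusion architectures.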
Abstract: Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
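As a rough illustration of flow matching in a latent space, the sketch below uses the common straight-line (rectified-flow style) path between a noise sample and a data latent. The abstract does not say which probability path TextLDM uses, so the linear path, the convention that `x0` is noise and `x1` is the VAE latent, and all function names are assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data latent x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for this choice of path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a model's predicted velocity and the target."""
    target = target_velocity(x0, x1)
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target)]
    return sum(squared) / len(squared)
```

In training, a network would receive `interpolate(x0, x1, t)` and `t` and be fit to minimize `flow_matching_loss`; the network itself is omitted here.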
[33] Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs cs.CL | cs.AI | cs.LGPDF
Sree Bhattacharyya, Samarth Khanna, Leona Chen, Lucas Craig, Tharun Dilliraj
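TL;DR: Drawing on cognitive appraisal theory, this paper proposes a multidimensional approach to self-assessment that makes performance prediction for large language models (LLMs) more reliable. It elicits six appraisal-based dimensions (such as effort and ability) alongside confidence and evaluates how well each predicts model failure across 12 LLMs and 38 tasks. Competence-related dimensions, particularly effort and ability, match or outperform confidence in most settings, and effort yields more stable, less overoptimistic estimates.
Details
Motivation: LLM self-assessment is unreliable in critical applications. Confidence in particular is an inconsistent and overoptimistic predictor, so a more comprehensive assessment framework is needed.
Result: Across 12 LLMs and 38 tasks spanning eight domains, competence-related dimensions (effort and ability) match or outperform confidence in most settings, and effort remains stable across model sizes. The most informative dimension depends on task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks.
Insight: The novelty lies in bringing cognitive appraisal theory from psychology to model self-assessment, decomposing it into multiple dimensions (such as effort and ability) to obtain a structured method that can improve the reliability and safety of model deployment. Viewed objectively, by going beyond a single confidence score the approach offers failure predictors better matched to different task types.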
Abstract: Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
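One common way to quantify how well a self-assessment signal predicts failure is AUROC, sketched below. The abstract does not state which metric the authors use, so treating AUROC as the measure, and the convention that a higher score means the model expects to fail, are both assumptions for illustration.

```python
def auroc(scores, failed):
    """Probability that a failed item gets a higher score than a solved one.

    scores: self-assessment values where higher means "more likely to fail"
            (a hypothetical convention, e.g. reported effort).
    failed: 1 if the model answered incorrectly, 0 otherwise.
    Ties between a failed and a solved item count as half a win.
    """
    positives = [s for s, y in zip(scores, failed) if y == 1]
    negatives = [s for s, y in zip(scores, failed) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one failed and one solved item")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Computing this once per dimension (effort, ability, confidence, and so on) on the same items gives directly comparable numbers, with 0.5 meaning the signal is uninformative.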
[34] CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers cs.CL | cs.AIPDF
Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang
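TL;DR: This paper presents CoCoReviewBench, a completeness- and correctness-oriented benchmark for evaluating AI reviewers. It builds category-specific benchmark subsets and uses reviewer-author-meta-review discussions as expert annotations to filter out unreliable human reviews, strengthening both completeness and correctness. Built from 3,900 ICLR and NeurIPS papers, the benchmark aims at reliable, fine-grained evaluation. The analysis shows that current AI reviewers remain limited in correctness and prone to hallucination, and that reasoning models are more effective reviewers, which points to directions for improvement.
Details
Motivation: Existing evaluations of AI reviewers rely too heavily on overlap with human reviews, yet human reviews often cover only a subset of the key issues and may contain errors, so they are unreliable as gold references. A benchmark that emphasizes completeness and correctness is needed.
Result: On CoCoReviewBench, AI reviewers show limited correctness and are prone to hallucination, while reasoning models prove to be more effective reviewers. The benchmark provides a 3,900-paper dataset that supports fine-grained evaluation.
Insight: The novelty lies in using category-specific subsets and expert annotations drawn from review discussions to strengthen completeness and correctness, avoiding the over-reliance of earlier methods on imperfect human reviews. Viewed objectively, the work gives AI-reviewer evaluation a more reliable benchmark and highlights the potential of reasoning models.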
Abstract: Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
[35] CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation cs.CL | cs.AIPDF
James Petullo, Nianwen Xue
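TL;DR: This paper proposes CA-SQL, a Text-to-SQL inference pipeline that dynamically adjusts exploration breadth, seeds prompts using principles from evolutionary search, and selects the final query with a new voting mechanism, achieving state-of-the-art performance on the challenging tier of the BIRD benchmark.
Details
Motivation: Current inference-time learning methods perform poorly on the hardest Text-to-SQL tasks in BIRD, mainly because they explore the solution space too little to find candidate queries that could be refined into correct outputs.
Result: On the "challenging" tier of the BIRD development set, CA-SQL reaches a state-of-the-art 51.72% using only GPT-4o-mini, outperforming other in-context learning approaches. On the full BIRD development set it attains 61.06% execution accuracy and a 68.77% Soft F1 score.
Insight: The novelty lies in scaling the exploration breadth of candidate generation with the estimated difficulty of the task, combined with a custom prompt seeding method based on evolutionary search that elicits exploratory behavior from the base LLM and a new voting method that selects the final candidate. Together these allocate the compute budget effectively.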
Abstract: While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that uses the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the “challenging” tier of BIRD development set problems, using only GPT-4o-mini, outperforming other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
[36] The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents cs.CL | cs.AI | cs.GT | cs.MAPDF
Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng
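TL;DR: This study finds that expanding the context window of large language model (LLM) agents fails systematically in multi-agent social dilemmas: a longer accessible history undermines cooperation, a phenomenon the authors call the "memory curse". Three analyses expose the mechanism: memory content, rather than length itself, erodes forward-looking intent, and chain-of-thought reasoning amplifies the effect.
Details
Motivation: Context window expansion is usually treated as a capability upgrade for LLMs. The paper asks whether it actually helps in multi-agent social dilemmas, and what negative effects and underlying mechanisms it may have.
Result: Across 7 LLMs and 4 games over 500 rounds, expanding accessible history lowers cooperation in 18 of 28 model-game settings. A fine-tuning intervention (a LoRA adapter) and a memory content replacement experiment (memory sanitization) validate the mechanism and partly restore cooperation.
Insight: The core finding is that memory length is not neutral for multi-agent behavior but an active determinant whose effect depends on the reasoning patterns it elicits. Specifically, erosion of forward-looking intent drives the collapse of cooperation, and chain-of-thought reasoning aggravates the memory curse. This is useful guidance for designing robust multi-agent systems.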
Abstract: Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model–game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.
[37] Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration cs.CLPDF
Shuhang Lin, Chuhao Zhou, Xiao Lin, Zihan Dong, Kuan Lu
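TL;DR: This paper proposes Conformal Path Reasoning (CPR), a trustworthy knowledge graph question answering framework that provides statistical coverage guarantees through path-level calibration. It performs query-level conformal calibration over path-level scores and introduces the Residual Conformal Value Network (RCVNet) to learn discriminative path nonconformity scores, keeping coverage while substantially shrinking prediction sets.
Details
Motivation: Existing knowledge graph question answering methods lack reliable coverage guarantees over retrieved answers, and prior conformal prediction methods are limited in both calibration validity and score discriminability, which leads to violated coverage guarantees and overly large prediction sets.
Result: On benchmarks, CPR improves the empirical coverage rate by 34% over conformal baselines while reducing the average prediction set size by 40%, showing that it satisfies coverage guarantees with more compact answer sets.
Insight: The contributions are query-level conformal calibration over path-level scores, which preserves exchangeability, and the lightweight RCVNet module, which learns discriminative path scores through PUCT-guided exploration. Together they give trustworthy KGQA a scalable framework with statistical guarantees.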
Abstract: Knowledge Graph Question Answering (KGQA) has shown promise for grounded and interpretable reasoning, yet existing approaches often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform query-level conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.
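For readers unfamiliar with conformal prediction, the sketch below shows the generic split-conformal recipe that such methods build on: calibrate a threshold on held-out nonconformity scores, then keep every candidate whose score falls under it. It is not CPR's procedure; how CPR defines one calibration score per query and how RCVNet produces path scores are not specified in the abstract, and the function names and example path labels are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample split-conformal quantile at miscoverage level alpha.

    calibration_scores: one nonconformity score per calibration query
    (lower means the correct answer conformed better).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include everything.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate path whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]
```

Under exchangeability of calibration and test queries, sets built this way contain the correct answer with probability at least `1 - alpha`; a more discriminative score makes the sets smaller without changing that guarantee.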
[38] LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling cs.CLPDF
Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang
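TL;DR: This paper proposes AutoTTS, a framework that automatically discovers test-time scaling (TTS) strategies for allocating inference compute in large language models. It shifts the research focus from hand-designing heuristics to building environments in which TTS strategies can be discovered automatically, and it outperforms manually designed baselines on mathematical reasoning benchmarks.
Details
Motivation: Existing test-time scaling strategies are mostly hand-crafted, which limits the space explored and is inefficient. The paper aims to discover better compute-allocation strategies more efficiently with an automated framework.
Result: On mathematical reasoning benchmarks, the discovered strategies achieve a better accuracy-cost tradeoff than strong manually designed baselines and generalize to held-out benchmarks and model scales. The entire discovery process costs only $39.9 and 160 minutes.
Insight: The novelty lies in formulating TTS strategy discovery as environment-driven controller synthesis. Environment construction (width-depth TTS, beta parameterization, and fine-grained execution trace feedback) makes the search space tractable and the feedback cheap, enabling low-cost, automated strategy optimization.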
Abstract: Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width–depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy–cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
cs.CV [Back]
[39] Visual Text Compression as Measure Transport cs.CV | cs.AIPDF
Lv Tang, Tianyi Zheng, Yang Liu, Bo Li, Xingyu Li
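TL;DR: This paper analyzes visual text compression (VTC) through measure transport theory. It treats text and visual tokens as empirical probability measures, models the visual encoder as a push-forward map, and decomposes its transport cost into a precision cost and a coverage cost. Building on this analysis, it proposes a routing criterion that needs no downstream labels and a transport-informed foveation mechanism to improve the practical utility of VTC for long-context processing.
Details
Motivation: VTC can cut the number of decoder tokens substantially (3-20x), but the token savings do not reliably predict downstream performance: sometimes the visual path excels, sometimes it fails completely. A principled measure of the task-relevant information lost through visual encoding is needed to understand and predict VTC's utility.
Result: Evaluated on 24 NLP datasets with Qwen3-4B, the label-free routing criterion matches the per-dataset oracle on 17 of 24 datasets (70.8%). Relative to a pure-LLM baseline, it improves the average task score by +3.3% while using 10.3% fewer tokens on average.
Insight: The core contribution is formalizing VTC as a measure transport problem, which quantifies information loss (decomposed into precision and coverage costs) and leads to practical tools that need no downstream labels (the routing criterion and the foveation mechanism). This offers both a theoretical framework and actionable engineering methods for understanding and optimizing cross-modal vision-language processing.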
Abstract: Visual text compression (VTC) promises efficient long-context processing by rendering text into an image and re-encoding it with a vision-language model, often producing $3$–$20\times$ fewer decoder tokens than subword tokenization. Yet token savings do not translate predictably into downstream utility: on some tasks the visual path matches or exceeds the text path, on others it collapses, and the compression ratio itself does not predict which regime will occur. The missing quantity is therefore not another summary of efficiency, but a principled measure of task-relevant information loss induced by visual encoding. We address this problem by formulating VTC in the language of measure transport. Treating text and visual tokens as empirical probability measures, we show that the ViT patch encoder induces a push-forward map whose transport cost decomposes into a precision cost from within-patch aggregation and a coverage cost from cross-patch fragmentation. Both terms are estimable from downstream-label-free probes. This formulation yields two operational consequences: a downstream-label-free routing criterion that selects whether to use the visual path for a given input or benchmark instance, and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across $24$ NLP datasets at Qwen3-4B, our label-free rule matches the per-dataset oracle on $17/24$ datasets ($70.8\%$), and improves the average task score by $+3.3\%$ with $-10.3\%$ average tokens relative to a pure-LLM baseline.
[40] Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey cs.CV | cs.AIPDF
Yiwen Xu, Tariq M. Khan, Yang Song, Erik Meijering
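TL;DR: This comprehensive survey covers edge deep learning in computer vision and medical diagnostics. It presents the foundational principles and technical advantages of edge deep learning, a categorization of hardware platforms, model deployment methods, and practical applications, and it discusses future directions and challenges.
Details
Motivation: Edge deep learning combines the strengths of edge computing and deep learning to enable real-time decision making that adapts to environmental factors. It addresses the latency, privacy, and bandwidth limits of conventional cloud computing, especially in computer vision and medical diagnostics, where real-time response matters.
Result: As a survey, the paper reports no quantitative experiments. It summarizes recent progress and practical impact of edge deep learning in computer vision and medical diagnostics and emphasizes its potential in real-world settings.
Insight: The contributions include a new categorization of edge hardware platforms by performance and usage scenario, which helps with platform selection and operational effectiveness, and a systematic review of techniques for deploying deep neural networks on edge devices (such as lightweight design and model compression), giving researchers and practitioners a comprehensive reference.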
Abstract: Edge deep learning, a paradigm change reconciling edge computing and deep learning, facilitates real-time decision making attuned to environmental factors through the close integration of computational resources and data sources. Here we provide a comprehensive review of the current state of the art in edge deep learning, focusing on computer vision applications, in particular medical diagnostics. An overview of the foundational principles and technical advantages of edge deep learning is presented, emphasising the capacity of this technology to revolutionise a wide range of domains. Furthermore, we present a novel categorisation of edge hardware platforms based on performance and usage scenarios, facilitating platform selection and operational effectiveness. Following this, we dive into approaches to effectively implement deep neural networks on edge devices, encompassing methods such as lightweight design and model compression. Reviewing practical applications in the fields of computer vision in general and medical diagnostics in particular, we demonstrate the profound impact edge-deployed deep learning models can have in real-life situations. Finally, we provide an analysis of potential future directions and obstacles to the adoption of edge deep learning, with the intention to stimulate further investigations and advancements of intelligent edge deep learning solutions. This survey provides researchers and practitioners with a comprehensive reference shedding light on the critical role deep learning plays in the advancement of edge computing applications.
[41] HumanNet: Scaling Human-centric Video Learning to One Million Hours cs.CV | cs.ROPDF
Yufan Deng, Daquan Zhou
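TL;DR: HumanNet is a one-million-hour human-centric video dataset that supports learning for embodied intelligence with large-scale, multi-view, finely annotated videos of human activity. Beyond raw video, it provides interaction-centric annotations such as captions, motion descriptions, and hand and body signals to enable motion-aware and interaction-aware learning.
Details
Motivation: Progress in embodied intelligence is held back by the lack of large, diverse, and richly annotated human activity data, whereas vision and language have already scaled with internet corpora. HumanNet aims to fill this gap and provide scalable data infrastructure for learning physical interaction.
Result: Under a fixed set of validation data, continued training of the Qwen VLM model on 1,000 hours of egocentric video from HumanNet surpasses continued training on 100 hours of real-robot data from Magic Cobot, indicating that human-centric video could be a scalable and cost-effective substitute for robot data.
Insight: HumanNet introduces a systematic data curation paradigm that treats human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment as core design principles, turning unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. This suggests scaling embodied foundation models with human video rather than relying solely on robot data.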
Abstract: Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation on the value of this design through controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.
[42] R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations cs.CV | cs.AI | cs.LG | cs.ROPDF
Zhifeng Gu, Yuqi Wang, Bing Wang
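TL;DR: This paper proposes R³L, a framework that generates 3D layouts by making relative spatial reasoning more reliable and consistent. Its three core components (invariant spatial decomposition, consistent spatial imagination, and supportive spatial optimization) address the semantic and metric drift caused by errors that accumulate over reference-frame transformations in multi-hop reasoning, yielding layouts that are more physically feasible and semantically consistent.
Details
Motivation: Existing methods use multimodal large language models to infer relative spatial relations, but the inferred relations are unreliable and usually depend on post-hoc heuristics. Multi-hop reasoning requires repeated reference-frame transformations, which accumulate errors and cause semantic and metric drift, so relative spatial reasoning needs to become more reliable.
Result: Extensive experiments across diverse scene types and instructions show that R³L produces more physically feasible and semantically consistent layouts. The analysis indicates that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning.
Insight: The contributions are: 1) invariant spatial decomposition, which breaks coupled relation chains; 2) consistent spatial imagination, which promotes self-consistency through an imagine-and-revise loop; 3) supportive spatial optimization, which eases pose optimization via global-to-local coordinate re-parameterization. The core insight is that addressing error accumulation from reference-frame transformations is key to better multi-hop relative spatial reasoning.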
Abstract: Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post-hoc heuristics. In this paper, we propose R$^3$L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi-hop reasoning requires repeated reference-frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self-consistency through an imagine-and-revise loop. We further introduce supportive spatial optimization to ease pose optimization via global-to-local coordinate re-parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R$^3$L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning. The code is available at https://github.com/Neal2020GitHub/R3L.
[43] LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute cs.CV | cs.LGPDF
Ali Salamatian, Anthony Fuller, Pritam Sarkar, James R. Green, Leonid Sigal
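TL;DR: LookWhen is a selector-extractor framework for fast video recognition that factorizes the task into learning when, where, and what to compute. A shallow selector quickly scores the importance of spatiotemporal tokens, and a deep extractor processes only the selected top-K tokens to approximate the full-video representation, reducing redundant computation.
Details
Motivation: Transformers dominate video recognition, but processing video tokens is computationally expensive and redundant, so a more efficient method is needed to avoid unnecessary computation.
Result: On Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, LookWhen achieves a better accuracy-computation (FLOPs) tradeoff than efficient models and baselines of similar size. It Pareto-dominates in 9 of 12 cases and, in accuracy-throughput, is 6.7x faster than InternVideo2-B at equal accuracy.
Insight: The contributions are: 1) factorizing video recognition into selection and extraction stages, with token selection reducing redundant computation; 2) a token uniqueness score based on nearest-neighbor distance for selection pre-training; 3) distillation from both a video teacher and an image teacher for extraction pre-training, which learns what changes within a video. Together these strategies yield efficient and general video representations.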
Abstract: Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, LookWhen is more efficient still at 6.7x faster than InternVideo2-B at equal accuracy.
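The nearest-neighbor uniqueness score can be pictured with a small sketch. According to the abstract, this score supervises selector pre-training; applying it directly to pick the top-K tokens, as done here, is a simplification for illustration, and the Euclidean distance, the function names, and the toy 2-D token vectors are assumptions.

```python
import math


def uniqueness_scores(tokens):
    """Score each token by the distance to its nearest other token.

    Tokens that sit close to another token (redundant content) score low;
    isolated tokens (distinctive content) score high. Needs >= 2 tokens.
    """
    scores = []
    for i, token in enumerate(tokens):
        nearest = min(
            math.dist(token, other) for j, other in enumerate(tokens) if j != i
        )
        scores.append(nearest)
    return scores


def select_top_k(tokens, k):
    """Return the indices of the k most unique tokens, in original order."""
    scores = uniqueness_scores(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

The pairwise loop is quadratic in the number of tokens, which is fine for a toy example; a practical selector would need something cheaper, which is presumably why the paper trains a shallow network to predict the scores instead.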
[44] Knowledge Transfer Scaling Laws for 3D Medical Imaging cs.CV | cs.AI | cs.LGPDF
Ho Hin Lee, Dongna Du, Chu Wang, Yuankai Huo, Shi Gu
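TL;DR: This paper studies how knowledge transfers between imaging modalities (such as CT, MRI, and PET) during pretraining of vision foundation models for 3D medical imaging. It finds that transfer is asymmetric and that both loss and transfer follow predictable power-law trends. On this basis, the authors formulate data allocation as a scaling-law optimization problem and propose a transfer-aware data mixing strategy.
Details
Motivation: Building a unified foundation model for 3D medical imaging requires mixing data from heterogeneous imaging domains, and existing mixture strategies are mostly heuristic. The authors study cross-domain knowledge transfer to guide more effective allocation of pretraining data.
Result: Transfer-aware allocation outperforms data-proportional sampling in pretraining by up to 58% and generalizes well to unseen budgets (r=0.989). Downstream validation on disease classification and organ/lesion segmentation confirms that the strategy yields stronger pretrained representations.
Insight: The contribution is a systematic observation and quantification of asymmetric knowledge transfer and power-law scaling across 3D medical imaging modalities. Formulating data allocation as an optimizable scaling-law problem reveals a "hub-and-island" structure in the allocations and gives multimodal foundation model pretraining both theoretical guidance and an efficient strategy.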
Abstract: Vision foundation models are increasingly moving beyond 2D to volumetric domains such as 3D medical imaging, where unified pretraining across different imaging modalities (i.e. CT, MRI, and PET) could provide foundational models for diverse clinical tasks. However, training such models requires mixing heterogeneous imaging domains, and current mixture strategies remain largely heuristic. In this work, we observe that different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric: training on one domain can substantially improve another, but the reverse may be much weaker. Interestingly, both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Motivated by these findings, we formulate data allocation as a scaling-law optimization problem. The derived allocations reveal an interpretable hub-and-island structure: highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation, while isolated domains act as islands requiring direct investment. Empirically, transfer-aware allocation outperforms data-proportional sampling by up to 58% and generalizes well to unseen budgets with r=0.989. Downstream validation on disease classification and organ/lesion segmentation further confirms that the derived transfer-aware mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks.
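A power-law trend of the kind mentioned in the abstract can be fitted with ordinary least squares in log-log space. The sketch below assumes the simplest form, loss = a * n^(-b) with no irreducible-loss term; the paper's exact parameterization, including how cross-domain transfer enters, is not given in the abstract, and the function names are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by linear regression on log(size), log(loss).

    Returns (a, b). Assumes all sizes and losses are positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, size):
    """Extrapolate the fitted law to a data budget not seen during fitting."""
    return a * size ** (-b)
```

Fitting one such curve per domain, and one per source-target pair for transfer, gives the ingredients an allocation optimizer would need; the optimization step itself is beyond this sketch.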
[45] Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation cs.CVPDF
Ernie Chu, Vishal M. Patel
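TL;DR: This paper proposes Heterogeneous Step Allocation (HSA), a training-free inference algorithm for efficient video generation. It assigns different denoising step budgets to spatiotemporal tokens according to their velocity dynamics, and it uses a KV-cache synchronization mechanism and a cached Euler update to keep global context while skipping inactive tokens efficiently, cutting compute substantially while preserving generation quality.
Details
Motivation: Standard Diffusion Transformer (DiT) inference applies the same number of denoising steps to every token, at immense computational cost. Since human vision ignores large amounts of redundant motion, the paper questions why models give every spatiotemporal token equal priority and aims to make inference more efficient.
Result: HSA is evaluated on the Wan-2 and LTX-2 models for text-to-video (T2V) and image-to-video (I2V) generation. It significantly outperforms previous state-of-the-art caching methods and the vanilla Flow Matching baseline, especially at aggressive acceleration (for example, 50% and 25% runtimes), and achieves a better quality-runtime Pareto frontier without expensive offline profiling.
Insight: The core contribution is an inference strategy that allocates steps heterogeneously according to token velocity dynamics, together with KV-cache synchronization and the cached Euler update. It accelerates video DiT inference without retraining by concentrating compute on informative dynamic regions and skipping static or redundant ones.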
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art video generation quality, but they incur immense computational cost because standard inference applies the same number of denoising steps uniformly to every token in the sequence. It is well known that human vision ignores vast amounts of redundant motion. Why, then, do our densest models treat every spatiotemporal token with equal priority? In this paper, we introduce Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns varying step budgets to different spatiotemporal tokens based on their velocity dynamics. To resolve the resulting sequence-length mismatch without sacrificing global context, HSA introduces a KV-cache synchronization mechanism that allows active tokens to attend to the full sequence while entirely bypassing inactive tokens. Furthermore, we derive a cached Euler update that advances the latent states of skipped tokens in a single operation without additional model evaluations. We evaluate HSA on the Wan-2 and LTX-2 models for both text-to-video (T2V) and image-to-video (I2V) generation. Our results demonstrate that HSA significantly outperforms previous state-of-the-art caching methods and the vanilla Flow Matching baseline, especially at aggressive acceleration regimes (e.g., 50% and 25% runtimes). Crucially, HSA achieves a superior quality-runtime Pareto frontier without the need for expensive offline profiling, robustly preserving structural integrity and generation quality even under tight computational budgets. Project page: https://ernestchu.github.io/hsa
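One plausible reading of the cached Euler update is that a skipped token reuses its last computed velocity, so several Euler steps collapse into a single update. The abstract does not give the derivation, so the constant-velocity assumption and the function names in this sketch are illustrative and not the paper's formulation.

```python
def euler_step(state, velocity, dt):
    """One explicit Euler step of the sampling ODE: x <- x + dt * v."""
    return [x + dt * v for x, v in zip(state, velocity)]


def cached_euler_skip(state, cached_velocity, step_sizes):
    """Advance a skipped token across several steps in one operation.

    Assumption: the token's velocity is held at its cached value over the
    skipped interval, so the steps sum to a single Euler update and no
    further model evaluations are needed for this token.
    """
    return euler_step(state, cached_velocity, sum(step_sizes))
```

Under that assumption the single update matches what step-by-step integration with the frozen velocity would produce; the approximation error comes entirely from how much the true velocity would have changed over the skipped interval.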
[46] Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge cs.CVPDF
Kirill Trapeznikov, Gabriel Mancino-Ball, Jonathan Li, Paul Cummer, Jai Aslam
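TL;DR: This paper describes the SAFE Synthetic Video Detection Challenge, held at the ICCV 2025 APAI workshop to evaluate and advance algorithms that distinguish real from synthetic video under fully blind conditions. The challenge has two main tasks: detecting synthetic video generated by diverse state-of-the-art models, and detecting synthetic content after common post-processing (such as resizing, re-compression, and motion blur). It uses 6,000 samples, 20 hours of video in total, drawn from 13 modern high-quality synthetic video models and 21 real video sources. The paper summarizes the challenge design, dataset construction, evaluation methodology, and results, and offers insights into the generalization and robustness of current detection methods.
Details
Motivation: The spread of generative video technology has intensified the need for reliable detection and characterization of synthetic media. The SAFE challenge was organized to evaluate and advance synthetic video detection methods.
Result: The challenge received over 600 submissions from 12 teams. Results show measurable progress in cross-generator generalization but persistent vulnerability to post-processing artifacts.
Insight: The contribution is a large, diverse, blind challenge that systematically assesses the state of synthetic video detection. It exposes concrete strengths and weaknesses of current methods in generalization (to unseen generators) and robustness (to post-processing operations), pointing to directions for future research.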
Abstract: The proliferation of generative video technologies has intensified the need for reliable methods to detect and characterize synthetic media. To address this challenge, we organized the SAFE: Synthetic Video Detection Challenge (https://safe-video-2025.dsri.org), co-located with the Authenticity and Provenance in the Age of Generative AI (APAI) Workshop at ICCV 2025. The competition invited participants to develop and evaluate algorithms capable of distinguishing real from synthetic videos under fully blind evaluation conditions, and it drew over 600 submissions from 12 teams over a 90-day span. Hosted on the Hugging Face platform, the challenge comprised two primary tasks: (1) detection of synthetic video content generated by diverse state-of-the-art models, and (2) detection of synthetic content following common post-processing operations such as resizing, re-compression, motion blur and others. The challenge data consisted of content generated by 13 modern high-quality synthetic video models, matched to real videos from 21 diverse and challenging sources, totaling 6,000 video samples and 20 hours of video. This paper describes the challenge design, dataset construction, evaluation methodology, and outcomes, offering insights into the generalization and robustness of contemporary synthetic video detection methods. Our findings highlight measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts.
[47] A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency cs.CV | cs.AIPDF
Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
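TL;DR: This paper proposes A²RD (Agentic Autoregressive Diffusion), an architecture that addresses semantic drift and narrative collapse in long video generation. It models long video synthesis as a closed-loop, segment-by-segment Retrieve-Synthesize-Refine-Update process that decouples creative synthesis from consistency enforcement, with three core components: multimodal video memory, adaptive segment generation, and hierarchical test-time self-improvement.
Details
Motivation: Existing methods suffer from semantic drift and narrative collapse as generated videos grow longer, making long-term consistency and coherence hard to maintain. The paper targets this fundamental challenge.
Result: On public benchmarks and the newly introduced LVBench-C benchmark (which includes non-linear entity and environment transitions), covering videos of one to ten minutes, A²RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate the gains and also note clear improvements in motion and transition smoothness.
Insight: The main contribution is framing long video synthesis as a closed-loop, segment-wise autoregressive agentic process. Decoupling synthesis from consistency enforcement, and using multimodal memory with hierarchical self-improvement to catch and correct errors before they propagate, keeps the video consistent over long horizons.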
Abstract: Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve–Synthesize–Refine–Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
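The closed-loop cycle described in the abstract can be sketched as plain control flow. This is only an illustrative toy, not the authors' implementation: `VideoMemory`, `synthesize`, `refine`, the integer quality score, and the rule of retrieving the most recent entries are all hypothetical stand-ins for the paper's components.

```python
from dataclasses import dataclass, field


@dataclass
class VideoMemory:
    """Hypothetical stand-in for a multimodal video memory."""
    entries: list = field(default_factory=list)

    def retrieve(self, prompt, k=2):
        # Toy retrieval: return the k most recent segments as context.
        return self.entries[-k:]

    def update(self, segment):
        self.entries.append(segment)


def synthesize(prompt, context):
    # Placeholder generator: a "segment" is a dict with a toy quality score (0-100).
    return {"prompt": prompt, "context_size": len(context), "quality": 50}


def refine(segment, max_rounds=3, target=90):
    # Placeholder self-improvement: raise the toy score until it reaches the target.
    rounds = 0
    while segment["quality"] < target and rounds < max_rounds:
        segment = {**segment, "quality": min(100, segment["quality"] + 20)}
        rounds += 1
    return segment


def generate_long_video(prompts):
    """Retrieve -> Synthesize -> Refine -> Update, one segment at a time."""
    memory = VideoMemory()
    video = []
    for prompt in prompts:
        context = memory.retrieve(prompt)    # Retrieve
        draft = synthesize(prompt, context)  # Synthesize
        final = refine(draft)                # Refine
        memory.update(final)                 # Update
        video.append(final)
    return video
```

Because each segment is refined before it enters memory, later segments only ever condition on corrected content, which is the intuition behind preventing error propagation.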
[48] Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment cs.CVPDF
Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung
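TL;DR: This paper proposes FuScore, a quality assessment method for infrared-visible image fusion (IVIF) built on a multimodal large language model (MLLM). It mimics human visual perception by producing continuous quality scores rather than discrete levels, enabling fine-grained discrimination among fused images of similar quality, and reaches SOTA correlation with human visual preferences.
Details
Motivation: Existing IVIF evaluation relies too heavily on hand-crafted no-reference statistics or on full-reference metrics that treat the source images as pseudo ground truth. Recent reward-modelling methods based on human ratings use only scalar regression on aggregated scores, so they neither exploit MLLM reasoning nor encode per-image perceptual ambiguity in the supervision. Applying discrete one-hot supervision directly, in turn, assigns fused images of similar quality to different rating levels.
Result: Extensive experiments show that FuScore achieves state-of-the-art (SOTA) correlation with human visual preferences across multiple benchmarks.
Insight: The contributions are: using an MLLM to produce continuous quality scores for fine-grained assessment; building a per-image soft label from the agreement among four IVIF-specific sub-dimensions, with the label's sharpness reflecting how consensual the overall judgment is; and a tripartite objective that combines per-image distributional supervision, within-source-pair Thurstone fidelity (for method-level ordering), and cross-source-pair Thurstone fidelity (for scene-level ordering).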
Abstract: Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision. Naively introducing MLLMs with discrete one-hot supervision likewise falls short, because it splits fused images of similar quality across different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing a continuous quality score rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
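The abstract does not spell out the Thurstone fidelity terms, so the following is a minimal sketch of one common formulation from the quality-assessment literature, offered as an assumption rather than the paper's exact loss: each image gets a Gaussian quality estimate, the probability that image A beats image B is a normal CDF of the standardized score gap, and a fidelity loss compares that probability with the human preference. The function names and the `eps` stabilizer are hypothetical.

```python
import math


def thurstone_prob(mu_a, var_a, mu_b, var_b, eps=1e-8):
    """P(A is preferred over B) for independent Gaussian quality scores."""
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b + eps)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fidelity_loss(p_true, p_pred, eps=1e-8):
    """Fidelity loss between a target and a predicted preference probability.

    It is (approximately) zero when the two probabilities match and grows
    towards one as they disagree.
    """
    return (
        1.0
        - math.sqrt(p_true * p_pred + eps)
        - math.sqrt((1.0 - p_true) * (1.0 - p_pred) + eps)
    )
```

Under this reading, a within-source-pair term would apply the loss to two fusion methods run on the same source pair, and a cross-source-pair term would apply it to fused images from different scenes.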
[49] TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations cs.CV | cs.LGPDF
Maria Despoina Siampou, Gengchen Mai, Ni Lao, Jinmeng Rao, Neha Arora
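TL;DR: This paper proposes TrajGANR, a trajectory-centric geospatial multimodal self-supervised learning framework. It aligns human mobility trajectories with static street-view images through a continuous neural representation, addressing the inability of existing methods to handle continuous trajectory data.
Details
Motivation: Existing geospatial multimodal self-supervised learning methods mainly target static modality pairs (such as satellite imagery, street-view imagery, and text). Their assumption of aligning observations from the same or nearby locations does not hold for human trajectories, which represent continuous movement along paths. Because trajectories are essential for understanding urban dynamics, a new framework is needed to align continuous movement patterns with static observations.
Result: On four urban mobility and road understanding tasks, TrajGANR consistently outperforms existing geospatial multimodal self-supervised learning frameworks and a trajectory-specific foundation model. Ablation studies show that the proposed multimodal self-supervised objective and framework are the main drivers of the gains.
Insight: The novelty lies in a continuous neural representation of trajectories, which enables fine-grained alignment between arbitrary points on a trajectory and nearby street-view images, and in jointly aligning three modalities: trajectories, street-view images, and geographic locations. Bringing continuous movement patterns into geospatial multimodal learning, and the emphasis on fine-grained geospatial alignment over coarse aggregation, are ideas worth borrowing.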
Abstract: Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.
[50] LensVLM: Selective Context Expansion for Compressed Visual Representation of Text cs.CV | cs.AIPDF
Roy Xie, Dan Friedman, Donghan Yu, Bowen Pan, Christopher Fifty
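TL;DR: This paper proposes LensVLM, an inference framework and post-training recipe for vision language models (VLMs). It targets the accuracy loss that occurs when VLMs read text rendered as images and compression pushes character resolution too low. The core idea is to let the model scan the compressed images and selectively expand only the relevant regions to their uncompressed, high-resolution form, maintaining accuracy at high compression ratios.
Details
Motivation: When a VLM processes text as rendered images, lowering image resolution to compress the visual representation makes characters illegible, which severely hurts accuracy on text understanding tasks.
Result: On seven text QA benchmarks, LensVLM built on Qwen3.5-9B-Base matches the accuracy of the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text-compression, and visual-compression baselines up to 10.1x effective compression. The method also generalizes to multimodal document and code understanding tasks, and its accuracy advantage over baselines grows as compression increases.
Insight: The main novelty is a selective context expansion framework in which post-training teaches the model to use tools (text expansion or high-resolution image expansion) to recover information lost to compression. The analysis shows that training makes visual compression robust to rendering choices, and that as compression increases the model relies more on reliable expanded content than on unreliable visual reading. The tool-choice guidance is also of practical value: text expansion for rendered text, and high-resolution image expansion for native documents whose layout carries task-relevant information.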
Abstract: Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder’s effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.
[51] Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness cs.CV | cs.AIPDF
Qiangqiang Wu, Grace McIlvain, Zhou Yu, Junhao Wen
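TL;DR: This paper proposes Pan-FM, a pan-organ foundation model for imaging data from seven organs (brain, heart, adipose, liver, kidney, spleen, and pancreas). It uses Saliency-Guided Masking (SGM) to address the bias caused by data that are missing not at random in real-world multimodal biomedical datasets, improving robustness under missing organs and cross-organ representation learning.
Details
Motivation: Existing medical imaging foundation models are typically trained on a single modality or on isolated organs, whereas human aging and disease involve coordinated biological processes across organs. This motivates multimodal foundation models that learn whole-body representations. Real multimodal biomedical data, however, are often missing not at random, which reduces power, limits generalizability, and introduces bias.
Result: On the UK Biobank, Pan-FM outperforms single-organ and multi-organ baselines on prediction across 13 disease categories and 14 single disease entities, and is more robust under missing-organ settings.
Insight: The key contribution is the Saliency-Guided Masking (SGM) mechanism, which uses the model's attention distribution to adaptively mask dominant organs (such as adipose and heart). This mitigates dominant-organ shortcut learning and encourages more balanced cross-organ, whole-body learning. SGM adds negligible computational overhead and integrates seamlessly into existing self-supervised learning frameworks.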
Abstract: Foundation models (FMs) have shown great promise in medical imaging, but most FMs are trained on unimodal data within isolated domains, such as brain MRI alone. Human aging and disease arise through coordinated biological processes across organs, therefore motivating multimodal FMs that learn whole-body representations. A key challenge, however, is that real-world multimodal biomedical data are often missing not at random, which can reduce power, limit generalizability, and introduce bias. We propose Pan-FM, a pan-organ foundation model pre-trained on imaging from seven organs (Brain, Heart, Adipose, Liver, Kidney, Spleen, and Pancreas) under realistic missing-organ scenarios. Pan-FM uses a unified backbone that handles organ missingness during both training and inference, and is pre-trained with masking-based self-distillation. We find that naive multimodal pre-training leads to dominant-organ shortcut learning bias, with the model over-relying on dominant organs such as adipose and heart. To address this, we introduce Saliency-Guided Masking (SGM), which uses the model attention distribution to adaptively mask dominant organs during pre-training, thus encouraging more balanced cross-organ, whole-body learning. Notably, SGM introduces negligible computational overhead and can be seamlessly integrated into existing self-supervised learning frameworks to improve multi-organ representation learning. On the UK Biobank, Pan-FM achieves stronger prediction across 13 disease categories and 14 single disease entities than single-organ and multi-organ baselines, with improved robustness under missing-organ settings. Pan-FM serves as a scalable solution to realistic modality-missingness in multimodal learning in system neuroscience and as a step toward more generalizable whole-body FMs.
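One way to picture Saliency-Guided Masking is as attention-weighted sampling of which organs to hide during pre-training. The sketch below is an assumption about how such a sampler could look, not the paper's procedure: the function name, the `mask_ratio` parameter, and the choice to sample without replacement in proportion to attention mass are all hypothetical.

```python
import random


def saliency_guided_mask(attention, available, mask_ratio=0.3, rng=None):
    """Choose organs to mask, favouring those that attract the most attention.

    attention: dict mapping organ name -> non-negative attention mass.
    available: organs actually present for this subject (others are missing).
    """
    if rng is None:
        rng = random.Random()
    pool = [organ for organ in available if organ in attention]
    if len(pool) < 2:
        return []  # never mask the only organ a subject has
    n_mask = max(1, round(mask_ratio * len(pool)))
    masked = []
    for _ in range(n_mask):
        # Small constant keeps every weight strictly positive.
        weights = [attention[organ] + 1e-6 for organ in pool]
        choice = rng.choices(pool, weights=weights, k=1)[0]
        masked.append(choice)
        pool.remove(choice)  # sample without replacement
    return masked
```

With attention concentrated on adipose and heart, those organs are hidden most often, so the model is pushed to predict from the remaining, less dominant organs.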
[52] Learning to Track Instance from Single Nature Language Description cs.CVPDF
Yaozong Zheng, Bineng Zhong, Qihua Liang, Shuimu Zeng, Haiying Xia
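TL;DR: This paper proposes \tracker, a new self-supervised vision-language tracker that can track any target in a video sequence from a natural language description alone, without relying on any bounding-box annotations. A Dynamic Token Aggregation Module selectively fuses visual and language tokens to strengthen semantic alignment and reduce noise, so that the tracker learns instance tracking autonomously from unlabeled videos.
Details
Motivation: The paper asks how to perform vision-language tracking from natural language descriptions alone, without bounding-box annotations (self-supervised VL tracking), in order to reduce dependence on large-scale labeled data.
Result: On VL tracking benchmarks, \tracker surpasses current state-of-the-art self-supervised methods.
Insight: The novelty is the Dynamic Token Aggregation Module. It selects important target tokens based on an anchor token, merges and aggregates them into the language tokens, and then uses the fused language tokens to guide target extraction in the search frame. This yields semantic alignment without supervision and strengthens temporal prompts, offering a new direction for self-supervised learning.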
Abstract: How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.
[53] Learning Visual Feature-Based World Models via Residual Latent Action cs.CV | cs.AI | cs.LG | cs.ROPDF
Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She
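TL;DR: This paper proposes RLA-WM, a visual feature-based world model that introduces a Residual Latent Action (RLA) representation and predicts it via flow matching, markedly improving prediction in complex interaction scenarios. It outperforms state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets while being faster, and is further applied to a world action model that learns from actionless demonstration videos and to the first visual reinforcement learning framework trained entirely from offline videos.
Details
Motivation: Existing visual feature-based world models rely mainly on direct regression, which tends to produce blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. This paper uses Residual Latent Action (RLA) to address these problems and predict future visual features more efficiently and reliably.
Result: RLA-WM outperforms state-of-the-art (SOTA) feature-based and video-diffusion world models on simulation and real-world datasets, and is orders of magnitude faster than video diffusion models.
Insight: The novelty is learning the Residual Latent Action (RLA) representation from DINO residuals. The representation is predictive, generalizable, and encodes temporal progression. Building the world model on RLA with flow matching avoids the failure modes of direct regression, and the paper also proposes a new visual reinforcement learning framework that needs no online interaction or handcrafted rewards.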
Abstract: World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
[54] Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information cs.CV | cs.LGPDF
Houman Safaai, Andrew T. Landau, Celia C. Beron, Yasin Mazloumi, Bernardo L. Sabatini
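TL;DR: This paper proposes a two-axis view of channel importance in vision networks, decomposing channel information into two separate dimensions: task relevance and local replaceability. For ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two dimensions are weakly aligned and separate rapidly during training, and local replaceability predicts channel pruning outcomes more reliably than task relevance.
Details
Motivation: Conventional channel importance methods summarize a channel with a single score, which hides two fundamentally different questions: how much the channel is related to the task, and whether same-layer peers can take over its function. The paper separates these two dimensions with a two-axis framework.
Result: Under a fixed FLOPs-matched pruning protocol, local-axis metrics (measuring input capture and peer overlap) predict removability more reliably than target-axis metrics (measuring task information and target-excess information) across the three CIFAR-100 backbones. The same direction holds in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. Norm-based baselines remain competitive in architectures such as VGG-16.
Insight: The novelty is decomposing channel importance into two distinct axes, task relevance and local replaceability. Pruning decisions should focus on whether the network still needs a channel when its same-layer peers remain available (local replaceability), rather than only on what the channel contributes to the task (task relevance). A Gaussian linear analysis explains how residualized gradient directions can drive the two axes apart, giving channel pruning a finer-grained theoretical framing.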
Abstract: Channel importance in vision networks is usually summarized by a single score. That summary hides two different questions: how much a channel is related to the task, and whether its function can be supplied by same-layer peers when the channel is removed. We call the second property local replaceability. We introduce a two-axis view that separates these questions. The local axis measures input capture and peer overlap, while the target axis measures task information and target-excess information. Across ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two axes are weakly aligned, induce different channel groupings, and separate rapidly during training despite being strongly coupled at random initialization. A Gaussian linear analysis accounts for how this separation can arise through residualized gradient directions, and lesion plus peer-replacement experiments show that peer support refines removability beyond input capture and task relevance alone. Under the fixed FLOPs-matched pruning protocol, local-axis metrics are more reliable predictors of removability than target-axis metrics across the three CIFAR-100 backbones, with the same direction preserved in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. These findings identify an axis-level distinction rather than a universal ranking of pruning scores: local replaceability is a more reliable guide to removability than target relevance, while norm-based baselines remain competitive in architectures such as VGG-16. Relevance-based scores ask what a channel says about the task; pruning asks whether the network still needs that channel when its peers remain available.
[55] Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition cs.CV | cs.AIPDF
Talha Ilyas, Deval Mehta, Zongyuan Ge
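TL;DR: This paper proposes a neurosymbolic framework for skeleton-based action recognition that reframes the task as concept-driven first-order logical reasoning over motion primitives. By grounding first-order logic predicates in learnable spatial and temporal motion concepts, the framework bridges representation learning and symbolic inference to achieve interpretable action understanding.
Details
Motivation: Most existing skeleton-based action recognition models are black boxes that are hard to interpret. The paper introduces a neurosymbolic reasoning paradigm to give action recognition explicit, interpretable logical explanations.
Result: Extensive experiments on NTU RGB+D 60/120 and NW-UCLA show that the method keeps recognition performance competitive while providing explicit, interpretable explanations grounded in logical structure.
Insight: The core novelty is bringing neurosymbolic reasoning to skeleton-based action recognition. Differentiable first-order logic layers learn human-readable logical rules, and LLM-derived descriptions of atomic motion primitives are used to align the skeleton representations, establishing a shared conceptual space for perception and reasoning and a new paradigm for interpretable spatio-temporal action understanding.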
Abstract: Skeleton-based human activity recognition has achieved strong empirical performance, yet most existing models remain black boxes and difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: https://github.com/Mr-TalhaIlyas/REASON
[56] Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding cs.CV | cs.AIPDF
Yuan Yao, Qiushi Yang, Humen Zhong, Jiangning Wei, Yifang Men
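TL;DR: This paper proposes Qwen3-VL-Seg, a parameter-efficient framework for open-world referring segmentation. It treats the bounding box predicted by a multimodal large language model as a semantically grounded structural prior and decodes it into a pixel-level segmentation with a lightweight box-guided mask decoder. To support scalable training, the authors construct the SA1B-ORS dataset and propose the ORS-Bench evaluation benchmark. Experiments show strong performance in both closed-set and open-world settings, with good generalization.
Details
Motivation: Existing MLLM-based methods perform well on open-world visual grounding, but their outputs are limited to sparse bounding-box coordinates and cannot support dense visual prediction. Existing MLLM segmentation methods either struggle to reconstruct continuous object boundaries or depend on external segmentation foundation models, which adds architectural complexity and deployment overhead.
Result: Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly in both closed-set and open-world settings, with clear advantages on language-intensive instructions and in out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves its general-purpose multimodal capability after segmentation-oriented adaptation.
Insight: The core novelty is using the MLLM-predicted bounding box as a structural prior, paired with a lightweight box-guided mask decoder (only 17M parameters) that combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement. The paper also contributes the large-scale open-world training dataset SA1B-ORS and the evaluation benchmark ORS-Bench.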
Abstract: Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.
[57] AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification cs.CVPDF
Peiyu Duan, Xueqi Guo, Sepehr Farhand, Mehmet Berk Sahin, Xinyuan Zheng
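TL;DR: AGA3DNet is a report-grounded framework for 3D brain MRI subtype classification. It turns brief anatomical phrases extracted from radiology reports into a soft anatomical prior channel and combines it with a lightweight 3D CNN and multi-view xLSTM aggregation, fusing localized anatomical cues with long-range contextual reasoning.
Details
Motivation: 3D brain MRI subtype classification needs both localized anatomical cues and long-range contextual reasoning, and the paper aims to provide both without requiring dense voxel annotations.
Result: On an abnormal subtype discrimination task in a retrospective institutional brain MRI cohort, AGA3DNet achieves a better overall balance across performance metrics than reproducible 3D classification baselines, and supports clinically interpretable localization through the prior channel.
Insight: The novelty is converting anatomical phrases from radiology reports into smooth, Gaussian-weighted spatial priors that serve as an interpretable anatomy-guidance channel, and combining them with multi-view xLSTM feature aggregation. This provides anatomy-grounded guidance without dense annotation.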
Abstract: Accurate 3D brain MRI subtype classification benefits from both localized anatomical cues and long-range contextual reasoning. We present AGA3DNet, a report-grounded framework that incorporates brief anatomical phrases extracted from radiology reports as a soft anatomical prior channel and fuses it with a lightweight 3D CNN and multi-view xLSTM aggregation. Specifically, extracted anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors using a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without requiring dense voxel annotations. We evaluate AGA3DNet on a retrospective institutional brain MRI cohort for abnormal subtype discrimination and compare against reproducible 3D classification baselines. AGA3DNet achieves improved overall balance across performance metrics and supports clinically interpretable localization through the prior channel. We discuss limitations related to single-cohort evaluation and the lack of large-scale public brain MRI datasets paired with radiology reports under broadly usable terms.
[58] UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection cs.CVPDF
Laibin Chang, Shaodong Wang, Yunke Wang, Xu Zhang, Kui Jiang
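TL;DR: This paper proposes UniV2D, a Unified Vision-to-Detection Network for underwater salient object detection. It jointly optimizes visual restoration and salient object detection within a mutually beneficial framework, breaking with the conventional sequential "enhance-then-detect" paradigm and resolving the semantic inconsistency it causes.
Details
Motivation: Conventional underwater salient object detection follows a sequential "enhance-then-detect" pipeline, which isolates low-level visual restoration from high-level semantic perception. This can cause semantic inconsistency and introduce task-irrelevant noise into detection. The paper aims to remove this bottleneck by jointly optimizing visual restoration and semantic perception.
Result: Extensive experiments across multiple benchmarks show that UniV2D significantly outperforms existing state-of-the-art methods in both quantitative and qualitative evaluations, setting a new standard for joint underwater perception.
Insight: The core novelty is a semantic-driven learning paradigm: high-level saliency semantics actively guide restoration, while the restored visual cues in turn enhance saliency perception. This is realized with a hierarchical dual-branch architecture comprising a self-calibrated decoder, a mask-aware restoration module, and a saliency-guided refinement module with cross-level modulation, which together align structural fidelity with semantic consistency.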
Abstract: Underwater salient object detection (USOD) plays a vital role in marine vision tasks but remains fundamentally challenging due to severe visual degradation, such as selective absorption and medium scattering. Conventional pipelines typically adopt a sequential “enhance-then-detect” paradigm. However, isolating low-level visual restoration from high-level semantic perception often leads to semantic inconsistency, where the restored images may not be optimal for detection and can even introduce task-irrelevant noise. To break this sequential bottleneck, we propose UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and salient object detection within a mutually beneficial framework. Unlike traditional methods that rely on disjointed pipelines or rigid physical priors, UniV2D introduces a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues reciprocally enhance saliency perception. Specifically, UniV2D features a hierarchical dual-branch architecture. It first employs a self-calibrated decoder to predict initial saliency masks alongside a mask-aware restoration module to reconstruct image content. Subsequently, a saliency-guided refinement module equipped with cross-level modulation is utilized to align structural fidelity with semantic consistency. Extensive experiments across multiple benchmarks demonstrate that UniV2D significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations, establishing a new standard for joint underwater perception.
[59] Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models cs.CVPDF
Haoming Wang, Wei Gao
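TL;DR: This paper finds that the latent representations of current vision-language models (VLMs) do encode 3D scene topology, but that this spatial representation is overshadowed by non-geometric visual semantics such as color and shape. Using cross-scene linear feature extraction, the authors isolate a clean spatial subspace and prove that its mathematical structure corresponds to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph. Building on this geometric identification, they propose a latent regularization method based on Dirichlet energy that, with only a small amount of fine-tuning on simple synthetic data, significantly improves VLM performance on real-world spatial understanding tasks.
Details
Motivation: Humans navigate environments by building cognitive maps, that is, topology-preserving representations of 3D space. Modern vision-language models show spatial reasoning ability from 2D egocentric inputs, but it is unclear whether they form an analogous internal 3D representation. The paper investigates whether VLMs hold a latent representation of 3D scene topology and addresses the fact that non-geometric semantics overshadow it.
Result: With only 500 steps of supervised fine-tuning on simple synthetic data plus the proposed Dirichlet energy regularizer, the model improves significantly on real-world spatial understanding benchmarks, outperforming standard supervised fine-tuning and competitive baselines by up to 12.1% on spatial tasks involving scene topology understanding.
Insight: The core contributions are: 1) revealing a spatial subspace in VLM latent representations that corresponds to 3D scene topology, and proving its correspondence to physical 3D space through Laplacian eigenmaps; 2) a mathematically principled latent regularization method based on Dirichlet energy that effectively shapes the model's internal spatial representation and generalizes to complex real-world tasks with very little fine-tuning, offering a new route to understanding and improving spatial reasoning in VLMs.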
Abstract: Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. By isolating this spatial subspace through cross-scene linear feature extraction, we extract a clean spatial subspace that causally controls the model’s spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene’s 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised VLM fine-tuning (SFT) on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping
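For readers unfamiliar with the term, the Dirichlet energy of latent vectors on a weighted graph is E = (1/2) Σ_{i,j} w_ij ‖z_i − z_j‖², which is small when strongly connected nodes have similar latents. The sketch below computes it for a Gaussian-kernel graph over 3D positions. It illustrates the standard definition only; how the paper builds the graph, picks the kernel width, or weights the term in the training loss is not stated in the abstract, so `sigma` and both function names are assumptions.

```python
import math


def gaussian_kernel_weights(points, sigma=1.0):
    """Symmetric edge weights w_ij = exp(-||p_i - p_j||^2 / (2 sigma^2)), zero diagonal."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return weights


def dirichlet_energy(latents, weights):
    """E = 1/2 * sum_{i,j} w_ij ||z_i - z_j||^2, computed over pairs i < j."""
    n = len(latents)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(latents[i], latents[j]))
            energy += weights[i][j] * d2  # each unordered pair counted once
    return energy
```

Latents that respect the scene layout (nearby objects mapped to nearby vectors) give a lower energy than latents that scramble it, which is why adding the energy as a penalty nudges representations toward the scene's topology.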
[60] Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection cs.CVPDF
Wenbing Zhu, Jianing Liang, Linjie Cheng, Yurui Pan, Zhuhao Chen
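TL;DR: This paper introduces Real-IAD-MVN, a large-scale multi-view normal vector dataset for high-fidelity industrial anomaly detection, built to address the weakness of existing methods on subtle geometric defects. By capturing high-fidelity surface normal maps from five distinct viewpoints, the dataset provides a comprehensive geometric representation at the micro-detail level. Experiments show that dense multi-view pseudo-3D (surface normal) data yields significantly better detection performance than sparse 3D point cloud data.
Details
Motivation: Existing industrial anomaly detection methods struggle with subtle geometric defects. Standard 2D RGB images are sensitive to texture and lighting and often miss fine geometric anomalies, while 3D point clouds capture macro-shape but are typically too sparse to detect micro-defects such as scratches or pits. The paper addresses this fundamental data limitation.
Result: Experiments on the new dataset provide the first evidence that incorporating dense multi-view pseudo-3D (surface normal) data gives significantly better detection performance than sparse 3D point cloud data. In addition, the proposed reconstruction-based baseline, which learns to extract cross-modal unified prototypes from the image and normal map streams, surpasses existing state-of-the-art multimodal fusion methods and provides a strong benchmark for the dataset.
Insight: The core contribution is the first large-scale, high-fidelity multi-view surface normal dataset for industrial anomaly detection, whose dense geometric representation makes previously invisible side-wall and occluded defects detectable. On the method side, the cross-modal unified prototype learning approach fuses image and normal information effectively, giving geometric anomaly detection both a new data benchmark and an effective fusion strategy.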
Abstract: Industrial Anomaly Detection (IAD) is critical for quality control, but existing methods struggle with subtle, geometric defects. Standard 2D (RGB) images are sensitive to texture and lighting but often miss fine geometric anomalies. While 3D point clouds capture macro-shape, they are typically too sparse to detect micro-defects like scratches or pits. We address this fundamental data limitation by introducing Real-IAD-MVN (Multi-View Normal), a large-scale industrial dataset. By upgrading our acquisition system, Real-IAD-MVN captures high-fidelity surface normal maps from five distinct viewpoints, replacing sparse 3D data entirely. This provides a comprehensive geometric representation at a micro-detail level, making previously invisible side-wall and occluded defects explicitly detectable. Our experiments, conducted on this new dataset, first provide evidence that incorporating dense, multi-view pseudo-3D (surface normals) yields significantly better detection performance than using sparse 3D point cloud data. To further validate the dataset and provide a strong benchmark, we introduce a baseline method based on reconstruction, which learns to extract cross-modal unified prototypes from the image and normal map streams. We demonstrate that this unified prototype approach surpasses existing state-of-the-art multimodal fusion methods, highlighting the rich potential of our new dataset for advancing geometric anomaly detection.
[61] PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition cs.CVPDF
Yuchen He, Jing Zhang
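TL;DR: This paper proposes PRIMED for Referring Audio-Visual Segmentation, where the relevance of multimodal cues varies dynamically. Inspired by biased competition theory in cognitive neuroscience, the method uses a Modality Prior Decoder to estimate which modality an expression mainly depends on, and a Token Distiller together with Competition-aware Cross-modal Fusion modules to suppress modalities adaptively, segmenting target objects more accurately.
Details
Motivation: Existing methods typically treat multimodal cues as homogeneous inputs for fusion or reasoning, which makes them vulnerable to irrelevant or misleading modalities. In practice the relevance of each modality changes across referring expressions and scenes, so a method that adaptively suppresses irrelevant modalities is needed.
Result: Extensive experiments on the Ref-AVS benchmark show that PRIMED achieves state-of-the-art overall performance.
Insight: The contributions are: explicit modeling of visual perception and language-driven prior modulation, inspired by biased competition theory; a Modality Prior Decoder that adaptively generates a modality prior; a Token Distiller that provides hierarchical global context; and a Spatial-Aware Semantic Alignment loss that strengthens foreground-background discrimination through contrastive learning. Together these mechanisms attend to relevant modalities and suppress irrelevant ones as relevance shifts.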
Abstract: Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.
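[62] Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection cs.CV
Kai Zheng, Hang-Cheng Dong, Jiatong Pan, Zhenkai Wu, Fupeng Wei
TL;DR: This paper proposes S2M, a framework that automatically extracts structured textual features (semantic quadruples: location, category, type of change, count) from the annotation masks of remote sensing change detection datasets. It gives single-modal images precise, dense, noise-free multimodal supervision at zero additional annotation cost, improving change detection accuracy.
Details
Motivation: Unimodal deep learning methods tend to confuse genuine semantic changes with visually similar but irrelevant variations. Existing multimodal methods that add text as auxiliary supervision rely on descriptions that are either semantically coarse and unstructured or model-generated and therefore noisy. The authors observe that the ground-truth masks of change detection datasets already implicitly encode fine-grained change semantics, which existing methods overlook.
Result: On Gaza-Change-v2, a new multi-class change detection dataset built by the authors, S2M achieves a Sek of 17.80% and an Fscd of 66.14%, notably surpassing multimodal methods, including those that use large language models.
Insight: The core insight is that annotation masks are themselves a source of high-quality structured text, and the paper proposes a way to generate semantic-quadruple descriptions from masks automatically, giving zero-cost, high-precision multimodal supervision. The two-stage training strategy (fine-tune the visual features first, then add a bi-directional contrastive loss for multimodal alignment) and the new dataset are also useful contributions.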
Abstract: Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy: the model is first fine-tuned on remote sensing imagery to obtain robust domain-specific representations, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an F$_{\text{scd}}$ of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.
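The mask-to-text step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the quadrant-based notion of "where", the use of the before/after class pair as both "what" and "how", and the sentence template are all assumptions made for illustration.

```python
from collections import deque


def change_regions(before, after):
    """Return (before_cls, after_cls, pixels) for each 4-connected changed region."""
    h, w = len(before), len(before[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or before[r][c] == after[r][c]:
                continue  # already visited, or no change at this pixel
            key = (before[r][c], after[r][c])
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill over the same transition
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and (before[ny][nx], after[ny][nx]) == key):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((key[0], key[1], pixels))
    return regions


def location(pixels, h, w):
    """Hypothetical 'where': the image quadrant that contains the region centroid."""
    cy = sum(p[0] for p in pixels) / len(pixels)
    cx = sum(p[1] for p in pixels) / len(pixels)
    return ("top" if cy < h / 2 else "bottom") + "-" + ("left" if cx < w / 2 else "right")


def describe(before, after):
    """Group regions into (where, what-before, what-after) and count them (how many)."""
    h, w = len(before), len(before[0])
    counts = {}
    for b, a, pixels in change_regions(before, after):
        key = (location(pixels, h, w), b, a)
        counts[key] = counts.get(key, 0) + 1
    # Fixed template (an assumption; the paper's templates are not given in the abstract).
    return [f"In the {where} area, {n} region(s) changed from {b} to {a}."
            for (where, b, a), n in sorted(counts.items())]
```

Because every sentence is derived deterministically from the labels, text of this kind carries no annotation cost and no generation noise, which is the property the abstract emphasizes.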
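[63] SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction cs.CV
Min Chen, Wei Guo, Bin Wang, Wen Li, Tong Fang
TL;DR: This paper proposes SatSurfGS, a generalizable method based on 2D Gaussian Splatting (2DGS) for surface reconstruction from sparse-view satellite images. It builds a coarse-to-fine Gaussian attribute prediction framework and explicitly models local geometric reliability at three levels (feature learning, Gaussian parameter estimation, and training optimization) to cope with the strong spatial heterogeneity of multi-view matching reliability under satellite imaging conditions.
Details
Motivation: Sparse-view satellite surface reconstruction is highly challenging, mainly because the reliability of multi-view matching under satellite imaging conditions is highly heterogeneous in space: large photometric differences, weak textures, and repetitive textures make multi-view geometric constraints sparse, unevenly distributed, and locally unreliable. Although 2DGS is better suited than 3DGS to the explicit representation of continuous surfaces, a generalizable feed-forward 2DGS framework for sparse-view satellite surface reconstruction is still lacking.
Result: Experiments on satellite datasets show improved rendering quality, surface reconstruction accuracy, cross-dataset generalization, and inference efficiency compared with representative generalizable baselines and competitive per-scene optimization methods.
Insight: The novelty lies in a coarse-to-fine Gaussian attribute prediction framework that systematically introduces confidence-aware mechanisms at three levels (feature fusion, parameter refinement, and the loss function) to model local geometric reliability. Concretely, it comprises a confidence-aware monocular multi-view feature fusion module, a cross-stage self-consistency residual guidance module, and a confidence bidirectional routing loss. This offers a structured way to handle sparse, unreliable multi-view constraints.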
Abstract: Sparse-view satellite image surface reconstruction remains highly challenging, fundamentally because the reliability of multi-view matching under satellite imaging conditions is strongly spatially heterogeneous. Affected by large photometric differences, weak textures, and repetitive textures, multi-view geometric constraints are often sparse, unevenly distributed, and locally unreliable. Although 2D Gaussian Splatting (2DGS) is more suitable than 3D Gaussian Splatting (3DGS) for the explicit representation of continuous surfaces, research on generalizable feed-forward 2DGS frameworks for sparse-view satellite surface reconstruction is still lacking. To address this issue, we propose SatSurfGS, a generalizable sparse-view surface reconstruction method for satellite imagery based on 2DGS. The proposed method builds a coarse-to-fine Gaussian attribute prediction framework and explicitly models local geometric reliability at three levels: feature learning, Gaussian parameter estimation, and training optimization. Specifically, we propose a confidence-aware monocular multi-view feature fusion module to adaptively integrate monocular priors and multi-view matching features according to local confidence; a cross-stage self-consistency residual guidance module to stabilize stage-wise Gaussian parameter refinement using the residual between the rendered height map from the previous stage and the current-stage MVS height map, together with confidence information; and a confidence bidirectional routing loss to achieve differentiated allocation of geometric and appearance supervision. Experiments on satellite datasets show that the proposed method achieves improved rendering quality, surface reconstruction accuracy, cross-dataset generalization, and inference efficiency compared with representative generalizable baselines and competitive per-scene optimization methods.
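[64] Attention Transfer Is Not Universally Effective for Vision Transformers cs.CV | cs.LG
Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Peng Hu, Chen Gong
TL;DR: This paper re-evaluates Attention Transfer for Vision Transformers and finds that it is not universally effective: of 11 ViT families, 7 transfer successfully, but 4 fail and fall below the no-transfer baseline. Controlled experiments localize the failure to architectural mismatch, and adding the teacher's architectural components fully reverses it.
Details
Motivation: To re-examine whether Attention Transfer holds universally for Vision Transformers, challenging the earlier conclusion that transferring attention patterns alone fully recovers the benefit of the teacher's pre-trained weights.
Result: On a comprehensive benchmark of 20 teacher models, 4 ViT families fail to transfer, dropping up to 5.1% below the no-transfer baseline; after the teacher's architectural components are added, performance recovers fully for all failing families.
Insight: The key finding is that the effectiveness of Attention Transfer depends on the student and teacher architectures matching, not merely on aligning attention patterns. Architectural mismatch makes the transferred attention patterns unusable for the student, which refines the prevailing understanding of the role of attention in ViT representations.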
Abstract: A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher’s pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher’s attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher’s native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher’s attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient only when the student architecture matches the teacher.
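[65] Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models cs.CV | cs.AI | cs.LG
Bincheng Peng, Guang Li, Ping Liu, Takahiro Ogawa, Miki Haseyama
TL;DR: This paper proposes CLP-DD, a closed-form linear-probe dataset distillation method for pre-trained vision models. By exploiting the closed-form solution of linear probing on frozen features, it avoids trajectory-based gradient matching and neural-tangent-kernel approximations, generates synthetic datasets efficiently, and substantially improves performance on ImageNet-100 and ImageNet-1K at much lower computational cost.
Details
Motivation: Existing dataset distillation methods mainly target networks trained from scratch, whereas modern visual transfer learning often pairs a frozen pre-trained encoder with a lightweight linear probe. Existing methods for this setting either rely on trajectory-based gradient matching or use closed-form neural-tangent-kernel approximations designed for from-scratch training; neither exploits the fact that frozen-feature linear probing itself admits a closed-form solution.
Result: On ImageNet-100, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a much lower computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four pre-trained backbones while running roughly 14 times faster and using less than one-eighth of the GPU memory.
Insight: The novelty is the pairing of a closed-form linear-probe inner solver with a discriminative outer loss (temperature-scaled softmax cross-entropy), which avoids infinite-width approximations and inner-loop trajectories and uses the classifier columns as learned class anchors in feature space. The analysis shows that the choice of outer loss is decisive: a standard MSE loss underperforms, while the discriminative loss closes most of the gap.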
Abstract: Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.
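The bilevel structure described in this abstract can be sketched on toy data. The code below is an illustrative forward pass only, written under several assumptions: plain lists stand in for frozen features, the ridge strength `lam` and temperature `tau` are made-up defaults, and all function names are hypothetical. In the actual method the synthetic images would be updated by differentiating the outer loss through the closed-form solve, which this sketch does not do.

```python
import math


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def induced_probe(Zs, Ys, lam=1e-3):
    """Sample-space ridge solution W = Zs^T (Zs Zs^T + lam I)^{-1} Ys.

    Zs: synthetic features (n x d); Ys: one-hot labels (n x c); W: d x c.
    """
    n = len(Zs)
    K = [[sum(a * b for a, b in zip(Zs[i], Zs[j])) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, Ys)  # n x c dual coefficients
    d, c = len(Zs[0]), len(Ys[0])
    return [[sum(Zs[i][k] * alpha[i][j] for i in range(n)) for j in range(c)]
            for k in range(d)]


def outer_loss(W, Zr, labels, tau=0.1):
    """Mean temperature-scaled softmax cross-entropy of the probe on real features."""
    total = 0.0
    for z, y in zip(Zr, labels):
        logits = [sum(z[k] * W[k][j] for k in range(len(z))) / tau
                  for j in range(len(W[0]))]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_norm - logits[y]
    return total / len(Zr)
```

Solving in sample space keeps the linear system at size n x n, where n is the number of synthetic samples; since n is small by design, that is presumably what makes the inner step cheap compared with unrolled training trajectories.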
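[66] From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting cs.CV
Chamuditha Jayanga Galappaththige, Jason Lai, Timothy Patten, Donald Dansereau, Niko Suenderhauf
TL;DR: This paper proposes GS-DIFF, a scene change detection method based on 3D Gaussian Splatting that compares raw Gaussian primitives (position, covariance, color) directly instead of following the conventional render-then-compare paradigm over pixels or features. Anisotropic models of geometric and photometric drift, plus a per-primitive observability term, address the inconsistency between primitive representations produced by independent optimizations. The result is multi-view consistent change detection that separates geometric from appearance changes without supervision.
Details
Motivation: Existing Gaussian-splatting-based scene change detection methods all follow the render-then-compare paradigm and treat the problem as a pixel-level comparison. The paper argues that it should be treated as a primitive-level problem, aims to show that native primitive attributes already carry sufficient signal for change detection, and tackles the difficulty that an under-constrained representation makes primitive attributes differ even in unchanged regions.
Result: On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art method by approximately 17% in mean Intersection over Union (mIoU).
Insight: The novelty is moving change detection from pixel space to primitive space: by modeling geometric and photometric drift and primitive observability, the method compares Gaussian primitive attributes directly, which guarantees multi-view consistency by construction and separates geometric from appearance changes without supervision. This offers a more fundamental and efficient approach to change detection on neural scene representations.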
Abstract: Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. This change detection problem with Gaussian Splatting has been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone – position, anisotropic covariance, and color – carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of the Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GS-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by approximately 17% in mean Intersection over Union.
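[67] DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation cs.CV
Wei Jiang, Feng Liu, Nan Ye, Hongfu Sun
TL;DR: This paper proposes DINO-MVR, a multi-view readout framework for annotation-efficient medical image segmentation. On top of a frozen DINOv3 backbone, it trains only lightweight MLP probes that read features from the last three transformer blocks, without fine-tuning the backbone. At inference, probability maps are generated from complementary resolutions and test-time augmentations, then combined by entropy-weighted fusion and refined with spatial regularization; for volumetric data, Gaussian z-axis smoothing improves inter-slice consistency.
Details
Motivation: Adapting foundation models to medical segmentation usually requires fine-tuning the backbone or using a high-capacity task-specific decoder, both of which are hard to fit reliably when annotations are scarce. The authors find that frozen DINOv3 features already contain useful structural and boundary cues and that the main bottleneck is how to read these features out effectively.
Result: Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance: a Dice score of 0.895 on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR whole-tumor segmentation. With only 5 annotated BraTS patients, it recovers 98.4% of the performance of the reference run that uses 40 patients.
Insight: The novelty is a multi-view readout framework that makes effective use of features from a frozen self-supervised vision backbone for medical segmentation, avoiding backbone fine-tuning and greatly reducing annotation requirements. Entropy-weighted fusion, spatial regularization, and volumetric smoothing improve robustness and consistency, giving an efficient solution for medical image analysis when annotations are scarce.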
Abstract: Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.
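The entropy-weighted fusion mentioned in this abstract can be sketched for the binary case. The abstract does not give the weighting formula, so the choice below (weight = 1 minus normalized binary entropy, per pixel and per view) is an assumption, as are the function names.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli foreground probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(prob_maps, eps=1e-7):
    """Fuse per-view foreground probability maps, trusting confident views more.

    prob_maps: list of views, each a 2D list of probabilities with equal shape.
    Assumed weight: 1 - H(p) / ln 2, so a view at p = 0.5 contributes almost nothing.
    """
    h, w = len(prob_maps[0]), len(prob_maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ps = [view[y][x] for view in prob_maps]
            ws = [max(0.0, 1.0 - binary_entropy(p) / math.log(2.0)) + eps for p in ps]
            fused[y][x] = sum(wi * pi for wi, pi in zip(ws, ps)) / sum(ws)
    return fused
```

Compared with a plain average over resolutions and augmentations, a scheme of this kind lets an uncertain view defer to a confident one at each pixel, which is one plausible reading of why the fusion helps when the probes are lightweight.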
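[68] Towards multi-modal forgery representation learning for AI-generated video detection and localization cs.CV
Dat Le, Khoa Nguyen, Xin Wang, Shu Hu
TL;DR: This paper proposes a multi-modal forgery representation learning method for detecting and localizing AI-generated video. It fuses an LMM semantic branch, a spatio-temporal visual branch, and a multi-scale partial-spoof audio branch to detect partially manipulated videos and localize forgeries in time at fine granularity. Experiments show that it outperforms existing state-of-the-art methods.
Details
Motivation: Advances in generative AI have made large-scale video creation widely accessible, but AI-generated videos, including partially manipulated clips across visual and audio channels, carry risks of semantic distortion and misuse. Existing detectors are limited to single- or partial-modality modeling and lack fine-grained temporal forgery localization.
Result: Extensive experiments show that the method outperforms existing state-of-the-art methods on AI-generated video detection and localization.
Insight: The novelty is a core architecture that jointly integrates an LMM semantic branch, a spatio-temporal visual branch, and a multi-scale partial-spoof audio branch, so that multi-modal fusion supports detection and fine-grained temporal localization at the same time. This addresses the incomplete modality coverage and coarse localization of existing methods.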
Abstract: Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including partially manipulated clips across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited to single- or partial-modality modeling and lack fine-grained temporal forgery localization. To address these challenges, we introduce a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
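[69] Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment cs.CV | cs.AI
Zhixue Song, Boyan Han, Yiwei Wang, Chi Zhang
TL;DR: This paper finds that the safety defenses of multimodal large language models (MLLMs) deteriorate sharply when they process text images that have been visually compressed (for example, by lowering resolution), making the models easy to jailbreak. The authors attribute this to "Cognitive Overload" and propose a "Structured Cognitive Offloading" strategy to mitigate the risk.
Details
Motivation: To expose a critical safety vulnerability in vision-based context compression (such as rendering text into images): degraded image quality inadvertently bypasses the safety alignment of MLLMs.
Result: Experiments show that under visual degradation such as reduced image resolution, the safety defenses of SOTA models deteriorate sharply even when the text remains legible, and the phenomenon holds across various visual perturbations, including noise and geometric distortion.
Insight: The novelty is the first systematic demonstration that visual degradation is a general safety vulnerability of MLLMs, together with the "Cognitive Overload" explanation and the "Structured Cognitive Offloading" defense, which provide key insights for the secure design of future MLLMs.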
Abstract: Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades and, surprisingly, that this deterioration persists even when the text remains legible. We attribute this to "Cognitive Overload", hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple "Structured Cognitive Offloading" strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.
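[70] High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images cs.CV | cs.GR
Nandhana Sunil, Abhirami R Iyer, Avirup Mandal
TL;DR: This paper proposes a high-fidelity, surface-splatting-based 3D reconstruction method that introduces a compact polynomial kernel and stochastic regularization to reconstruct high-quality geometry and appearance directly from multi-view images.
Details
Motivation: To address two problems in multi-view mesh reconstruction: methods such as 3DGS and NeRF depend on post-processing to extract meshes, and the exponential kernels used in Implicit Moving Least Squares (IMLS) struggle to capture high-frequency detail.
Result: The method achieves state-of-the-art surface reconstruction and rendering on multi-view data, in both geometric accuracy and visual sharpness.
Insight: The contributions are a compact polynomial kernel with local support that gives flexible control over frequency content, combined with stochastic regularization with Laplacian filtering to better preserve high-frequency detail, supporting end-to-end joint optimization of geometry and appearance.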
Abstract: Multi-view mesh reconstruction remains a core challenge in computer graphics and vision, especially for recovering high-frequency geometry from sparse observations. Recent methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) rely on post-processing for mesh extraction, thereby limiting joint optimization of geometry and appearance. Implicit Moving Least Squares (IMLS) instead enables direct conversion of point clouds into signed distance and texture fields, supporting end-to-end reconstruction and rendering. However, existing IMLS formulations use exponential kernels that struggle with high-frequency detail. We introduce a compact polynomial kernel with local support and greater flexibility, allowing better control over frequency content and improved geometric fidelity. To further enhance fine details, we incorporate stochastic regularization with Laplacian filtering. Together, these improve the preservation of high-frequency structure while maintaining stable optimization. Experiments show state-of-the-art performance in both surface reconstruction and rendering, yielding more accurate geometry and sharper visuals from multi-view data.
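[71] TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts cs.CV
Jeimin Jeon, Hyunju Lee, Bumsub Ham
TL;DR: This paper proposes TAS-LoRA, a new Transformer architecture search (TAS) method that introduces parameter-efficient low-rank adaptation (LoRA) to address the feature collapse problem in existing TAS methods. It adopts a Mixture-of-LoRA-Experts (MoLE) strategy in which a lightweight router dynamically assigns LoRA experts according to the subnet architecture, and a group-wise router initialization technique that encourages diverse feature learning across experts early in training.
Details
Motivation: Existing TAS methods suffer from feature collapse: because subnets in a supernet share weights, they fail to learn subnet-specific features, which limits the performance of individual subnets. The paper aims to enable subnet-specific feature learning while maintaining computational efficiency.
Result: Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, show that TAS-LoRA effectively mitigates feature collapse and significantly outperforms state-of-the-art TAS methods.
Insight: The novelty is bringing parameter-efficient LoRA into architecture search and combining it with the MoLE strategy and group-wise router initialization, which enables subnet-specific feature learning while keeping parameter and compute costs low. This offers a new way to mitigate feature-sharing problems in supernet training.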
Abstract: Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing the human effort of manually designing ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA-Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.
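A minimal sketch of how a mixture of LoRA experts can be routed by a subnet descriptor is given below. It is an illustration under assumptions, not the paper's design: the router is taken to be a single linear map followed by a softmax, the architecture descriptor `arch` is a hypothetical feature vector, and every name is made up for the example.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def mole_forward(x, W, experts, router, arch):
    """Compute y = W x + sum_e g_e * B_e (A_e x) with g = softmax(router @ arch).

    W:       shared supernet weight (out x in), common to all subnets
    experts: list of (A, B) low-rank pairs, A is (rank x in), B is (out x rank)
    router:  (num_experts x arch_dim) matrix, an assumed linear router
    arch:    hypothetical descriptor of the sampled subnet (e.g. depth, width)
    """
    gates = softmax(matvec(router, arch))
    y = matvec(W, x)
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))  # low-rank update, never forms B @ A
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y, gates
```

The point of the sketch is that the shared weight `W` stays common to every subnet while the gated low-rank terms differ per architecture, which is how subnet-specific features can be added at small parameter cost.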
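[72] Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning cs.CV
Qiaoyi Yang, Chaoyi Zhou, Xi Liu, Run Wang, Minghui Xu
TL;DR: Sat3R is a feed-forward framework that adapts a monocular depth foundation model to satellite imagery through RPC-aware metric depth fine-tuning of Depth Anything V2 with the Scale-Invariant Logarithmic (SiLog) loss. It needs no per-scene optimization and balances accuracy and efficiency in satellite Digital Surface Model (DSM) reconstruction.
Details
Motivation: To resolve the fundamental trade-off in satellite DSM reconstruction between optimization-based methods, which are computationally expensive, and general geometry foundation models, which generalize poorly to the domain because of the RPC camera model and mismatched depth scale distributions.
Result: On the DFC2019 benchmark, Sat3R reduces mean absolute error by 38% relative to zero-shot feed-forward baselines and reaches accuracy comparable to optimization-based methods, with a speedup of more than 300x.
Insight: The core idea is to construct physically consistent pseudo depth supervision from RPC geometry and use it to adapt a pre-trained depth foundation model to the satellite domain. This shows that a properly adapted feed-forward model can reach the accuracy of optimization-based methods at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.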
Abstract: Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.
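For readers unfamiliar with the SiLog loss named in this abstract, the sketch below shows the commonly used scale-invariant log form. The abstract does not state the exact variant or the weighting used in Sat3R, so the default `lam=0.5` and the masking of non-positive targets are assumptions.

```python
import math


def silog_loss(pred, target, lam=0.5):
    """Scale-invariant log loss: mean(d^2) - lam * mean(d)^2, d = log(pred) - log(target).

    pred, target: flat sequences of positive depths; pixels with target <= 0 are
    treated as invalid and skipped (an assumption about how gaps are masked).
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target) if t > 0]
    n = len(d)
    mean_sq = sum(v * v for v in d) / n
    mean = sum(d) / n
    return mean_sq - lam * mean * mean
```

With `lam=1.0` the loss ignores a global scale error entirely (for example, a prediction that is exactly twice the target everywhere), while smaller values of `lam` still penalize it partially. Some implementations additionally take a square root of this quantity.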
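[73] From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG cs.CV | cs.AI
Jiaju Han, Chao Li, Chengyin Hu, Qike Zhang, Xuemeng Sun
TL;DR: This paper proposes CloudWeb, an attack that overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes the resulting image embeddings to hijack the vision-language retrieval stage of remote sensing multimodal RAG systems. It injects weather-related evidence into the retrieved results and causes weather hallucination and semantic shift in downstream generation.
Details
Motivation: Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models usually target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored.
Result: On a seven-dataset remote sensing RAG benchmark evaluated with five CLIP-style retrievers (including GeoRSCLIP and RemoteCLIP), CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants at injecting weather-related evidence into top-ranked results. For example, on GeoRSCLIP ViT-B/32, Weather@5 rises from 0.71% to 43.29%. Downstream generation further shows measurable weather hallucination and semantic shift.
Insight: The novelty is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. It reveals a practical failure mode in which modifying only the input image corrupts evidence retrieval before generation begins, showing that natural-looking atmospheric changes can compromise the reliability of RAG systems.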
Abstract: Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71% to 43.29%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.
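[74] Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training cs.CV | cs.AI
Jiaxuan Gao, Yongjian Guo, Zhong Guan, Wen Huang, Wanlun Ma
TL;DR: This paper proposes Sword, a framework that uses Dynamic Latent Bootstrapping and Structure-Guided Style Augmentation to improve the robustness and generalization of world models used as simulators for post-training Vision-Language-Action (VLA) policies. It targets the poor generalization and hallucinations that existing world models show in specific environments (such as the LIBERO benchmark) because of initial-state perturbations and long-horizon error accumulation.
Details
Motivation: When used as generative simulators, existing world models are sensitive to initial-state perturbations (such as changes in color or illumination) and prone to cascading hallucinations, and long-horizon error accumulation degrades the quality and fidelity of predicted states. Both issues limit their reliability as simulators for VLA policy post-training.
Result: Extensive experiments on the LIBERO benchmark show that the method significantly outperforms the baseline WoVR in generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
Insight: The contributions are: 1) Structure-Guided Style Augmentation, which disentangles the visual textures of interactive environments from task-relevant dynamics to improve generalization; and 2) Dynamic Latent Bootstrapping, which keeps training and inference consistent while keeping memory consumption under control. Viewed objectively, decoupling style from dynamics and improving the consistency of latent representations makes world models more practical and stable as simulators.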
Abstract: The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within “imagination.” However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
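[75] EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams cs.CV | cs.AI
Dongchuan Ran, Linyu Ou, Xueheng Li, Wenwen Tong, Chenxu Guo
TL;DR: This paper presents EgoPro-Bench, a new benchmark based on streaming egocentric video for training and evaluating personalized proactive interaction. It contains 2,400 videos in the evaluation set and more than 12,000 in the training set, uses simulated user profiles to generate diverse user intentions, and builds high-fidelity human-machine interaction data across 12 domains. The authors also propose a dedicated evaluation protocol and metrics, train proactive interaction models for efficient reasoning and low-latency interaction on streaming video, and introduce a "short thinking, better interaction" principle to improve performance.
Details
Motivation: Existing multimodal large language models are mainly reactive and cannot continuously perceive the environment or proactively assist users. Existing proactive interaction benchmarks are mostly confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interaction.
Result: Experiments show that EgoPro-Bench substantially improves the intention understanding of MLLMs and enables accurate identification of appropriate moments for human-machine interaction, laying a solid foundation for next-generation user-centric proactive agents.
Insight: The contributions are: 1) the first comprehensive benchmark focused on personalized proactive interaction in streaming egocentric video; 2) simulated user profiles used to generate diverse intentions and high-fidelity interaction data; 3) the "short thinking, better interaction" principle, which allocates a limited token budget before intent recognition to improve performance; and 4) a dedicated evaluation protocol and low-latency interaction models. Viewed objectively, the work extends proactive interaction evaluation from static or limited scenarios to dynamic, continuous streaming video and highlights the importance of timing and personalized context.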
Abstract: Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI). In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set. Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains. Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations. Furthermore, we introduce an interaction principle termed “short thinking, better interaction”, which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance. The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.
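[76] Amortized-Precision Quantization for Early-Exit Vision Transformers cs.CV | cs.AI
Rui Fang, Hsi-Wen Chen, Ming-Syan Chen
TL;DR: To address the instability of Vision Transformers (ViTs) under low-precision early-exit inference, this paper proposes Amortized-Precision Quantization (APQ), which accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, the authors propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level optimization framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a better Pareto frontier on classification, detection, and segmentation, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20%.
Details
Motivation: Existing quantization methods assume static full-depth execution. When early-exit decisions are perturbed by quantization noise, errors are amplified along dynamic inference paths, which makes ViTs fragile in low-precision early-exit deployment. The paper aims to resolve this instability.
Result: MAQEE outperforms strong baselines on classification, detection, and segmentation by up to 20% while reducing BOPs by up to 95%, establishing a better Pareto frontier in the accuracy-efficiency trade-off.
Insight: The contributions are APQ, a utilization-aware formulation that accounts for layer-wise stochastic quantization noise, and the MAQEE bi-level framework, which jointly optimizes exit thresholds and bit-widths under risk control. Together they improve the stability of dynamic early-exit inference and offer a new approach to efficient ViT deployment.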
Abstract: Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.
[77] GEM: Generating LiDAR World Model via Deformable Mamba cs.CVPDF
Yang Wu, Zhaojiang Liu, Qiang Meng, Youquan Liu, Renliang Weng
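TL;DR: This paper proposes GEM, a generative LiDAR world model built on a deformable Mamba architecture. It targets two difficulties: the unordered nature of LiDAR point clouds and the separation of dynamic objects from static structures. With a custom LiDAR scene tokenizer, a dynamic-static separator, and a tri-path deformable Mamba module, GEM improves spatiotemporal understanding of how the world evolves and supports autonomous rollout and what-if scenario generation.
Details
Motivation: LiDAR-based world models have lagged behind other world models. The core challenges are the unordered nature of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures.
Result: GEM achieves state-of-the-art (SOTA) performance across multiple benchmarks and evaluation settings.
Insight: The contributions include exploiting the structural similarity between sequential laser scanning and Mamba's processing mechanism to obtain compact representations of LiDAR sweeps, separating dynamic and static features through unsupervised disentanglement, and introducing a tri-path deformable Mamba for selective scanning and adaptive gating fusion, which strengthens spatiotemporal modeling. The model can also be combined with a planner and a BEV layout controller to explore autonomous rollout and what-if scenario generation.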
Abstract: World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages a deformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba’s processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model’s capability for autonomous rollout and its potential to generate “what-if” scenarios. Extensive experiments show that GEM achieves state-of-the-art performance across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM.
[78] RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation cs.CVPDF
Junwei Wen, Deshui Miao, Guangming Lu, Xin Li, Wenjie Pei
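TL;DR: This paper proposes RCoT-Seg, a framework for video reasoning segmentation. It splits the task into two stages, temporal video reasoning and keyframe target perception. A reinforcement-learning-based agentic module selects the keyframe, and SAM2 performs high-resolution segmentation and mask propagation, improving temporal understanding and spatial precision in complex scenes.
Details
Motivation: Existing MLLM-based video reasoning segmentation methods usually select frames by simple sampling or with an auxiliary model. Their limited supervision and frame-language similarity rules yield narrow keyframe choices, which weakens holistic temporal understanding and makes localization brittle in complex multi-object scenes. The paper aims to address these issues.
Result: Extensive experiments show that RCoT-Seg outperforms state-of-the-art methods.
Insight: The main contribution is the explicit decomposition of video reasoning segmentation into temporal reasoning and spatial perception, together with an agentic keyframe selection module optimized by reinforcement learning (GRPO). The module self-evaluates and reselects keyframes, which strengthens temporal localization. Using SAM2 for high-resolution segmentation and mask propagation replaces heuristic sampling and external selectors and improves spatial precision and inter-frame consistency.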
Abstract: Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is proposed to generate and reselect the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates masks with SAM2-based methods across the sequence, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Extensive experimental results demonstrate that the proposed RCoT-Seg achieves favorable performance against the state-of-the-art methods. The code and models will be publicly released at https://github.com/Victor-wjw/RCoT-Seg.
[79] ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs cs.CVPDF
Ziheng Zhou, Yang Wang, Nan Wang, Chengliang Wu, Jun Yan
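TL;DR: This paper builds ShellfishNet, a domain-specific benchmark dataset for visual recognition of marine molluscs. It contains 8,691 images covering 32 taxa, with descriptive annotations. On this dataset the authors systematically evaluate 80 representative neural network models (including CNNs, ViTs, SSMs, and SSL methods), fine-grained visual categorization models, and the image captioning ability of multimodal large language models. They also introduce an image corruption benchmark that simulates underwater degradation to assess model robustness.
Details
Motivation: Declining global shellfish biodiversity seriously threatens coastal ecosystems. Existing marine benthic datasets often fail to reflect the complexity of real underwater environments (for example, variable lighting and diverse species postures), so vision models generalize poorly in practical ecological monitoring.
Result: On the ShellfishNet benchmark, the authors systematically evaluate 80 models, including CNNs, ViTs, SSMs, and SSL methods, test fine-grained visual categorization models and the captioning performance of multimodal large language models, and assess robustness with an image corruption benchmark that simulates underwater degradation.
Insight: The contribution is a comprehensive image benchmark designed around the constraints of real-world ecological monitoring, together with a systematic evaluation of many advanced vision models on this domain-specific task. The image corruption tests in particular simulate underwater conditions, providing a data foundation and an evaluation benchmark for intelligent monitoring of benthic organisms.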
Abstract: The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.
[80] SoLAR: Error-Resilient Streamable Long-Horizon Free-Viewpoint Video Reconstruction with Anchor Activation and Latent Recalibration cs.CVPDF
Haotian Zhang, Xu Mo, Yixin Yu, Guanhua Zhu, Jian Xue
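TL;DR: This paper proposes SoLAR, an error-resilient, streamable reconstruction framework for long-horizon free-viewpoint video (LFVV). Using dynamic anchor activation and latent representation recalibration, it achieves stable, high-quality reconstruction of long video sequences without group-of-pictures partitioning while keeping storage overhead minimal.
Details
Motivation: Existing free-viewpoint video methods mainly target short sequences, and their performance degrades significantly on long-horizon free-viewpoint video. The paper aims to solve this degradation in long-sequence reconstruction while providing streamability and error resilience.
Result: Extensive experiments show that SoLAR achieves state-of-the-art performance on long-horizon free-viewpoint video reconstruction while keeping storage overhead minimal.
Insight: The core contributions are: 1. an analysis of dynamic anchor representations within a rate-distortion optimization framework, leading to a stable reconstruction framework that needs no group-of-pictures partitioning; 2. Anchor Activation Dynamics, which models non-rigid transformations by dynamically activating informative anchors and suppressing redundant ones; 3. Latent Discrepancy Aware Recalibration, which identifies discrepancies between latent representations and recalibrates the correspondences encoded in the network, effectively mitigating error propagation in long sequences without affecting real-time performance or storage compactness.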
Abstract: Free-Viewpoint Video (FVV) has emerged as a cornerstone of next-generation immersive media systems and attracted widespread attention. Previous methods primarily focus on short video sequences and suffer from significant performance degradation when processing long-horizon free-viewpoint video (LFVV). Motivated by bit allocation theory, we analyze dynamic-anchor-based volumetric video representation within a rate-distortion optimization framework and propose SoLAR, which is the first error-resilient streamable FVV framework that maintains stable reconstruction quality on long sequences without requiring group-of-pictures partitioning. We propose the Anchor Activation Dynamics (AAD), which enables dynamic anchors to model non-rigid transformations by dynamically activating informative anchors and suppressing redundant ones. Furthermore, we introduce Latent Discrepancy Aware Recalibration (LaDAR), which is a mechanism to identify discrepancies between latent representations and recalibrate the correspondences encoded in the network, effectively mitigating error propagation in LFVV without compromising real-time performance or storage compactness. Extensive experiments demonstrate that SoLAR achieves state-of-the-art reconstruction performance while maintaining minimum storage overhead, which provides a new direction for LFVV reconstruction and advances the practical deployment of immersive systems. Demo free-viewpoint videos are provided in the supplementary material.
[81] TTF: Temporal Token Fusion for Efficient Video-Language Model cs.CV | cs.AIPDF
Simin Huo, Ning LI
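TL;DR: This paper proposes Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework for video-language models. It exploits structured temporal redundancy in video by automatically selecting an anchor frame, running a local-window similarity search, and fusing tokens whose similarity exceeds a threshold, which substantially reduces the number of visual tokens while preserving accuracy.
Details
Motivation: The inference cost of video-language models grows quickly with video length, and the large number of visual tokens makes LLM prefill the throughput bottleneck. Existing methods rely on global similarity or attention-guided compression and lose performance.
Result: On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of baseline accuracy, adding only about 0.16 GFLOPs of matching overhead.
Insight: The contribution is to exploit temporal redundancy in video through local-window similarity search and token fusion, achieving efficient compression without training. Coordinate realignment keeps positions consistent, so the method integrates with existing VLM pipelines.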
Abstract: Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at 448×448 resolution already yield >8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame, performs a local window similarity search (e.g., 3×3), fusing tokens that exceed a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only ≈0.16 GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at https://github.com/Cominder/ttf
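The local-window search described above can be sketched roughly as follows. This is a simplified illustration, not the released TTF code: it assumes cosine similarity, treats "fusing" as simply dropping the matched token, and keeps original grid coordinates for surviving tokens. The names `cosine` and `fuse_frame` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_frame(anchor, frame, threshold=0.70, radius=1):
    """Return the tokens of `frame` that are NOT fused into `anchor`.

    anchor, frame : H x W grids (lists of rows) of token vectors
    radius=1 gives the 3x3 search window mentioned in the abstract.
    Assumption: a token is fused (dropped here) when its best match in
    the anchor window exceeds `threshold`.
    """
    h, w = len(anchor), len(anchor[0])
    kept = []
    for r in range(h):
        for c in range(w):
            best = -1.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        best = max(best, cosine(frame[r][c], anchor[rr][cc]))
            if best <= threshold:
                # Keep the original grid position alongside the token.
                kept.append(((r, c), frame[r][c]))
    return kept


anchor = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
frame = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
print(fuse_frame(anchor, frame))  # only the changed token at (1, 1) survives
```

Restricting the search to a small window keeps matching cost linear in the number of tokens, in contrast to a global all-pairs similarity.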
[82] UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition cs.CVPDF
Shuai Zhang, Zhecheng Shi, Zhuxiao Li, Jing Ou, Tengxi Wang
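TL;DR: This paper proposes UniD-Shift, a unified multimodal framework for joint 2D-3D semantic segmentation. Through an interpretable share-private multimodal decomposition, it explicitly splits image and point cloud features into a shared semantic subspace and private modality-specific subspaces, and aggregates the shared features with a lightweight attention fusion module to achieve stable cross-modal alignment and fusion.
Details
Motivation: The sparse sampling of LiDAR point clouds and the view-dependent geometric distortion in image observations make cross-modal alignment difficult and hinder stable fusion. Noting that 2D images are representations of the 3D world, the authors observe that features learned for 2D and 3D segmentation share some semantics while retaining modality-specific properties, which motivates a unified multimodal segmentation framework.
Result: On the SemanticKITTI and nuScenes benchmarks, the method consistently improves segmentation accuracy over representative multimodal baselines while remaining computationally competitive. In cross-domain evaluation on nuScenes USA-Singapore, it performs stably under distribution shift, indicating strong generalization.
Insight: The core contribution is a unified framework that explicitly decomposes multimodal features into shared and private subspaces, with a regularized training objective that enforces semantic alignment and subspace independence. This yields an interpretable approach to multimodal representation learning that improves cross-modal consistency and generalization.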
Abstract: Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.
[83] UniISP: A Unified ISP Framework for Both Human and Machine Vision cs.CVPDF
Hanxi Li, Yao Cheng, Bo Zhang, Li Zeng
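TL;DR: This paper proposes UniISP, a unified image signal processing framework intended to serve both human visual perception and computer vision applications. By introducing a hybrid attention module and a feature adapter, the method generates visually appealing images while preserving the informative features that matter for downstream tasks.
Details
Motivation: Traditional ISP pipelines generate pleasing RGB images for human viewing but may lose information, while existing computer vision methods that use raw data directly often neglect visual quality. The paper aims to resolve this tension with a unified framework that serves both needs.
Result: Extensive experiments across various scenarios and multiple datasets show that the method achieves state-of-the-art performance, demonstrating its generalizability and effectiveness.
Insight: The contribution is a unified ISP framework that improves visual quality with a hybrid attention module and propagates informative features to downstream networks through a feature adapter module, optimizing for human perception and machine analysis at the same time.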
Abstract: Compared to RGB images, raw sensor data provides a richer representation of information, which is crucial for accurate recognition, particularly under challenging conditions such as low-light environments. The traditional Image Signal Processing (ISP) pipeline generates visually pleasing RGB images for human perception through a series of steps, but some of these operations may adversely impact the information integrity by introducing compression and loss. Furthermore, in computer vision tasks that directly utilize raw camera data, most existing methods integrate minimal ISP processing with downstream networks, yet the resulting images are often difficult to visualize or do not align with human aesthetic preferences. This paper proposes UniISP, a novel ISP framework designed to simultaneously meet the requirements of both human visual perception and computer vision applications. By incorporating a carefully designed Hybrid Attention Module (HAM) and employing supervised learning, the proposed method ensures that the generated images are visually appealing. Additionally, a Feature Adapter module is introduced to effectively propagate informative features from the ISP stage to subsequent downstream networks. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across various scenarios and multiple datasets, proving its generalizability and effectiveness.
[84] RELO: Reinforcement Learning to Localize for Visual Object Tracking cs.CV | cs.AIPDF
Xin Chen, Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang
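TL;DR: The paper proposes RELO, which formulates target localization in visual object tracking as a Markov decision process and uses reinforcement learning to directly optimize evaluation metrics such as IoU and AUC, replacing localization based on handcrafted spatial priors (such as heatmaps). It also introduces layer-aligned temporal token propagation to improve semantic consistency across frames.
Details
Motivation: Conventional trackers localize targets with handcrafted spatial priors (such as heatmaps). These priors provide only indirect supervision and are misaligned with tracking's optimization objectives and evaluation metrics (such as IoU and AUC), which limits performance.
Result: RELO achieves superior results on multiple benchmarks, reaching 57.5% AUC on the LaSOText benchmark without template updates.
Insight: The core contribution is to learn localization actions directly with a reinforcement learning policy whose reward combines frame-level IoU and sequence-level AUC, aligning optimization with the final evaluation metrics. The layer-aligned temporal token propagation mechanism improves inter-frame semantic consistency at negligible computational cost, offering a reward-driven paradigm for tracking localization.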
Abstract: Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
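The two metrics named in the reward can be illustrated with a short sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and the widely used success-plot convention of 21 overlap thresholds from 0 to 1; the way the two terms are combined in `combined_reward` (and its weight `lam`) is a hypothetical choice, not the paper's reward.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_auc(ious, num_thresholds=21):
    """Area under the success curve: mean fraction of frames whose IoU
    exceeds each threshold in an evenly spaced grid over [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


def combined_reward(frame_iou, sequence_ious, lam=1.0):
    """Hypothetical reward: frame-level IoU plus weighted sequence-level AUC."""
    return frame_iou + lam * success_auc(sequence_ious)


pred = [(0, 0, 2, 2), (0, 0, 2, 2)]
gt = [(1, 1, 3, 3), (0, 0, 2, 2)]
per_frame = [iou(p, g) for p, g in zip(pred, gt)]
print(per_frame, success_auc(per_frame))
```

Because both terms are computed from predicted boxes rather than from a heatmap target, a policy trained on such a reward is optimized against the same quantities used at evaluation time.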
[85] ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation cs.CVPDF
Haonan Wang, Hanyu Zhou, Tao Gu, Luxin Yan
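TL;DR: This paper proposes ST-Gen4D, a 4D generation framework built on a world model with 4D spatiotemporal cognition. Through four key designs (spatiotemporal representation, cognition, reasoning, and generation), it fuses global appearance structure with local dynamic topology to ensure structural rationality and topological consistency in 4D generation, and it demonstrates superior results on 3D and 4D generation tasks using the newly assembled ST-4D dataset.
Details
Motivation: Existing 4D generative models usually embed macro-scale constraints directly to improve overall spatiotemporal consistency, but this only guarantees global appearance coherence and does not capture the local dynamics of the physical world. The paper aims to achieve 4D generation with spatiotemporal regularities through 4D spatiotemporal cognition that fuses global appearance with local dynamics.
Result: Extensive experiments on the ST-4D dataset, which aggregates public 4D datasets and a self-built subset, show that ST-Gen4D is superior on both 3D and 4D generation tasks.
Insight: The contribution is to integrate 4D spatiotemporal cognition deeply into generative priors, by fusing a global appearance graph and a local dynamic graph into a 4D cognition graph, and to propose a four-stage world model framework covering spatiotemporal representation, cognition, reasoning, and generation. This ensures both global structural rationality and local topological consistency in the generated results.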
Abstract: Generative models have succeeded in producing apparently coherent 2D videos, but generation that is faithful to the physical world remains challenging because these models lack 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro-scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with a world model based on 4D spatiotemporal cognition. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpt these representations into a global appearance graph and a local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive the future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as a condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose the ST-4D dataset by aggregating public 4D datasets and a self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.
[86] BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning cs.CV | cs.AIPDF
Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo
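TL;DR: This paper proposes BalCapRL, a balanced reinforcement learning framework for optimizing image captioning with multimodal large language models (MLLMs). By jointly optimizing utility-aware correctness, reference coverage, and linguistic quality, it addresses the imbalance that arises when existing RL methods pursue a single metric (for example, overly detailed captions that hallucinate, or overly generic captions with little practical value).
Details
Motivation: Existing RL-based captioning methods and their evaluation metrics tend to emphasize one narrow aspect of caption quality. Pursuing utility can produce overlong, hallucinated, or noisy captions, while pursuing arena-style scores can favor fluent but generic captions. This creates trade-offs among the core dimensions of captioning, such as correctness, coverage, and fluency. The paper aims to resolve this imbalance.
Result: On the LLaVA-1.5-7B and Qwen2.5-VL 3B/7B base models, the method consistently improves caption quality, with peak gains across models of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena.
Insight: The main contributions are: 1) a balanced multi-objective reward formulation that jointly optimizes utility, coverage, and linguistic quality; 2) applying GDPO-style reward-decoupled normalization to continuous-valued captioning rewards, which improves on vanilla GRPO; 3) length-conditional reward masking, which gives a more suitable length penalty for captioning. Viewed objectively, the central contribution is to define several core dimensions of caption quality systematically and design a joint optimization framework for them, rather than optimizing a single metric.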
Abstract: Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
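The contrast between summed and decoupled reward normalization can be sketched as follows. This is one plausible reading of "reward-decoupled normalization" (z-scoring each reward component within a sampled group before summing), offered as an assumption rather than the paper's exact procedure; the function names are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a list of numbers within a group."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def summed_advantages(rewards):
    """GRPO-like baseline: sum reward components, then normalize the totals.

    rewards: list of per-sample tuples, one entry per reward component.
    """
    return zscore([sum(r) for r in rewards])


def decoupled_advantages(rewards):
    """Assumed decoupled variant: normalize each component across the
    group separately, then sum the normalized components per sample."""
    columns = [list(col) for col in zip(*rewards)]
    normalized = [zscore(col) for col in columns]
    return [sum(parts) for parts in zip(*normalized)]


# Two sampled captions, two reward components on very different scales.
group = [(0.0, 10.0), (1.0, 0.0)]
print(summed_advantages(group))     # the large-scale component dominates
print(decoupled_advantages(group))  # both components count equally
```

In the example, the first caption wins under summed normalization only because the second component has a larger numeric range, whereas the decoupled version treats the two captions as tied.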
[87] Exposing and Mitigating Temporal Attack in Deepfake Video Detection cs.CV | cs.AIPDF
Zheyuan Gu, Minghao Shao, Zhen Wang, Yusong Wang, Mingkun Xu
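TL;DR: The paper shows that current spatiotemporal deepfake detectors, despite high AUC, are vulnerable to evasion attacks because they rely too heavily on fragile temporal spectrum cues instead of robust semantic causality. It proposes the SpInShield defense framework, in which a learnable spectral adversary dynamically synthesizes severe spectral deformations to simulate extreme attacks, and a shortcut suppression optimization strategy forces the encoder to extract reliable forensic cues and purge unstable spectral statistics from the latent space.
Details
Motivation: Existing deepfake video detectors overfit to fragile temporal spectrum cues, which leaves them open to evasion attacks. The paper aims to make detection models robust by grounding them in semantic causality.
Result: The method achieves competitive performance on widely used datasets and, under simulated amplitude spectral attacks, exceeds the strongest baseline by 21.30 percentage points in AUC.
Insight: The contribution is a temporal spectral-invariant defense framework (SpInShield). Adversarial training with dynamically synthesized spectral deformations, combined with shortcut suppression optimization, forces the model to learn robust semantic motion features rather than manipulable spectral artifacts, strengthening its defense against evasion attacks.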
Abstract: While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.
[88] GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization cs.CVPDF
Yu Pan, Andi Zhang, Yi Wang, Sibei Yang, Wenjie Wang
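TL;DR: This paper proposes GPO-V, a new jailbreak attack on diffusion vision-language models (dVLMs). The study finds that although dVLMs resist conventional Fixed Prefix Optimization (FPO) attacks, their distinctive progressive refusal pattern exposes a new attack surface. Using the Global Probability Optimization (GPO) paradigm, GPO-V manipulates the global generative dynamics along the denoising trajectory to produce stealthy perturbations with high cross-model transferability, bypassing the safety guardrails of dVLMs.
Details
Motivation: Because of their non-autoregressive generation paradigm, diffusion vision-language models (dVLMs) are believed to be inherently robust to conventional prefix-anchoring jailbreak attacks. The authors find that this robustness is illusory: the distinctive refusal patterns of dVLMs (immediate refusal and progressive refusal) expose a new latent attack surface, which motivates studying how this vulnerability can be exploited.
Result: Empirical results show that GPO-V generates stealthy perturbations with exceptional cross-model transferability. The attack reveals a critical security gap in non-sequential generative architectures (that is, diffusion models) and bypasses existing guardrails.
Insight: The main contributions are: 1) the first identification and categorization of the distinctive refusal patterns of dVLMs (immediate versus progressive refusal); 2) Global Probability Optimization (GPO), a general jailbreak paradigm designed for the denoising trajectory of masked diffusion models, which bypasses guardrails by manipulating global generative dynamics rather than a local prefix; 3) GPO-V, the first visual-modality jailbreak framework for dVLMs. Viewed objectively, the study highlights the urgency of safety alignment for diffusion models and indicates that current defense paradigms need fundamental re-evaluation.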
Abstract: Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with “Sure, here is”), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.
[89] ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring cs.CV | cs.CLPDF
Tianhao Niu, Ziyu Han, Qingfu Zhu, Wanxiang Che
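TL;DR: This paper proposes the ChartREG++ benchmark to address limitations of existing chart referring expression grounding tasks, including low localization precision, support for only one or two targets, a narrow set of cues, and limited chart type coverage. The benchmark supports multiple localization forms, multi-target references, diverse cues, and diverse chart types. The authors also build a code-driven synthesis pipeline that generates pixel-accurate instance masks, train an instance segmentation model on them, and integrate it into a general-purpose multimodal grounding framework, which improves performance significantly.
Details
Motivation: Existing chart referring expression grounding benchmarks have low localization precision (they mainly use bounding boxes), support only one or two targets, over-rely on textual or data-rank cues, and cover few chart types, all of which hold back progress on the task.
Result: On the proposed ChartREG++ benchmark, representative multimodal large models show a significant performance gap. The proposed system consistently outperforms baselines on the benchmark and generalizes well to a real-chart grounding benchmark derived from ChartQA.
Insight: The contributions are: 1) a systematic chart referring expression grounding benchmark that supports multiple localization forms, multiple targets, diverse cues, and diverse chart types; 2) a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart elements to generate pixel-accurate instance masks, improving localization precision; 3) integration of an instance segmentation model into a general-purpose multimodal grounding framework, yielding better performance and good generalization.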
Abstract: Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing benchmarks related to chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
[90] Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs cs.CV | cs.AI | cs.CL | cs.LGPDF
Hao Wang, Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh, Daisuke Kawahara
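TL;DR: This paper proposes SAEgis, a lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), to improve the safety of vision-language models (VLMs). An SAE module is inserted into a pretrained VLM and trained for reconstruction. The learned sparse latent features capture attack-relevant signals, allowing reliable detection of whether an input image has been adversarially perturbed without any additional adversarial training.
Details
Motivation: Vision-language models (VLMs) are increasingly deployed in real applications, but their safety has received relatively little study. Existing models are highly vulnerable to adversarial attacks, exposing downstream applications to significant risk.
Result: Extensive experiments show that SAEgis performs strongly in in-domain, cross-domain, and cross-attack settings, with particularly large gains over existing baselines in cross-domain generalization. Combining signals from multiple layers further improves robustness and stability.
Insight: The contribution is the first use of sparse autoencoders (SAEs) as a plug-and-play mechanism for adversarial attack detection in VLMs. It requires no adversarial training and adds minimal overhead, offering a practical way to protect real-world VLM systems.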
Abstract: Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
[91] EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement cs.CVPDF
Zitong Xu, Huiyu Duan, Yifei Nie, Mingda Du, Sijing Wu
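TL;DR: This paper proposes EditRefiner, a human-aligned agentic framework for image editing refinement. It builds EditFHF-15K, a dataset of fine-grained human feedback, and designs a hierarchical loop of four agents (perception, reasoning, action, and evaluation) to detect and correct the fine-grained artifacts and editing failures produced by text-guided image editing models.
Details
Motivation: Images produced by existing text-guided image editing models often have fine-grained problems such as unnatural objects and lighting mismatch. Existing refinement methods either rely on costly iterative regeneration or use vision-language models with weak spatial grounding, which tends to cause semantic drift and unreliable local corrections.
Result: Extensive experiments show that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnostic accuracy, and alignment with human perception, establishing a new paradigm for self-corrective and perceptually reliable image editing.
Insight: The main contributions are: 1) EditFHF-15K, a large-scale dataset of fine-grained human feedback; 2) a hierarchical, interpretable, human-aligned agentic framework that models post-editing correction as a human-like perception-reasoning-action-evaluation loop; 3) a division of labor among the four agents (perception, reasoning, action, evaluation) that enables reliable localized re-editing and iterative refinement.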
Abstract: Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnostic accuracy, and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.
[92] A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images cs.CVPDF
Ioan-Tudor-Alexandru Anghel, Ciprian-Mihai Ceausescu, Elena Dana Nedelcu, Elena Raluca Stirban, Camelia Croitoru
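TL;DR: This paper proposes a unified end-to-end framework for detecting and classifying fatty pancreas in ultrasound images. It uses a TransUNet-based segmentation architecture to delineate the pancreas and the splenic vein, followed by anatomically guided patch extraction and patient-level classification based on pairwise texture comparison, mimicking clinical reasoning.
Details
Motivation: Non-alcoholic fatty pancreas disease (NAFPD) is underdiagnosed, and diagnosis currently depends on clinicians' subjective visual assessment of ultrasound images, so an automatic, objective classification method is needed.
Result: On a clinical dataset of 214 abdominal ultrasound images (107 expert-labeled cases) with 5-fold cross-validation, an SVM with RBF kernel reaches a mean accuracy of 89.7% ± 1.8% and an F1 score of 0.898 ± 0.019; the unsupervised K-Means baseline reaches 87.8% accuracy.
Insight: The contribution is the first end-to-end automated framework for fatty pancreas classification from ultrasound images. Its core is segmentation-guided texture analysis with interpretable feature engineering based on anatomy (peri-venous fat around the splenic vein versus pancreatic parenchyma), with the models initialized through domain-specific transfer learning from a liver segmentation task.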
Abstract: Non-alcoholic fatty pancreas disease (NAFPD) is an underdiagnosed condition associated with metabolic syndrome, insulin resistance, and increased risk of pancreatic cancer. Diagnosis typically relies on subjective visual assessment of ultrasound images by clinicians. We propose an end-to-end framework for automatically classifying normal versus fatty pancreas from abdominal ultrasound images. Our method employs a TransUNet-based segmentation architecture with a ResNet encoder and transformer bottleneck to delineate the pancreas and the splenic vein, followed by anatomically-guided patch extraction and patient-level classification through pairwise texture comparison. The feature engineering mimics clinical reasoning by comparing the echogenicity of peri-venous fat to the pancreatic parenchyma, providing an interpretable signal for classification. The segmentation models are initialized via domain-specific transfer learning from a liver segmentation task. We validate the full pipeline on a clinical dataset of 214 abdominal ultrasound images with 107 expert-labeled cases using 5-fold cross-validation. SVM with RBF kernel achieves a mean cross-validated accuracy of 89.7% $\pm$ 1.8% and F1 of 0.898 $\pm$ 0.019, while the unsupervised K-Means baseline reaches 87.8% accuracy, demonstrating that the proposed features capture the relevant clinical signal even without labeled training data. To our knowledge, this is the first end-to-end automated framework for fatty pancreas classification from ultrasound using segmentation-guided texture analysis.
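As a rough illustration of the pairwise comparison described in this abstract, the sketch below computes one echogenicity ratio between a pancreatic parenchyma patch and a peri-venous fat patch. The function names, the use of mean gray level as a stand-in for echogenicity, and the toy patches are all assumptions made for illustration; the paper's actual texture features are not given here.

```python
from statistics import mean

def mean_intensity(patch):
    """Mean gray level of a 2D patch given as a list of rows."""
    return mean(pixel for row in patch for pixel in row)

def echogenicity_ratio(pancreas_patch, fat_patch, eps=1e-6):
    """Pancreas brightness divided by peri-venous fat brightness (hypothetical feature)."""
    return mean_intensity(pancreas_patch) / (mean_intensity(fat_patch) + eps)

# Toy 2x2 patches with 8-bit gray levels (made-up values).
fat = [[200, 210], [190, 200]]
dark_pancreas = [[100, 110], [90, 100]]
bright_pancreas = [[195, 205], [185, 195]]

ratio_dark = echogenicity_ratio(dark_pancreas, fat)
ratio_bright = echogenicity_ratio(bright_pancreas, fat)
```

A ratio close to 1 means the pancreas patch looks about as bright as the fat patch. In a pipeline like the one described, a value of this kind would be one input to the SVM or K-Means step, not a diagnosis on its own.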
[93] ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations cs.CV | cs.AIPDF
Yuhao Zhou, Yunpeng Zhu, Yang Zhou, Jindi Lyu, Jian Lan
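TL;DR: This paper proposes ForgeVLA, a federated vision-language-action (VLA) learning framework that requires no language annotations. It uses the vision-action pairs produced by distributed robots, recovers the missing language modality with a client-side instruction classifier, and combines a contrastive planning loss with an adaptive aggregation strategy to mitigate feature collapse, so that general-purpose robotic VLA models can be trained efficiently.
Details
Motivation: Training large VLA models is limited by the high cost of annotated data. Raw data from distributed robots cannot be centralized because of constraints such as privacy and bandwidth, and it is also severely heterogeneous.
Result: Extensive experiments on multiple benchmarks show that ForgeVLA significantly outperforms the baselines, and ablation studies further validate each component.
Insight: The novelty is a federated VLA training framework that automatically builds VLA triplets with a client-side instruction classifier. The paper is also the first to identify and mitigate vision-language feature collapse in federated VLA, using a client-side contrastive planning loss combined with a server-side adaptive aggregation strategy.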
Abstract: Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.
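For context on the server-side aggregation step, the sketch below shows plain sample-size-weighted federated averaging (FedAvg). This is the generic baseline, not ForgeVLA's adaptive aggregation strategy, whose details are not given in this abstract; the function name and toy values are illustrative.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (plain FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients with 1 and 3 local samples (made-up parameters).
global_weights = fedavg([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

An adaptive scheme would replace the fixed sample-count weights with weights computed from some per-client signal.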
[94] ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning cs.CVPDF
Honghua Chen, Zitong Xu, Huiyu Duan, Xinyun Zhang, Xiongkuo Min
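TL;DR: This paper proposes ReasonEdit, a reinforcement-learning-based framework for interpretable image editing evaluation. It comprises ReasonEdit-22K, described as the first large-scale dataset of its kind (22K edited images and 113K chain-of-thought samples), RE-Reward, a reward model built on a multimodal large language model, and an evaluation model trained with the GRPO algorithm. The goal is to address the lack of interpretability in existing image editing evaluation methods.
Details
Motivation: Existing evaluation methods for text-guided image editing (TIE) models mostly rely on scalar scores and lack interpretability, largely because there are no high-quality interpretation datasets or effective reward models for training interpretable evaluators.
Result: On public benchmarks, ReasonEdit aligns well with human preferences, generalizes strongly, and generates high-quality interpretable evaluation text.
Insight: The novelty includes building the first large-scale interpretable image editing evaluation dataset that incorporates chain-of-thought samples, and using an MLLM-based reward model together with GRPO to train an interpretable evaluator, giving a more transparent and trustworthy way to assess image editing.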
Abstract: Recent text-guided image editing (TIE) models have achieved remarkable progress; however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.
[95] AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models cs.CVPDF
Kai Zheng, Zejian Kang, Rui Mao, Hongyuan Zou, Yuanchen Fei
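TL;DR: This paper proposes AudioFace, a language-assisted framework for speech-driven facial animation. It draws on the prior knowledge of multimodal large language models and uses transcripts and phoneme-level cues to link speech signals to interpretable facial actions, producing more accurate facial blendshapes.
Details
Motivation: Existing speech-driven facial animation methods map audio directly to facial coefficients and ignore the linguistic and phonetic structure of speech production, which leads to inaccurate mouth movements.
Result: AudioFace achieves superior performance across multiple evaluation metrics, validating the language-assisted, multimodal-prior-guided approach.
Insight: The novelty is treating the prediction of mouth-related facial coefficients as a structured generation problem guided by linguistic and articulatory information, using the prior knowledge of multimodal large language models and introducing transcript- and phoneme-level cues to bridge speech signals and facial actions.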
Abstract: Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.
[96] How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings cs.CVPDF
Zhiheng Li, Zongyang Ma, Jiaxian Chen, Jianing Zhang, Zhaolong Su
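TL;DR: This paper presents PureDocBench, a programmatically generated, source-traceable document parsing benchmark with clean, digitally degraded, and real-degraded versions, 4,425 images in total. An evaluation of 40 models shows that document parsing is far from solved: the best model scores only about 74 out of 100. Specialist parsers with fewer parameters rival large general-purpose vision-language models, but formula recognition remains a shared bottleneck.
Details
Motivation: The existing benchmark OmniDocBench has annotation errors and contamination risk, which calls the reliability of its rankings into question, so a new high-quality benchmark is needed to evaluate document parsing models more accurately.
Result: Among the 40 models evaluated on PureDocBench, the best scores about 74 out of 100 overall, and the gap between the strongest and weakest models is 44.6 points. Specialist parsers (<=4B parameters) match or beat general-purpose VLMs that are 5-100x larger, but no model exceeds 67% on the formula metric averaged across the three tracks. General-purpose VLMs lose relatively little under digital/real degradation (0.99/8.52 points), while pipeline specialists lose more (4.90/14.21 points), which makes clean-only evaluation misleading.
Insight: The novelty is a programmatically generated, source-traceable benchmark covering multiple degradation scenarios, which addresses the annotation quality and contamination problems of existing benchmarks. Viewed objectively, the work underlines the importance of benchmark quality and exposes performance bottlenecks on degraded data and on specific tasks such as formula recognition, which is useful guidance for deployment.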
Abstract: The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists, and general-purpose VLMs, we find: (i) document parsing is far from solved: the best model scores only ~74 out of 100, with a 44.6-point gap between the strongest and weakest models; (ii) specialist parsers with <=4B parameters rival or surpass general VLMs that are 5-100x larger, yet formula recognition remains a shared bottleneck where no model exceeds 67% when averaging the formula metric across all three tracks; (iii) general VLMs lose only 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21 for pipeline specialists, producing ranking reversals that make clean-only evaluation misleading for deployment. All data, code, and artifacts are publicly released.
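The figures quoted in this abstract can be checked with a little arithmetic, and the same numbers show why rankings can flip under degradation. The sketch below assumes, as a simplification, that the reported average drops apply uniformly to individual models; the variable and function names are illustrative.

```python
# Figures quoted in the abstract.
audited_blocks = 21_353
confirmed_errors = 2_580
pages = 1_475
versions = 3  # clean, digitally degraded, real-degraded

error_rate_pct = round(100 * confirmed_errors / audited_blocks, 2)
total_images = pages * versions

# Average Overall-point drops under (digital, real) degradation.
general_vlm_drop = (0.99, 8.52)
pipeline_drop = (4.90, 14.21)

def pipeline_lead_needed(track):
    """Clean-score lead a pipeline specialist needs to stay ahead on a degraded track."""
    return round(pipeline_drop[track] - general_vlm_drop[track], 2)

lead_digital = pipeline_lead_needed(0)
lead_real = pipeline_lead_needed(1)
```

Under this simplification, a pipeline specialist that leads a general VLM by fewer than about 3.91 points on clean pages would fall behind on the digitally degraded track, and a lead below about 5.69 points would be lost on the real-degraded track.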
[97] DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models cs.CVPDF
Mengxin Qin, Xiang Zhang, Xi Wang, Kun Wei, Xu Yang
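TL;DR: This paper proposes DIMoE-Adapters, a dynamic incremental mixture-of-experts adapter framework for continual learning in vision-language models. Two collaborative components, Self-Calibrated Expert Evolution and Prototype-Guided Expert Selection, manage the expert pool dynamically to balance adaptation to new tasks (plasticity) against retention of old tasks (stability), addressing the stability-plasticity dilemma in multi-domain task-incremental learning.
Details
Motivation: In multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods use fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting.
Result: Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.
Insight: The novelty is a dynamic expert evolution paradigm in which dynamic construction and evolution of the expert pool (SCEE) works together with prototype-based expert selection (PGES) to balance stability and plasticity adaptively, avoiding the limitations of a fixed architecture.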
Abstract: Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting. To address these challenges, we propose DIMoE-Adapters, a Dynamic Incremental Mixture-of-Experts Adapters framework that introduces a dynamic expert evolution paradigm to balance stability and plasticity. This paradigm is implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE constructs and evolves a sparse expert pool through expert optimization dynamics, improving plasticity while reducing redundant capacity. PGES controls expert utilization based on the pool shaped by SCEE, improving stability across both previously encountered and unseen tasks. Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.
[98] Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers cs.CVPDF
Jingyuan Zhu, Biaolong Chen, Le Zhang, Aixi Zhang, Hao Jiang
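TL;DR: This paper proposes Diffusion-APO, a trajectory-aware direct preference optimization algorithm for aligning large-scale video diffusion models with human intent. It addresses the distribution gap between training and inference by synchronizing training noise with inference-time denoising paths, and introduces a unified, modular RLHF framework that supports online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction for flexible multi-stage preference alignment.
Details
Motivation: When used to align video diffusion models efficiently, existing methods such as DPO and GRPO either rely on bias-prone, complex reward models or suffer from suboptimal timestep sampling, and so fail to bridge the inherent gap between the training noise distribution and inference trajectories.
Result: Extensive experiments show that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, and preserves generative fidelity during model acceleration, offering a robust end-to-end solution for scalable video diffusion alignment.
Insight: The core novelty is a trajectory-aware optimization algorithm that synchronizes training with inference paths to maximize the efficacy of the gradient signal. The paper also designs a unified, modular RLHF framework that does not rely on scalar-reward policy gradients and supports flexible multi-stage preference alignment under different data and compute constraints.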
Abstract: Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.
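As background for the DPO paradigm named in this abstract, the sketch below computes the standard DPO loss for a single preferred/rejected pair from log-likelihoods under the policy and a frozen reference model. It is not Diffusion-APO; for diffusion models the log-likelihood terms are typically replaced by surrogate quantities, which is not shown. The argument names and the value of `beta` are illustrative.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: loss equals log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy raises the preferred sample and lowers the rejected one: loss decreases.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on how much the policy shifts probability toward the preferred sample relative to the reference, which is why no explicit reward model appears.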
[99] InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search cs.CV | cs.CL | cs.IRPDF
Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng
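TL;DR: This paper presents InterLV-Search, a benchmark for interleaved language-vision agentic search, in which textual and visual evidence is repeatedly used to guide later search. The benchmark contains 2,061 examples across three difficulty levels and includes multimodal multi-branch samples that involve comparing multiple entities. The authors also provide InterLV-Agent, a standardized evaluation tool. Experiments show that current multimodal agents perform poorly on the task, with the best model below 50% overall accuracy.
Details
Motivation: Existing benchmarks mainly evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint, not as part of an interleaved search trajectory. The paper aims to fill this gap with a more realistic benchmark that requires agents to interleave multimodal evidence dynamically during search.
Result: Experiments on proprietary and open-source multimodal agents show that current systems are far from solving interleaved multimodal search. The best model stays below 50% overall accuracy, which highlights challenges in visual evidence seeking, search control, and multimodal evidence integration.
Insight: The main novelty is an interleaved multimodal search benchmark with three progressively harder levels, including multimodal multi-branch samples. Viewed objectively, its construction method (an open-web pipeline combining automation with human supervision) and its standardized evaluation toolkit give future research a systematic and challenging evaluation framework.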
Abstract: Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench
[100] Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models cs.CVPDF
Mengxin Qin, Xiang Zhang, Kun Wei, Xu Yang, Cheng Deng
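TL;DR: This paper proposes HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. It decomposes the parameter space into general and task-specific subspaces and introduces a General Fusion Module and a Hierarchical Learning Module to reduce subspace interference and parameter drift, mitigating catastrophic forgetting.
Details
Motivation: Existing methods mainly restrict parameter updates but overlook their structural properties in high-dimensional space. Updates induced by different tasks lie in multiple overlapping low-rank subspaces, which causes cross-task subspace interference and severe forgetting.
Result: Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art (SOTA) results.
Insight: The novelty is taking a subspace perspective: a lightweight Feature Modulation Module explicitly decomposes the parameter space, the General Fusion Module uses an adaptive threshold to capture stable and transferable knowledge, and the Hierarchical Learning Module uses singular value decomposition and a scaling mechanism to constrain updates within distinct subspace scales, which effectively decouples task interference.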
Abstract: Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.
[101] Implicit Preference Alignment for Human Image Animation cs.CV | cs.AIPDF
Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu
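TL;DR: This paper proposes Implicit Preference Alignment (IPA), a data-efficient post-training framework for improving hand motion quality in human image animation. Grounded in implicit reward maximization, it aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviation from the pretrained prior, with no need to build strict preference pairs. A Hand-Aware Local Optimization mechanism explicitly steers the alignment toward hand regions.
Details
Motivation: Generating high-fidelity hand motion in human image animation remains a persistent challenge because hands have very high degrees of freedom and complex motion. Reinforcement learning from human feedback, such as direct preference optimization, offers a potential solution but requires strict preference pairs. For dynamic hand regions, frame-wise inconsistencies make such pairs prohibitively expensive and often impractical to build.
Result: Experiments show that the method achieves effective preference optimization and improves hand generation quality, while significantly lowering the barrier to constructing preference data.
Insight: The main novelties are the IPA framework, which needs no paired preference data, and the Hand-Aware Local Optimization mechanism. Viewed objectively, the method is grounded in implicit reward maximization and aligns the model with self-generated high-quality samples, offering an efficient post-training approach for fine-grained generation tasks, such as hand animation, where data is scarce or annotation is costly.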
Abstract: Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at https://github.com/mdswyz/IPA
[102] Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views cs.CVPDF
Grzegorz Wilczynski, Mikołaj Zielinski, Bartosz Świrta, Dominik Belter, Przemysław Spurek
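TL;DR: This paper proposes GLADOS, a general, modular framework for a new paradigm: generative 3D reconstruction from views with no overlap. It works in three stages, generative bridging, robust coarse reconstruction, and iterative context expansion with consistency optimization. Foundation models synthesize intermediate views to connect the disjoint inputs, and local contradictions from generation are absorbed to build a globally aligned geometric scaffold.
Details
Motivation: Conventional 3D vision systems rely heavily on visual overlap for geometric alignment or multi-view consistency. This is a fundamental limitation in real-world settings such as distributed robotics or crowd-sourced data collection, where overlapping views cannot be captured. The paper addresses generative 3D reconstruction from fully non-overlapping views.
Result: Benchmarking on a comprehensive dataset and evaluation metrics built specifically for zero-overlap scenarios shows that existing state-of-the-art methods fail completely on this task, producing disconnected geometry or semantically incoherent reconstructions. GLADOS is designed to address these failures.
Insight: The novelty is the new paradigm of generative reconstruction from disjoint views, together with GLADOS, an architecture-agnostic three-stage framework that bridges view gaps with generative models and absorbs their local contradictions to build globally consistent coarse geometry, providing a general scheme for integrating future generation, reconstruction, and inpainting techniques.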
Abstract: 3D vision systems are fundamentally constrained by their reliance on visual overlap: reconstruction methods require it for geometric alignment, while generative models use it to enforce multi-view consistency. This limitation is particularly acute in real-world scenarios such as distributed swarm robotics or crowd-sourced data collection, where capturing overlapping perspectives, both in terms of spatial and appearance overlap, is often impossible. We introduce Generative Reconstruction from Disjoint Views as a new paradigm, establish a comprehensive dataset, and propose specialized evaluation metrics for zero-overlap scenarios. Our benchmarking demonstrates that existing state-of-the-art methods fail catastrophically on this task, producing disconnected geometries or semantically incoherent reconstructions. To address these limitations, we propose GLADOS, a general, modular framework that operates through three stages: (1) Generative Bridging, where foundation models synthesize intermediate perspectives to connect disjoint inputs; (2) Robust Coarse 3D Reconstruction, which establishes a coarse geometric scaffold via global alignment that absorbs local contradictions from the generative process; and (3) Iterative Context Expansion and Consistency Optimization to fill missing regions and unify the reconstruction. As an architecture-agnostic framework, GLADOS enables seamless integration of future advances in generation, reconstruction, and inpainting. The source code is available at: https://github.com/gwilczynski95/GLADOS.
[103] VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network cs.CVPDF
Zepeng Yang, Junxuan Bai, Hao Li, Ju Dai, Junjun Pan
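TL;DR: This paper proposes VIMCAN, a hybrid architecture for visual-inertial multimodal 3D human pose estimation. It combines the efficient sequence modeling of Mamba with the spatial reasoning of cross-attention, targeting the problem that existing Transformer-based SOTA methods cannot run in real time on long sequences because of their quadratic complexity.
Details
Motivation: State-of-the-art 3D human pose estimation methods rely mainly on Transformers, whose quadratic computational complexity makes real-time processing of long sequences difficult. Mamba handles sequences efficiently but struggles to capture complex spatial dependencies in multimodal settings. The paper therefore aims for an architecture that processes temporal and spatial information efficiently while fusing visual and inertial data.
Result: VIMCAN achieves an MPJPE of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. It outperforms prior Transformer-based and other SOTA methods and runs in real time at over 60 FPS on consumer-grade hardware.
Insight: The novelty is a hybrid of Mamba (for efficient temporal modeling) and cross-attention (for extracting spatial dependencies), forming a robust fusion framework that handles both RGB keypoints and wearable IMU data. Viewed objectively, this hybrid design offers a way to balance efficiency and accuracy in multimodal sequence tasks, particularly where long sequences must be processed in real time.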
Abstract: The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba’s dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available on GitHub.
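The MPJPE metric reported in this abstract is the average Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single pose; the joint coordinates are made up, and benchmark protocols often apply an alignment step (for example, root-joint centering) beforehand, which is omitted.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Two joints, coordinates in millimetres (made-up values).
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
target = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
error_mm = mpjpe(pred, target)
```

Here the per-joint errors are 5 mm and 12 mm, so the MPJPE is 8.5 mm. Dataset-level numbers average this quantity over all frames.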
[104] Dynamic Mode Decomposition along Depth in Vision Transformers cs.CVPDF
Nishant Suresh Aswani, Saif Eddin Jabari
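TL;DR: This paper studies the dynamics across consecutive blocks in Vision Transformers (ViTs). It uses Dynamic Mode Decomposition (DMD) to fit a linear operator K that predicts the hidden state after skipping several blocks. Over short spans (p ≤ 4), K^p tracks an unconstrained endpoint map to within 0.02 cosine similarity and recovers intermediate activations, but this local linearization fidelity does not transfer effectively to downstream tasks.
Details
Motivation: The paper asks whether ViT depth implements approximately autonomous linear dynamics, that is, whether consecutive blocks can be replaced by a single linear operator K applied recurrently, in order to simplify the model and understand its computational structure.
Result: On four pretrained DINO ViTs, K^p stays within 0.02 cosine similarity of the unconstrained endpoint map over short spans. In early blocks the fitted operators compress to rank far below d, and the cls token is the most amenable to linearization, but both properties decay monotonically with depth. Downstream, the benefit of linearization is not evident: at the final hidden state, a baseline such as the identity map becomes competitive.
Insight: The novelty is applying DMD to the analysis of ViT depth, which reveals both the existence and the limits of locally linear dynamics and gives a new angle on model compression and interpretability. Viewed objectively, the method is effective over short spans, but more work is needed to extend local linearization to global behavior or downstream tasks.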
Abstract: Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately autonomous linear dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^p$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^p$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with minimal calibration data, and across tokens, the cls token is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
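The DMD fit described in this abstract can be written compactly. The following is a sketch under assumptions: the ridge-regularized least-squares form and the symbols $X$, $Y$, $\lambda$, and the block index $\ell$ are illustrative choices, not necessarily the paper's notation or exact estimator.

```latex
% Columns of X are hidden states h_\ell entering a block in the span;
% columns of Y are the matching states h_{\ell+1} one block later.
K \;=\; \arg\min_{K'} \; \lVert Y - K' X \rVert_F^2 + \lambda \lVert K' \rVert_F^2
  \;=\; Y X^\top \left( X X^\top + \lambda I \right)^{-1},
\qquad
\hat{h}_{\ell+p} \;=\; K^{p}\, h_\ell .
```

The contrast drawn in the abstract is between this single operator applied $p$ times and an unconstrained endpoint map fitted directly from $h_\ell$ to $h_{\ell+p}$. The first is the stronger claim because the same $K$ must also explain every intermediate state.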
[105] Multimodal Stepwise Clinically-Guided Attention Learning for Pathological Complete Response Prediction in Breast Cancer cs.CVPDF
Alice Natalina Caragliano, Valerio Guarrasi, Michela Gravina, Carlo Sansone, Paolo Soda
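TL;DR: This paper proposes a multimodal stepwise clinically-guided attention learning framework for predicting pathological complete response (pCR) in breast cancer. Using breast magnetic resonance imaging (MRI) and clinical variables, it follows a stepwise training strategy that mimics physician diagnosis: the model first learns global imaging patterns, then attention mechanisms focus it on tumor regions, and finally clinical variables are integrated. The aim is to address severe class imbalance and poor cross-institutional generalization.
Details
Motivation: Accurately predicting pCR after neoadjuvant therapy in breast cancer matters for patient prognosis and treatment personalization, but existing methods face severe class imbalance (few responder samples) and limited generalization across clinical datasets.
Result: In external validation across multiple heterogeneous MRI cohorts, the proposed method improves sensitivity over non-guided single-stage baselines while keeping competitive specificity, and produces anatomically coherent attention maps that support interpretation of the model's predictions.
Insight: The novelty is a stepwise training strategy inspired by physician reasoning that injects clinical prior knowledge (such as the tumor region) into the attention mechanism as spatial guidance. This helps the model focus on task-relevant features and rely less on dataset-specific patterns, improving recognition under class imbalance and generalization across institutions.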
Abstract: Pathological complete response (pCR) is a key prognostic factor in breast cancer patients undergoing neoadjuvant therapy, strongly associated with long-term survival and treatment personalization. However, accurate pre-treatment pCR prediction remains challenging due to severe class imbalance and limited generalizability across diverse clinical settings. In this work, we propose a multimodal stepwise clinically-guided attention learning framework for pCR prediction from breast magnetic resonance imaging (MRI), designed to address these limitations through medically grounded spatial guidance and multimodal integration. The approach follows a stepwise training strategy inspired by physician reasoning: the model first learns global discriminative imaging patterns, then attention mechanisms are introduced to constrain the network toward tumor regions, and finally clinical variables are integrated to refine decision-making. This guidance strategy encourages prioritization of task-relevant features, improving identification of responders despite their limited representation in the dataset. Moreover, grounding attention in anatomically consistent tumor regions reduces reliance on dataset-specific patterns, thereby enhancing cross-institutional generalization. The framework is evaluated through external validation across heterogeneous MRI cohorts. Compared to non-guided single-stage baselines, the proposed approach improves sensitivity while maintaining competitive specificity, and produces anatomically coherent attention maps that support interpretation of the model’s predictions. These findings highlight the potential of clinically-guided multimodal attention learning for robust and generalizable pCR prediction in breast cancer.
[106] Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs cs.CVPDF
Song Zhang, Yanlong Chen, Yilin Li, Yining Chen, Zili Yi
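TL;DR: This paper presents ScaleEarth, a framework that addresses the scale mismatch remote sensing vision-language models (RS-VLMs) face across different ground sampling distances (GSDs). Built on Qwen3-VL, it dynamically modulates the model's computation path through a continuous scale conditioning mechanism (CS-HLoRA) and adds a lightweight sub-head (SSE-U) that predicts GSD and its uncertainty from visual features. The authors also construct the GeoScale-VQA dataset for supervision, and the model reaches state-of-the-art results on several remote sensing benchmarks.
Details
Motivation: In remote sensing imagery, the same geographic object shows very different visual evidence across GSDs that span multiple orders of magnitude. Existing RS-VLMs either discard GSD or inject it as a discrete text token, so a single static parameter set has to cover the whole scale spectrum. A method that conditions on GSD continuously is therefore needed.
Result: Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art (SOTA) results on remote sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
Insight: The novelty is treating GSD as a continuous conditioning variable that dynamically modulates the LoRA subspace (CS-HLoRA), so that computation is routed by physical scale. Predicting GSD and its uncertainty from visual features (SSE-U) removes the dependence on sensor metadata, and together with the purpose-built scale-layered VQA dataset (GeoScale-VQA) this forms a closed method-data loop that improves adaptation to scale changes.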
Abstract: Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model’s computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
[107] Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs cs.CV | cs.CLPDF
Peitao Han, Fei Cheng, Lis K. Pereira, Qianying Liu, Shigeru Kitazawa
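TL;DR: This paper studies the performance gap of video large language models (Video-LLMs) on the Arrow-of-Time (AoT) task and finds that the bottleneck is obstructed flow of temporal information through the model architecture, not the vision encoder itself. The analysis identifies projector design as a key factor. The authors build a Video-LLM that combines a temporal-aware encoder, a time-preserved projector, and AoT supervision; it surpasses human performance on AoT and improves other temporal reasoning tasks.
Details
Motivation: Frontier Video-LLMs perform far below humans at judging whether a video plays forward or backward (the Arrow-of-Time task). The paper asks whether the gap comes from vision encoders failing to encode temporal information or from an information bottleneck elsewhere in the architecture.
Result: The proposed method reaches 98.1% accuracy on AoT$_{PPB}$, surpassing human performance, and improves results by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench.
Insight: The novelty is systematically tracing temporal information through the components of a Video-LLM (encoder, projector, LLM) and finding that projector design is a key bottleneck for temporal reasoning: Q-Former disrupts temporal information, while a time-preserved MLP projection improves its transfer. Combining a temporal-aware encoder with a time-preserved projector effectively improves the model's temporal reasoning.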
Abstract: The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM’s access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
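To make the Arrow-of-Time setup concrete, the sketch below builds one labeled sample by playing a clip either forward or reversed with equal probability. The function name, the label convention, and the string stand-ins for frames are assumptions for illustration; the paper's data pipeline is not described in this abstract.

```python
import random

def make_aot_sample(frames, rng):
    """Return (clip, label): label 1 = forward playback, 0 = reversed."""
    if rng.random() < 0.5:
        return list(frames), 1
    return list(reversed(frames)), 0

rng = random.Random(0)
frames = ["f0", "f1", "f2", "f3"]  # stand-ins for decoded video frames
clip, label = make_aot_sample(frames, rng)
```

Because both directions contain exactly the same frames, a model can only beat chance by using frame order. This is why a projector that scrambles or pools away temporal order would be expected to hurt on this task.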
[108] PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models cs.CVPDF
Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato
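TL;DR: PolarVLM is a new vision-language model that integrates the physical parameters of polarization imaging to address mainstream VLMs' difficulty with optically ambiguous scenes such as reflections and transparent objects. It uses a dual-stream architecture and a progressive two-stage training strategy to avoid physical misinterpretation while preserving general visual ability. The authors also build PolarVQA, the first polarization-aware VQA benchmark, with 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with gains of 26.6% in reflection recognition and 34.0% in glass counting, achieving physics-aware semantic understanding.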
Details
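Motivation: Mainstream vision-language models rely on standard RGB inputs and therefore struggle with optically ambiguous scenes such as reflections and transparent objects. Polarization imaging captures physical parameters that resolve these ambiguities, but existing methods are limited to fixed-format outputs and lack open-ended reasoning. The paper aims to bridge this gap between semantics and physics.
Result: Across five evaluation tasks, PolarVLM surpasses the RGB baseline by 25.4% overall, with a 26.6% gain in reflection recognition and a 34.0% gain in glass counting. Experiments are run on the authors' PolarVQA benchmark, which contains 75K physics-grounded instruction pairs, and the results indicate that the model achieves physics-aware semantic understanding.
Insight: The contributions are the first integration of polarimetric physical parameters into a VLM, a dual-stream architecture with a progressive two-stage training strategy that balances physical information against general ability, and PolarVQA, the first polarization-aware VQA benchmark. Viewed objectively, fusing multimodal physical information addresses a fundamental limitation of VLMs in these scenes and offers a new route to strengthening models' understanding of real-world physical properties.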
Abstract: Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
[109] Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding cs.CV | cs.AIPDF
Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu
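TL;DR: The paper proposes Response-G1, a framework that uses explicit scene graph modeling to decide when to respond proactively in streaming video understanding. It has three fine-tuning-free stages: online query-guided scene graph generation, memory-based retrieval of semantically relevant historical scene graphs, and retrieval-augmented trigger prompting that decides frame by frame whether to respond.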
Details
Motivation: Existing methods decide poorly when to respond in streaming video understanding because their modeling of visual evidence is implicit and query-agnostic. The paper addresses this by using explicit, structured scene graphs to align the accumulated video evidence with the query's expected response conditions.
Result: Experiments on established benchmarks show that the method is superior on both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
Insight: The novelty is grounding both evidence and conditions in a shared graph representation, which makes response-timing decisions more interpretable and accurate. Viewed objectively, the fine-tuning-free three-stage design, especially the retrieval-augmented trigger prompting, offers a new structured reasoning approach for streaming video understanding.
Abstract: Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query’s expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame “silence/response” decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
[110] TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CVPDF
Hengyi Feng, Hao Liang, Mingrui Chen, Bohan Zeng, Meiyi Qiang
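TL;DR: This paper presents TraceAV-Bench, the first benchmark to evaluate multi-hop reasoning over long audio-visual trajectories together with robustness to multimodal hallucination. It contains 2,200 rigorously validated multiple-choice questions over 578 long videos (339.5 hours in total), covering 4 evaluation dimensions and 15 sub-tasks. Each question is built on an explicit reasoning chain that averages 3.68 hops over an average span of 15.1 minutes. Evaluation of several representative OmniLLMs shows that the benchmark remains challenging for all models: the strongest closed-source model (Gemini 3.1 Pro) reaches only 68.29% on general tasks and the best open-source model (Ming-Flash-Omni-2.0) reaches 51.70%, leaving substantial headroom.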
Details
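Motivation: Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, yet most existing benchmarks fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to single-hop perception.
Result: On TraceAV-Bench, the strongest closed-source model (Gemini 3.1 Pro) reaches 68.29% accuracy on general tasks and the best open-source model (Ming-Flash-Omni-2.0) reaches 51.70%. The study also finds that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance.
Insight: The novelty is the first benchmark focused on multi-hop trajectory reasoning over long audio-visual content and on hallucination robustness. The core contributions are: 1) a new evaluation framework that requires chained reasoning over long-horizon, multimodal evidence; 2) a large, high-quality dataset built through a semi-automated pipeline with strict quality assurance; and 3) evidence that current OmniLLMs fall well short on such tasks and that hallucination robustness is decoupled from general reasoning performance, which points to directions for future research.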
Abstract: Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.
[111] LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation cs.CV | cs.AIPDF
Jun Wang, Fengpeng Li, Hang Dong, Tianjin Huang, Wei Han
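TL;DR: The paper presents LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote-sensing lithology interpretation. It contains 10,000 expert-annotated instances across 12 representative lithological categories, split into 4,000 multiple-choice and 6,000 open-ended tasks that span five cognitive levels from identification and description to comprehensive reasoning. The paper also develops an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, and experiments with several large vision-language models reveal substantial limitations on higher-order geological semantic understanding tasks.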
Details
Motivation: Remote-sensing lithology interpretation is a knowledge-intensive task, and evaluation of existing large multimodal models is limited by the lack of benchmarks that include lithological annotations, multi-level geological semantics, and expert assessment. A dedicated benchmark is therefore needed to advance reliable automated interpretation in this field.
Result: Experiments with several large vision-language models show substantial shortcomings in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
Insight: The novelty is LithoBench, the first multi-level, expert-annotated benchmark focused on remote-sensing lithology interpretation, together with a semi-automated construction pipeline that combines expert knowledge with structured descriptions. It provides a standardized tool for evaluating models' geological reasoning and exposes the bottleneck of current models in a complex specialist domain.
Abstract: Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multiple sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
[112] EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting cs.CVPDF
Jaeyoung Choi, Hyeondong Kim, Yujin Kim, Daehee Park
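TL;DR: This paper presents EggHand, a foundation-model-based multimodal framework for forecasting future 3D hand pose sequences from egocentric video. It combines the action decoder of a Vision-Language-Action model (to capture the structured temporal dynamics of hand motion) with an egocentric video-text encoder (to provide viewpoint-aware context learned from large-scale first-person video), enabling joint reasoning over hand motion, context, and high-level intent under severe ego-motion.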
Details
Motivation: Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and for embodied applications such as AR/VR assistance and human-robot interaction. The task is very challenging because egocentric hand motion is driven by complex human intent, involves highly dexterous articulation, and is observed under drastic viewpoint changes caused by ego-motion.
Result: Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and supports controllable prediction through language-based task prompts.
Insight: The main novelty is unifying multimodal semantic reasoning with dynamic motion modeling. By combining a VLA model's action decoder with an egocentric video-text encoder, the method overcomes the brittleness of generic visual encoders under ego-motion and reasons jointly without relying on body pose or external tracking. Viewed objectively, learning viewpoint-aware context from large-scale first-person video and enabling controllable prediction through language prompts are directions worth borrowing.
Abstract: Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand
[113] Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models cs.CV | cs.AI | cs.ROPDF
Berkehan Ünal, Dierend Hauke, Fazlija Dren, Plachetka Christopher
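TL;DR: This paper examines whether vision-language models (VLMs) can serve as zero-shot "Operational Design Domain (ODD) sensors" that support adaptable perception in safety-critical applications such as Automated Driving Systems. It empirically evaluates the zero-shot ODD classification and detection performance of four VLMs on a custom dataset and Mapillary Vistas, and adds failure analyses, an ablation of optimization strategies, and a set of reusable prompt templates.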
Details
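Motivation: As autonomous systems such as Automated Driving Systems move toward practical use, compliance with safety regulations centered on the Operational Design Domain (ODD) becomes essential. What is needed is a perception method that requires no task-specific training data and adapts flexibly as ODD definitions evolve. VLMs combine visual recognition with language reasoning and are therefore a promising candidate for zero-shot ODD perception.
Result: Experiments on the custom dataset and the Mapillary Vistas benchmark show that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may reduce recall. The study provides a cost-performance overview of zero-shot optimization strategies and a suite of reusable prompt templates.
Insight: The novelty is the first systematic evaluation of VLMs as zero-shot ODD sensors, along with a definition-anchored chain-of-thought prompting strategy with persona decomposition that markedly improves zero-shot perception. The prompt template suite is a practical tool for transparent and effective ODD perception in safety-critical applications, reducing dependence on task-specific training data and improving adaptability.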
Abstract: Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot “ODD sensors” that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
[114] Aquatic Neuromorphic Optical Flow cs.CV | eess.IVPDF
Pei Zhang, Yunkai Liang, Kaiqiang Wang
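TL;DR: This paper proposes a self-supervised framework based on spiking neural networks that estimates per-pixel optical flow from asynchronous event streams, targeting the resource constraints of underwater imaging. Using neuromorphic vision, the method achieves efficient underwater motion perception without large amounts of labeled data.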
Details
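Motivation: Underwater environments impose severe constraints on conventional imaging systems and require a balance between high-quality sensing and resource efficiency. Event cameras are promising, but their use in underwater scenes is largely unexplored. The paper uses neuromorphic vision to study motion fields as a key medium for agile underwater perception.
Result: Extensive evaluation shows that the method is competitive with leading techniques in both visual and quantitative results while being more computationally efficient, which suits resource-constrained underwater edge platforms.
Insight: The contributions are a self-supervised optical flow estimation framework built on spiking neural networks, which sidesteps the long-standing bottleneck of scarce underwater data, and the combination of neuromorphic sensing with underwater intelligence, which opens a path to lightweight, real-time, low-cost perception.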
Abstract: Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.
[115] Towards Billion-scale Multi-modal Biometric Search cs.CV | cs.AIPDF
Arka Koner, Chetan S. Naik, Lokesh Kurre, Vivek Raghavan, Barada P. Sabut
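TL;DR: This paper describes Bharat ABIS, a billion-scale multimodal biometric search system built on open-source architectures. It combines fingerprint, face, and iris modalities and processes them through preprocessing, quality assessment, presentation attack detection, and embedding learning to produce a concatenated template of 13.5KB per person, with the goal of efficient large-scale 1:N search (de-duplication).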
Details
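Motivation: Searching a multimodal biometric database of a billion records for a country-level identity system pushes the limits of acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and the handling of special cases (e.g., missing fingers).
Result: On a demographically stratified gallery of 220 million identities randomly sampled from India's Aadhaar database, the system achieves an FNIR of 0.3% at an FPIR of 0.5% for adult probes (over 18 years). Bharat ABIS performs well in a comparison against three state-of-the-art commercial off-the-shelf (COTS) systems on a 20M gallery. On a single server (8xNvidia H100 GPUs, 2TB RAM), it reaches a throughput of 100 searches per second on a 40M gallery.
Insight: The novelty is the first detailed account of the end-to-end architecture of a multimodal biometric search system at this scale, showing that high accuracy and high throughput are feasible with open-source components. Its multimodal fusion (fingerprint, face, iris) and compact template design (13.5KB per person) offer a system-level reference for large-scale identity de-duplication.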
Abstract: Searching a multi-biometric database of a billion records for a country-level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large-scale multimodal biometric search system, called Bharat ABIS, based on open-source architectures. The end-to-end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality-specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de-duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India’s Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state-of-the-art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).
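A back-of-the-envelope calculation helps put the reported template size and throughput in context. The sketch below only multiplies figures quoted in the abstract; it assumes decimal kilobytes (1 KB = 1,000 bytes), ignores any index or metadata overhead, and the brute-force comparison count assumes every search touches every gallery entry, which the abstract does not state.

```python
TEMPLATE_KB = 13.5    # per-person template size quoted in the abstract
BYTES_PER_KB = 1000   # assumption: decimal kilobytes


def gallery_size_gb(num_identities):
    """Raw template storage in gigabytes, excluding index and metadata."""
    return num_identities * TEMPLATE_KB * BYTES_PER_KB / 1e9


size_40m = gallery_size_gb(40_000_000)        # throughput-test gallery
size_220m = gallery_size_gb(220_000_000)      # accuracy-evaluation gallery
size_full = gallery_size_gb(1_550_000_000)    # full record count quoted

# Assumption: exhaustive 1:N matching, so each search compares against
# every gallery template.
SEARCHES_PER_SECOND = 100
comparisons_per_second = SEARCHES_PER_SECOND * 40_000_000
```

Under these assumptions the 40M gallery needs roughly 540 GB of raw template storage, which is consistent with holding it in the 2TB of RAM on the single server described, while the 220M gallery needs roughly 3 TB and the full 1.55 billion records about 21 TB.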
[116] OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos cs.CVPDF
Ritul Jangir, Arkya Jyoti Bagchi, Aiman Farooq, Mangalton Okram, Saurabh Seetaram Korgaonkar
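TL;DR: This paper proposes OphEdit, a training-free, text-guided framework for editing ophthalmic surgical videos. It uses a deterministic second-order ODE inversion pipeline to capture the attention value tensors of the original video and selectively injects them into the conditional Classifier-Free Guidance branch during denoising, so that text-driven semantic changes are applied while the fine anatomical structure of the eye is preserved.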
Details
Motivation: In high-fidelity surgical video generation, strict anatomical and temporal constraints make it difficult to edit surgical attributes precisely, such as instrument-tissue interactions or procedural phases.
Result: Clinical evaluation shows that OphEdit handles complex surgical transformations (such as instrument swaps and procedural variations) effectively, with better structural fidelity and temporal consistency than natural-domain video editors. It is the first training-free video editing method applied to ophthalmic surgery.
Insight: The novelty is combining deterministic ODE inversion with selective injection of attention value tensors into the conditional CFG branch, which enables precise, training-free, text-guided editing of surgical video under strict anatomical constraints and offers a scalable way to generate diverse annotated medical datasets.
Abstract: While high-fidelity surgical video generation can greatly improve medical training and the development of AI, adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument-tissue interactions or procedural phases, is challenging due to the strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrate that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at https://github.com/ophedit/OphEdit
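The two mechanisms named in the abstract can be sketched in a few lines. This is a minimal illustration and not the OphEdit implementation: `cfg_combine` is the standard classifier-free guidance combination, while `conditional_branch_output`, the `inject` flag, and all numeric values are hypothetical stand-ins for swapping stored value tensors into one attention call.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def attention_output(attn_weights, values):
    """Weighted sum of value vectors for a single query."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(attn_weights, values)) for d in range(dim)]


def conditional_branch_output(attn_weights, edit_values, stored_values, inject):
    """Use values stored from the source video when injection is enabled."""
    values = stored_values if inject else edit_values
    return attention_output(attn_weights, values)


# Hypothetical toy numbers.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)

attn = [0.5, 0.5]
stored_v = [[1.0, 0.0], [0.0, 1.0]]   # values captured during inversion
edit_v = [[2.0, 2.0], [4.0, 4.0]]     # values computed under the edit prompt

preserved = conditional_branch_output(attn, edit_v, stored_v, inject=True)
edited = conditional_branch_output(attn, edit_v, stored_v, inject=False)
```

With injection on, the output is built from the source video's values, which is the intuition behind preserving structure; which layers and timesteps receive injection is a design choice the abstract describes only as selective.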
[117] Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition cs.CVPDF
Yingfeng Wang, Yuxuan Xiao, Shengcai Liao
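TL;DR: This paper proposes a new formulation called "Head Similarity" that goes beyond conventional face recognition to model structured similarity over the whole head, including appearance variation such as hairstyle and pose. It jointly models identity discrimination and appearance-sensitive similarity through hierarchical supervision and identity-aware distillation, and is validated on a large-scale long-form video benchmark.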
Details
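Motivation: Conventional face recognition models enforce intra-identity invariance and therefore ignore appearance variation such as hairstyle and pose, so they cannot serve appearance-sensitive applications under non-frontal views or when facial cues are missing.
Result: Experiments show that conventional face recognition models fail to capture appearance-dependent similarity, while the proposed method demonstrates the feasibility of structured whole-head similarity modeling. It is validated on a large-scale long-form video benchmark built by the authors that covers diverse poses, occlusions, and temporal changes.
Insight: The novelty is extending identity-centric recognition to structured whole-head similarity modeling. The method explicitly captures intra-identity appearance variation and enforces a hierarchical similarity ordering across identity and appearance states, which allows meaningful comparison under occlusion or rear-view conditions. The weakly supervised appearance-state annotation and the hierarchical supervision framework are worth borrowing.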
Abstract: Many vision applications require identity consistency beyond strict biometric recognition, especially under non-frontal views or when facial cues are missing. However, conventional face recognition models enforce intra-identity invariance, collapsing appearance variations such as hairstyle or styling changes into a single representation, limiting their use in appearance-sensitive scenarios. To address this limitation, we introduce Head Similarity, a new formulation that extends identity-centric recognition to structured whole-head similarity modeling. Our approach explicitly captures intra-identity appearance variation and enforces hierarchical similarity ordering across identity and appearance states, enabling meaningful comparison even under occlusion or rear-view conditions. We construct a large-scale benchmark from long-form videos with weakly-supervised appearance states, covering diverse poses, occlusions, and temporal changes. As a first step, we develop a simple yet effective framework that jointly models identity discrimination and appearance-sensitive similarity through hierarchical supervision and identity-aware distillation. Experiments show that conventional face recognition models fail to capture appearance-dependent similarity, while our approach demonstrates the feasibility of structured whole-head similarity modeling.
[118] Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation cs.CVPDF
Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
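TL;DR: This paper proposes XpertCausal, a radiologist-guided causal concept bottleneck model for chest X-ray interpretation. It models pathology-to-concept relationships with a probabilistic noisy-OR framework, estimates pathology probabilities from predicted concepts via Bayesian inference, and uses expert knowledge to constrain model structure for greater clinical plausibility.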
Details
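Motivation: Existing concept bottleneck models (CBMs) in medical imaging usually treat concepts as discriminative predictors of pathology labels and do not explicitly model the clinical generative process by which diseases produce observable radiographic findings. This limits their clinical interpretability and plausibility.
Result: Evaluation on MIMIC-CXR shows that, compared with a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal improves pathology classification performance (AUROC), calibration, and explanation quality, and learns concept-pathology relationships that align more closely with expert knowledge.
Insight: The novelty is integrating clinically motivated causal structure and expert domain knowledge into CBMs. Generative modeling combined with Bayesian inference yields a more accurate, interpretable, and clinically aligned model, and provides a causal reasoning framework that other medical image analysis work can draw on.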
Abstract: Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
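The noisy-OR generative model and its Bayesian inversion can be sketched with exact enumeration over a small set of pathologies. This is an illustrative toy, not the XpertCausal implementation: the pathology and concept names, priors, link strengths, and the `leak` term are all hypothetical values chosen for the example and carry no clinical meaning.

```python
from itertools import product


def noisy_or(active, weights, leak=0.0):
    """P(concept present | active pathologies) under a noisy-OR model."""
    p_absent = 1.0 - leak
    for pathology in active:
        p_absent *= 1.0 - weights.get(pathology, 0.0)
    return 1.0 - p_absent


def pathology_posterior(observed, priors, link, leak=0.01):
    """Invert the generative model by enumerating every pathology state."""
    names = list(priors)
    total = 0.0
    marginal = {name: 0.0 for name in names}
    for state in product([0, 1], repeat=len(names)):
        active = [n for n, s in zip(names, state) if s]
        p = 1.0
        for n, s in zip(names, state):
            p *= priors[n] if s else 1.0 - priors[n]
        for concept, present in observed.items():
            p_c = noisy_or(active, link[concept], leak)
            p *= p_c if present else 1.0 - p_c
        total += p
        for n in active:
            marginal[n] += p
    return {n: marginal[n] / total for n in names}


# Hypothetical priors and pathology-to-concept link strengths.
priors = {"pneumonia": 0.1, "effusion": 0.1}
link = {
    "consolidation": {"pneumonia": 0.8},
    "blunted_costophrenic_angle": {"effusion": 0.9},
}
observed = {"consolidation": True, "blunted_costophrenic_angle": False}
posterior = pathology_posterior(observed, priors, link)
```

Restricting `link` to expert-approved pathology-concept pairs, as in the toy dictionary above, is one simple way to read the abstract's structural constraint; enumeration is exponential in the number of pathologies, so a real system would need a more scalable inference scheme.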
[119] SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models cs.CVPDF
Jiesong Lian, Zixiang Zhou, Ruizhe Zhong, Yuan Zhou, Qinglin Lu
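TL;DR: This paper proposes SARA (Semantically Adaptive Relational Alignment), a method for improving fine-grained text following in video diffusion models (VDMs). SARA adds a text-conditioned saliency mechanism that dynamically selects prompt-relevant token pairs for supervision, refining token-relation distillation. In the Wan2.2 continual-training setting, it outperforms existing methods such as SFT, VideoREPA, and MoAlign in text alignment and motion quality.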
Details
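Motivation: Current video diffusion models drop entities, mis-bind attributes, and weaken interactions when synthesizing video. Existing representation-alignment methods (such as VideoREPA and MoAlign) allocate supervision by visual or motion cues and do not fully account for how relevant each token pair is to the prompt, so a more semantically adaptive alignment strategy is needed.
Result: In the Wan2.2 continual-training setting, SARA outperforms SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study, improving both text alignment and motion quality.
Insight: The contributions are a text-conditioned saliency mechanism, implemented as a lightweight Stage 1 aligner trained with SAM 3.1 mask supervision and InfoNCE regularization, and a pair-routing operator that focuses supervision on subject-subject and subject-background pairs rather than background-background pairs, making semantic alignment more efficient.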
Abstract: Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
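The pair-routing idea, weighting a token pair whenever either endpoint is salient, can be sketched as follows. The abstract does not give the operator's exact form, so taking the maximum of the two endpoint saliencies is an assumption made for illustration, as are the function names and the squared-error relation loss.

```python
def pair_weights(saliency):
    """Pair weight is high if either endpoint is salient (assumed: max)."""
    n = len(saliency)
    return [[max(saliency[i], saliency[j]) for j in range(n)] for i in range(n)]


def weighted_relation_loss(student, teacher, weights):
    """Weighted mean squared gap between student and teacher token relations."""
    num = 0.0
    den = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            num += w * (student[i][j] - teacher[i][j]) ** 2
            den += w
    return num / den if den else 0.0


# Hypothetical saliency: token 0 is a subject, tokens 1 and 2 are background.
saliency = [1.0, 0.0, 0.0]
weights = pair_weights(saliency)

teacher = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.9], [0.1, 0.9, 1.0]]
# The student disagrees with the teacher only on the background pair (1, 2).
student = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.0], [0.1, 0.0, 1.0]]
loss = weighted_relation_loss(student, teacher, weights)
```

In this toy case the background-background pair receives zero weight, so the mismatch there contributes nothing to the loss, while any error on a pair involving the subject token would be fully counted.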
[120] GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning cs.CV | cs.AI | cs.CLPDF
Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi
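TL;DR: GazeVLM is a new multimodal architecture that emulates the human active-vision mechanism through internal attention control, enabling vision-language models to autonomously generate gaze tokens (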
Details
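Motivation: Existing vision-language models process visual information passively and rely on large static contexts, which weakens spatial reasoning and causes linguistic hallucination. The paper strengthens the model's spatially selective attention by emulating the top-down, goal-directed attention of human active vision.
Result: On the HRBench-4k and HRBench-8k benchmarks, the 4B-parameter GazeVLM surpasses state-of-the-art vision-language models of the same parameter scale by nearly 4% and outperforms agentic multimodal pipelines based on thinking with images by more than 5%.
Insight: The novelty is internalizing metacognitive control into the reasoning loop: autonomously generated gaze tokens dynamically adjust attention bias without external cropping tools or additional visual tokens. A customized Group Relative Policy Optimization training scheme rewards effective grounding and leads to efficient deployment of attention resources.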
Abstract: Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{
[121] Explainable Part-Based Vehicle Classifier with Spatial Awareness cs.CVPDF
Andreas Caduff, Klaus Zahn, Jonas Hofstetter, Martin Rechsteiner, Patrick Flaig
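TL;DR: This paper proposes an explainable, part-based vehicle classification method with spatial awareness. It decomposes a conventional end-to-end CNN into three steps: a CNN detects semantically salient vehicle parts, features are then built from the spatial probability distributions of those parts, and a decision tree or softmax regression performs the final classification. The method matches the accuracy of advanced end-to-end CNNs while markedly improving robustness and explainability.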
Details
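Motivation: In fine-grained vehicle classification for Intelligent Transportation Systems, conventional end-to-end convolutional neural networks are black boxes, are hard to extend to new categories, and lack explainability. The paper addresses these issues and also improves robustness against false part detections.
Result: On a benchmark that is not explicitly specified, the method reaches classification accuracy comparable to a state-of-the-art end-to-end CNN while markedly improving robustness against false part detections, challenging the common assumption of a trade-off between accuracy and explainability.
Insight: The main novelty is combining part detection with spatial probability modeling: a spatial probability map for each part replaces a simple binary presence feature, which strengthens the features and gives the model spatial awareness. Viewed objectively, this "decompose and recombine" design, which separates part detection from the classification decision, is an effective route to computer vision models that are both accurate and explainable, especially in applications that need transparency and extensibility.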
Abstract: In the area of Intelligent Transportation Systems (ITS), fine-grained vehicle classification systems play an essential role. Recently, the authors have presented a novel vision-based classification approach in which standard end-to-end Convolutional Neural Networks (CNNs) have been decomposed into 1) a CNN-based detector for semantically strong vehicle parts, followed by 2) feature construction and 3) final classification by a decision tree. In contrast to conventional CNNs, this allows both easy extensibility to new vehicle categories - without the need to fully retrain the part detector - and an important step towards the interpretability of the model, removing partially the black-box nature inherent to CNNs. Here we present an important extension of this approach that now incorporates spatial awareness of the vehicle parts: while the feature construction 2) of the previous approach used a binary decision for each feature (present vs. absent), now a full spatial probability map is constructed to condition the presence of each individual part with respect to a given vehicle category. The classification is performed using a softmax regression approach for the overall vehicle probabilities. This method shows a considerably improved robustness against false (part-)detections, a point that is crucial for practical application. Comparative analyses with a state-of-the-art end-to-end CNN indicate that our part-based methods achieve comparable accuracy, effectively challenging the presumed trade-off between accuracy and explainability. This research represents a significant advance in vehicle classification for ITS and forms the basis for systems that combine high accuracy with intuitive interpretability.
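The step from binary part features to spatial probability maps can be sketched as below. This is a simplified stand-in rather than the authors' classifier: it sums log-probabilities looked up from per-category spatial maps and normalizes them with a softmax, whereas the paper describes a trained softmax regression. Category names, part names, grid cells, and probabilities are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(detections, spatial_maps, eps=1e-6):
    """Score each category by the log-probability of each part at its cell."""
    categories = list(spatial_maps)
    scores = []
    for cat in categories:
        score = 0.0
        for part, cell in detections:
            prob = spatial_maps[cat].get(part, {}).get(cell, 0.0)
            score += math.log(prob + eps)  # eps avoids log(0)
        scores.append(score)
    return dict(zip(categories, softmax(scores)))


# Hypothetical maps: P(part observed at grid cell | vehicle category).
spatial_maps = {
    "car": {"wheel": {"bottom": 0.9, "top": 0.1}, "trunk": {"rear": 0.8}},
    "truck": {"wheel": {"bottom": 0.9, "top": 0.1}, "trunk": {"rear": 0.05}},
}
detections = [("wheel", "bottom"), ("trunk", "rear")]
probs = classify(detections, spatial_maps)

# A spurious wheel detection at an unlikely location lowers both scores by
# the same amount here, so the ranking is unchanged.
noisy_probs = classify(detections + [("wheel", "top")], spatial_maps)
```

Because each detection contributes a graded probability instead of a hard present/absent flag, a single implausible detection shifts the scores rather than flipping a feature, which is one way to read the robustness claim in the abstract.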
[122] EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding cs.CVPDF
Lang Zhang, JinYi Yoon, Matthew Corbett, Abhijit Sarkar, Bo Ji
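TL;DR: This paper proposes EyeCue, a gaze-empowered egocentric video understanding framework for detecting driver cognitive distraction. Its core idea is that cognitive distraction shows up in the interaction between eye gaze and the visual environment. To capture this interaction, EyeCue combines eye gaze with egocentric video for context-aware modeling of the driver's attention over time. To address the limited scale and diversity of existing datasets, the paper also introduces the CogDrive dataset.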
Details
Motivation: Driver cognitive distraction is a major cause of road collisions and is hard to detect. Unlike manual or visual distraction, cognitive distraction arises from thoughts unrelated to driving, even when the driver appears visually attentive and makes no obvious physical movements. Existing methods struggle to capture this internal state.
Result: In extensive evaluation on the proposed CogDrive dataset, EyeCue reaches the highest accuracy of 74.38%, outperforming 11 baselines from 6 model families by more than 7%. EyeCue also exceeds 70% accuracy across driving scenarios (different road types, times of day, and weather conditions), showing strong generalization.
Insight: The main novelty is recognizing that cognitive distraction shows up in the interaction between eye gaze and the visual environment, and proposing a framework that combines gaze with egocentric video to model this cross-modal interaction. The paper also builds CogDrive, a comprehensive multi-scenario dataset that is a useful resource for the field. Viewed objectively, treating gaze as a key modality and modeling its interaction with visual context is an effective and insightful way to detect this internal state.
Abstract: Driver cognitive distraction is a major cause of road collisions and remains difficult to detect. Unlike manual or visual distraction, cognitive distraction arises from thoughts unrelated to driving, even when the driver appears visually attentive and exhibits no explicit physical movements. In this work, we propose EyeCue, a gaze-empowered egocentric video understanding framework, to detect driver cognitive distraction. A key insight is that cognitive distraction manifests in the interaction between eye gaze and visual context. To capture this interaction, EyeCue integrates eye gaze with egocentric video to enable context-aware modeling of the driver’s attention over time. Furthermore, to tackle the limited scale and diversity of existing datasets, we introduce CogDrive, a comprehensive multi-scenario dataset that augments four existing driving datasets with cognitive distraction annotations. Through extensive evaluations on CogDrive, we show that EyeCue achieves the highest accuracy of 74.38%, outperforming 11 baselines from 6 model families by over 7%. Notably, EyeCue can achieve an accuracy of over 70% across various driving scenarios (different road types, times of day, and weather conditions) with strong generalizability. These results highlight the importance of modeling gaze-context interactions and the effectiveness of cross-modal interaction modeling for multimodal cognitive distraction detection. Our codes and CogDrive dataset resources are available at https://github.com/langzhang2000/EyeCue.
[123] From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data cs.CVPDF
Yue Yu, Jiayu Wang, Jiajia Shi, Jingjing Chen, Yu-Gang Jiang
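TL;DR: This paper proposes ConsistentBeauty, a data generation pipeline, and RealBeauty, a synthetic-to-real post-training framework, to address identity preservation and the synthetic-to-real domain gap in makeup transfer, and reaches state-of-the-art performance on multiple benchmarks.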
Details
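Motivation: Existing makeup transfer methods trained on synthetic data often fail to preserve identity, and their performance degrades when generalizing to complex real-world scenes.
Result: The method reaches state-of-the-art performance on multiple benchmarks and shows clear advantages in identity preservation and in handling complex real-world cases.
Insight: The novelty is a data generation pipeline that ensures makeup fidelity and strict identity consistency in the synthesized data, together with a synthetic-to-real post-training framework that combines reinforcement learning with verifiable rewards so the model can learn from real makeup patterns beyond the limits of synthetic supervision.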
Abstract: Makeup transfer aims to apply the makeup style of a reference portrait to a source portrait while preserving identity and background. Early methods formulate this task as unsupervised image-to-image translation, relying on surrogate objectives and often yielding limited performance. Recent diffusion- and flow-based approaches instead exploit synthetic data for supervised training, leading to significant improvements. However, these methods still face two critical challenges: synthetic supervision frequently fails to faithfully preserve identity, and the domain gap between synthetic and real data limits generalization, resulting in degraded performance in complex real-world scenarios. To address these issues, this paper first proposes ConsistentBeauty, a novel data curation pipeline that ensures makeup fidelity and strict identity consistency within the synthesized data. Second, we propose RealBeauty, a synthetic-to-real post-training framework. Beyond supervised learning on curated synthetic data, we further adapt the model to real-world scenarios through reinforcement learning and design novel verifiable rewards tailored to the makeup transfer task. It allows the model to further benefit from real makeup patterns beyond synthetic supervision. In addition, we establish a new diverse benchmark for makeup transfer, covering a wide range of skin tones, ages, genders, poses, and makeup styles, thereby enabling a more comprehensive evaluation of model performance under diverse real-world conditions. Extensive experiments show that our method achieves state-of-the-art performance on multiple benchmarks and demonstrates clear advantages in identity preservation and performance on complex real-world cases.
[124] Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models cs.CV | cs.AIPDF
Yuancheng Wei, Linli Yao, Lei Li, Haojie Zhang, Hao Zhou
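TL;DR: The paper proposes a unified framework for video understanding reward modeling that spans benchmark design, data construction, and model training. The authors introduce the Video Understanding Reward Bench (VURB), with 2,100 preference pairs and long chain-of-thought reasoning traces, and build the Video Understanding Preference Dataset (VUP-35K) through an automated pipeline. On this data they train a discriminative model, VideoDRM, and a generative model, VideoGRM, both of which reach state-of-the-art performance on multiple benchmarks.
Details
Motivation: Progress in video understanding reward modeling is limited by the lack of robust evaluation benchmarks and high-quality preference data.
Result: VideoDRM and VideoGRM both achieve state-of-the-art performance on VURB and VideoRewardBench; VUP-35K improves reward performance and model reasoning capability; both models yield significant gains under best-of-N test-time scaling.
Insight: The novelty lies in building VURB, the first video understanding reward benchmark with long chain-of-thought reasoning and majority-voting evaluation, and VUP-35K, a large-scale, high-quality preference dataset created through a fully automated pipeline. Viewed objectively, the unified framework and the parallel exploration of discriminative and generative reward models offer a systematic solution and reliable data support for reward modeling in the video domain.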
Abstract: Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.
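The best-of-N test-time scaling mentioned in the abstract above can be sketched in a few lines: sample N candidate answers, score each with a reward model, and keep the highest-scoring one. This is a minimal illustration, not the authors' code; `toy_reward` is a hypothetical keyword counter standing in for a learned reward model such as VideoDRM.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(answer: str) -> float:
    # Hypothetical stand-in for a learned reward model: count keyword hits.
    keywords = ("dog", "red", "ball")
    return float(sum(word in answer for word in keywords))


samples = [
    "A cat sleeps on a sofa.",
    "A dog chases a ball.",
    "A dog chases a red ball across the lawn.",
]
best = best_of_n(samples, toy_reward)
```

Larger N gives the reward model more candidates to choose from, so the gain depends on how well the reward ranks answers, which is the property a reward benchmark is meant to measure.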
[125] Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding cs.CV | cs.AIPDF
Hang Wu, Sherin Mary Mathews, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang
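TL;DR: This paper proposes SAVEMem, a training-free dual-stage framework for memory management in online streaming video understanding. It builds a three-tier streaming memory with semantic awareness and adapts the retrieval scope to each query, improving model performance under a fixed memory budget while reducing GPU memory usage.
Details
Motivation: Online streaming video understanding faces unbounded video streams and unpredictable query timing. Existing methods typically compress by visual similarity or add retrieval after the fact, so they lack semantic integration and the two stages are hard to coordinate.
Result: Applied to Qwen2.5-VL without training, SAVEMem raises the OVO-Bench overall score from 52.27 to 62.69, yields consistent gains on StreamingBench and ODV-Bench, and reduces peak GPU memory by 48% at 128 frames.
Insight: The novelty lies in introducing a semantic prior into memory generation (via a fixed pseudo-question bank) and in an anchor-conditioned recency gate that adapts the retrieval scope per query, making memory management smarter and more efficient.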
Abstract: Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
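The abstract above says that late interaction between query and memory tokens selects candidate frames, without giving the scoring rule. The sketch below assumes a ColBERT-style MaxSim score (each query token is matched to its most similar frame token, and the matches are summed); that choice, the function names, and the toy two-dimensional embeddings are all illustrative assumptions rather than SAVEMem's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], frame_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the similarity to the best-matching frame token."""
    return sum(max(dot(q, f) for f in frame_tokens) for q in query_tokens)


def select_frames(query_tokens: Sequence[Vector],
                  memory: Sequence[Sequence[Vector]],
                  k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, best first."""
    scores = [maxsim_score(query_tokens, frame) for frame in memory]
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


query = [[1.0, 0.0], [0.0, 1.0]]
memory = [
    [[0.9, 0.1], [0.2, 0.1]],  # frame 0: matches the first query token only
    [[0.1, 0.1], [0.1, 0.2]],  # frame 1: weak match to both
    [[0.8, 0.0], [0.0, 0.9]],  # frame 2: matches both query tokens
]
top = select_frames(query, memory, k=2)
```

Token-level matching lets a frame rank highly when different parts of it answer different parts of the query, which a single pooled embedding per frame would blur.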
[126] One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction cs.CVPDF
Yulong Chen, Xiaoyun Dong, Haoyu Zhang, Zongxian Yang, Lewei Xie
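TL;DR: The paper proposes DUST, a decoupled spatio-temporal Gaussian scene graph, to address dynamic scene reconstruction from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data when vehicle and roadside camera clocks are not synchronized (temporal asynchrony). For each dynamic agent (such as a vehicle or pedestrian), it maintains a canonical Gaussian set with shared appearance while decoupling the pose trajectories and aligning each one to its source's true timestamps, which eliminates gradient conflicts and ghosting in cross-source optimization.
Details
Motivation: Existing Gaussian scene graph methods implicitly assume synchronized observations and assign a single pose per agent per frame. In cooperative settings, however, vehicle and roadside infrastructure cameras run on independent clocks, so observations of the same dynamic object are offset in time. This temporal asynchrony breaks the single-timeline representation, producing an irreducible photometric loss and severe ghosting on dynamic objects during optimization.
Result: On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance. Compared with the strongest baseline, it improves dynamic-area PSNR by 3.2 dB and reduces Fréchet Video Distance (FVD) by 37.7%, while remaining robust under larger temporal asynchrony.
Insight: The core contribution is to address temporal asynchrony at the representation level, with a theoretical framework for decoupled spatio-temporal representation. Specifically, the paper proves that a single-timeline representation incurs a photometric loss that scales quadratically with agent velocity and cross-source time offset, whereas the decoupled representation (a shared canonical appearance set with decoupled pose trajectories) makes the pose-gradient kernel matrix block-diagonal and fully eliminates cross-source interference. In addition, the static anchor-based pose correction pipeline and the pose-regularized joint optimization scheme address two practical challenges, spatial annotation misalignment and early-training trajectory jitter and drift, which makes the method practical.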
Abstract: Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agents, such as cars and pedestrians, at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. The DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source’s true capture timestamps. We prove that this decoupling makes the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make DUST practical, we further introduce a static anchor-based pose correction pipeline that corrects spatial misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, while remaining robust under larger temporal asynchrony. Code is available at https://anonymous.4open.science/r/DUST-6A55.
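The quadratic scaling stated in the abstract above can be illustrated with a one-dimensional toy model. This is a sketch under strong assumptions, not the paper's proof: squared position error stands in for photometric loss, the agent moves at constant velocity, and both function names are hypothetical. With a single shared pose, the best fit is the midpoint of the two observations and the leftover error is (v·Δt)²/2; with one pose per source, the error vanishes.

```python
def single_timeline_residual(v: float, dt: float, p0: float = 0.0) -> float:
    """Least-squares residual when one pose must explain both observations."""
    obs_vehicle = p0           # agent position at the vehicle capture time
    obs_infra = p0 + v * dt    # agent position at the infrastructure capture time
    shared = 0.5 * (obs_vehicle + obs_infra)  # the best single pose is the midpoint
    return (shared - obs_vehicle) ** 2 + (shared - obs_infra) ** 2


def decoupled_residual(v: float, dt: float, p0: float = 0.0) -> float:
    """Residual when each source keeps its own pose at its own capture time."""
    obs_vehicle = p0
    obs_infra = p0 + v * dt
    pose_vehicle, pose_infra = obs_vehicle, obs_infra
    return (pose_vehicle - obs_vehicle) ** 2 + (pose_infra - obs_infra) ** 2


base = single_timeline_residual(v=10.0, dt=0.05)     # 10 m/s agent, 50 ms offset
doubled = single_timeline_residual(v=20.0, dt=0.05)  # twice the speed
```

Doubling either the speed or the time offset quadruples the single-timeline residual in this toy, while the decoupled residual stays at zero for any speed and offset.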
[127] MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence cs.CVPDF
Hanqi Jiang, Junhao Chen, Yi Pan, Lifeng Chen, Weihang You
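TL;DR: The paper proposes MedVIGIL, an evaluation suite that tests whether medical vision-language models (VLMs) recognize broken visual evidence (a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image) and refuse to answer, rather than returning a fluent but wrong answer. The suite contains 300 clinical cases built under end-to-end radiologist supervision and introduces audit metrics including the MedVIGIL Composite Score (MCS).
Details
Motivation: Trustworthy clinical use of medical VLMs requires that a model recognize when the visual evidence behind its answer has failed; evaluation on intact image-question pairs alone is not enough.
Result: The benchmark audits 16 vision-capable models and 2 text-only baselines. The strongest audited model (Claude Opus 4.7) scores MCS 69.2, while an independent radiologist serving as the human reference baseline scores 83.3 (silent-failure rate 5.8%), which indicates a substantial performance gap.
Insight: The novelty lies in a medical VQA benchmark, built under end-to-end clinician supervision, that focuses on the risk of "silent failure" when evidence breaks down. It provides a new, stricter perspective and a standardized tool for assessing the trustworthiness of medical AI, especially its ability to refuse safely.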
Abstract: Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
[128] One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV | cs.AIPDF
Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jin, De Ma
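TL;DR: This paper proposes OneWM-VLA, which compresses each frame's visual information into a single semantic token through adaptive attention pooling and generates the latent stream and the action trajectory jointly under a single flow-matching objective, significantly improving vision-language-action models on long-horizon tasks under a frozen backbone and a constrained adaptation budget.
Details
Motivation: Existing world-model-augmented VLA methods typically process the per-frame visual stream at high visual bandwidth and treat the world model's rollout as a side product of action prediction. Under a frozen backbone and limited adaptation parameters, this leaves both the per-frame representation and the latent action coupling insufficiently optimized.
Result: On MetaWorld MT50, the average success rate rises from 47.9% to 61.3%; on LIBERO-Long the model reaches 95.6% (baseline 85.2%); on Fold Cloth, a long-horizon deformable task on a real Piper arm, it reaches 60.0% (baseline 20.0%).
Insight: The novelty lies in compressing per-frame visual bandwidth to a single token and jointly optimizing latent-stream and action generation under a single objective, which challenges the need for high visual bandwidth and suggests a new way to build world models efficiently with limited parameters.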
Abstract: Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).
[129] Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision cs.CVPDF
Jiacheng Chen, Songze Li, Han Fu, Baoquan Zhao, Wei Liu
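TL;DR: Delta-Adapter is an exemplar-based image editing method that learns transferable editing semantics under single-pair supervision, with no textual guidance. It uses a pre-trained vision encoder to extract a semantic delta encoding the visual transformation between two images and injects it into a pre-trained image editing model through a Perceiver-based adapter, which enables training on large-scale editing datasets and better generalization.
Details
Motivation: Existing exemplar-based image editing methods require pair-of-pairs supervision (two image pairs sharing the same edit semantics), which makes training data hard to curate at scale and limits generalization.
Result: Across multiple benchmarks, Delta-Adapter consistently improves editing accuracy and content consistency over four strong baselines on seen editing tasks, and generalizes more effectively to unseen editing tasks.
Insight: The core innovation is the single-pair supervision paradigm: editing semantics are learned by extracting and injecting a semantic delta, so the target image is never directly exposed to the model, and a semantic delta consistency loss ensures faithful transfer of the transformation. This offers a new approach to scalable image editing without textual guidance.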
Abstract: Exemplar-based image editing applies a transformation defined by a source-target image pair to a new query image. Existing methods rely on a pair-of-pairs supervision paradigm, requiring two image pairs sharing the same edit semantics to learn the target transformation. This constraint makes training data difficult to curate at scale and limits generalization across diverse edit types. We propose Delta-Adapter, a method that learns transferable editing semantics under single-pair supervision, requiring no textual guidance. Rather than directly exposing the exemplar pair to the model, we leverage a pre-trained vision encoder to extract a semantic delta that encodes the visual transformation between the two images. This semantic delta is injected into a pre-trained image editing model via a Perceiver-based adapter. Since the target image is never directly visible to the model, it can serve as the prediction target, enabling single-pair supervision without requiring additional exemplar pairs. This formulation allows us to leverage existing large-scale editing datasets for training. To further promote faithful transformation transfer, we introduce a semantic delta consistency loss that aligns the semantic change of the generated output with the ground-truth semantic delta extracted from the exemplar pair. Extensive experiments demonstrate that Delta-Adapter consistently improves both editing accuracy and content consistency over four strong baselines on seen editing tasks, while also generalizing more effectively to unseen editing tasks. Code will be available at https://delta-adapter.github.io.
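The abstract above describes a semantic delta extracted from the exemplar pair and a consistency loss that aligns the generated output's semantic change with it, but gives neither formula. The sketch below assumes the delta is a plain embedding difference and the loss is one minus cosine similarity; both choices, the function names, and the toy three-dimensional embeddings are hypothetical stand-ins for whatever the authors use.

```python
import math
from typing import List, Sequence


def semantic_delta(src_emb: Sequence[float], tgt_emb: Sequence[float]) -> List[float]:
    """Embedding-space change from a source image to its edited version."""
    return [t - s for s, t in zip(src_emb, tgt_emb)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero delta carries no direction to compare
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def delta_consistency_loss(exemplar_delta: Sequence[float],
                           generated_delta: Sequence[float]) -> float:
    """Assumed form: 0 when the two deltas point the same way, 2 when opposite."""
    return 1.0 - cosine(exemplar_delta, generated_delta)


# Toy embeddings: the exemplar edit shifts the second coordinate upward.
exemplar = semantic_delta([0.2, 0.1, 0.5], [0.2, 0.9, 0.5])
faithful = semantic_delta([0.7, 0.0, 0.1], [0.7, 0.4, 0.1])   # same direction
reversed_edit = semantic_delta([0.7, 0.4, 0.1], [0.7, 0.0, 0.1])
```

Comparing directions rather than raw differences makes the loss insensitive to how strongly the edit is expressed, which is one plausible reason to prefer a cosine form; the paper may use a different distance.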
[130] SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere cs.CVPDF
Chao Huang, Penfei Wei, Wei Wang, Jie Wen, Zhihua Wang
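TL;DR: This paper proposes SphereVAD, a fully training-free, zero-shot video anomaly detection framework. It exploits the anomaly semantics already encoded in intermediate-layer features of pre-trained multimodal large language models and recasts anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere, achieving geometric discrimination without learning new representations.
Details
Motivation: Existing video anomaly detection methods generally depend on large-scale annotations or task-specific training, which limits rapid deployment to new scenes. The authors observe that intermediate-layer features of pre-trained MLLMs already encode rich anomaly semantics, yet existing methods rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations.
Result: On three major benchmarks (UCF-Crime, XD-Violence, ShanghaiTech), SphereVAD sets new state-of-the-art results among training-free methods and remains competitive with fully supervised baselines.
Insight: The core innovation is to recast anomaly detection as geometric inference on the unit hypersphere (vMF geodesic inference) rather than learning new representations. The specific techniques are Fréchet mean centering to unfold the feature distribution and remove domain bias, Holistic Scene Attention to reinforce feature consistency using cross-video priors, and Spherical Geodesic Pulling to align ambiguous segments with directional prototypes on the spherical manifold. The whole pipeline is training-free and needs only a small number of synthetic images for calibration.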
Abstract: Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pre-trained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises-Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unleashing latent discriminability through principled geometric reasoning rather than learning new representations. Specifically, SphereVAD first applies Frechet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only minimal synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.
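The von Mises-Fisher likelihood ratio named in the abstract above has a simple closed form in one special case. A vMF density on the unit sphere is proportional to exp(κ μ·x), so if the normal and anomalous prototypes share the same concentration κ, the normalizing constants cancel and the log-likelihood ratio reduces to κ (μ_a − μ_n)·x. The sketch below implements only that equal-κ case; the function names and toy prototypes are illustrative assumptions, and SphereVAD's full pipeline (centering, attention, geodesic pulling) is not reproduced.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def vmf_log_likelihood_ratio(x: Sequence[float],
                             mu_anomaly: Sequence[float],
                             mu_normal: Sequence[float],
                             kappa: float) -> float:
    """log p(x | anomaly) - log p(x | normal) for vMF densities sharing kappa."""
    x_hat = normalize(x)
    mu_a = normalize(mu_anomaly)
    mu_n = normalize(mu_normal)
    return kappa * sum(xi * (a - n) for xi, a, n in zip(x_hat, mu_a, mu_n))


mu_anomaly, mu_normal = [1.0, 0.0], [0.0, 1.0]
score_anomalous = vmf_log_likelihood_ratio([2.0, 0.0], mu_anomaly, mu_normal, kappa=10.0)
score_normal = vmf_log_likelihood_ratio([0.0, 3.0], mu_anomaly, mu_normal, kappa=10.0)
score_ambiguous = vmf_log_likelihood_ratio([1.0, 1.0], mu_anomaly, mu_normal, kappa=10.0)
```

A positive score favors the anomaly prototype, a negative score favors the normal one, and a feature equidistant from both scores zero; with unequal concentrations the normalizing constants no longer cancel and an extra constant term appears.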
[131] STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation cs.CV | cs.LGPDF
Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang
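TL;DR: STARFlow2 proposes a unified multimodal generation framework based on autoregressive normalizing flows. It vertically interleaves a pretrained vision-language model with a TarFlow stream under a shared causal mask to generate text and images together, addressing the structural mismatch between autoregressive text generation and iterative visual denoising in existing methods.
Details
Motivation: Existing unified multimodal generation systems usually combine an autoregressive language model with a diffusion-based image generator, which creates a structural mismatch between text generation and visual generation. This paper aims to build a truly unified generation paradigm by exploiting what autoregressive normalizing flows share with LLMs, such as the causal mask and the KV-cache mechanism.
Result: Experiments show strong performance on image generation and multimodal understanding benchmarks, validating autoregressive flows as a foundation for unified multimodal modeling.
Insight: The novelty lies in structurally aligning autoregressive normalizing flows with LLMs, deeply integrating text and visual generation through the Pretzel architecture, and using a unified FAE latent space with a deep-shallow flow design to support cache-friendly interleaved generation that avoids re-encoding overhead.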
Abstract: Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers, sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, which makes them the most natural paradigm for truly unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
[132] Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models cs.CVPDF
Kaidi Jia, Yujie Lin, Chengyi Yang, Jiayao Ma, Jinsong Su
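TL;DR: The paper proposes HFRU, a reinforcement unlearning framework for removing sensitive knowledge from vision-language models (VLMs). Its two-stage strategy performs deep semantic removal on the vision encoder, combining alignment disruption with GRPO-based optimization under a composite reward that includes an abstraction reward to reduce object hallucination, and achieves effective forgetting and retention on object recognition and face identity tasks.
Details
Motivation: VLMs raise concerns about privacy, copyright, and bias. Existing unlearning methods mainly fine-tune the language decoder, which leads to superficial forgetting that fails to erase the underlying visual representations and often introduces object hallucination, so a method that performs deep semantic removal on the vision encoder is needed.
Result: Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance while introducing negligible object hallucination, significantly outperforming prior methods.
Insight: The contributions include performing reinforcement unlearning on the vision encoder, a two-stage approach that combines alignment disruption with GRPO optimization, and an abstraction reward that encourages semantically valid substitutions and mitigates hallucination, offering a new approach to deep knowledge removal in VLMs.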
Abstract: Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.
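The GRPO-based optimization mentioned in the abstract above scores each sampled response relative to the other responses drawn for the same prompt, which removes the need for a learned value function. A minimal sketch of that group-relative advantage follows; it uses the population standard deviation (implementations differ on this detail), the rewards are made-up numbers, and HFRU's composite reward itself is not modeled here.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against division by zero
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat their group's average get positive advantages and are reinforced, while below-average ones are pushed down; a group with identical rewards yields all-zero advantages.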
[133] SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation cs.CV | cs.AIPDF
Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng
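TL;DR: This paper proposes SCOPE, a framework that uses structured decomposition and conditional skill orchestration to track semantic commitments in complex image generation, and introduces the Gen-Arena benchmark and the EGIP metric to validate it.
Details
Motivation: Text-to-image models struggle to realize complex visual intents because semantic commitments are hard to track continuously across the generation lifecycle, a problem the paper calls the "Conceptual Rift".
Result: SCOPE substantially outperforms all baselines on Gen-Arena with an EGIP of 0.60, and also achieves strong results on WISE-V (0.907) and MindBench (0.61).
Insight: The novelty lies in formalizing semantic commitments as an actionable structured specification and maintaining them by conditionally invoking retrieval, reasoning, and repair skills, which enables fine-grained control and verification for complex generation tasks.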
Abstract: While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
[134] MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation cs.CVPDF
Xinyan Ye, Jiankang Deng, Abbas Edalat
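TL;DR: MoCoTalk is a multi-conditional video diffusion framework for controllable talking-head generation that unifies four control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and speech audio. To resolve destructive interference among heterogeneous conditions, it introduces an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating. To better capture speech-related facial dynamics, it designs a 3DMM-based Mouth-Augmented Shading Mesh representation that decouples head motion, mouth motion, expression, and lighting.
Details
Motivation: Existing talking-head generation methods usually handle only a subset of identity, pose, expression, and mouth dynamics, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. They lack effective handling of interference among heterogeneous conditions and fine-grained attribute-level control.
Result: Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
Insight: The core contributions are the Adaptive Multi-Condition Router, which varies the fusion strategy with feature subspace and noise level, and the Mouth-Augmented Shading Mesh, which provides a temporally consistent geometric prior and allows flexible recombination of attributes at inference. Together they offer a new approach to interference in multi-condition generation and to tighter audio-visual alignment.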
Abstract: Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
[135] Flow-OPD: On-Policy Distillation for Flow Matching Models cs.CV | cs.AIPDF
Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen
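TL;DR: This paper proposes Flow-OPD, a unified post-training framework for Flow Matching text-to-image models that targets reward sparsity and gradient interference in multi-task alignment. It uses a two-stage strategy: first train domain-specialized teacher models via single-reward GRPO fine-tuning, then establish an initial policy through a flow-based cold-start scheme and consolidate the heterogeneous expertise into a single student via a three-step process (on-policy sampling, task-routing labeling, and dense trajectory-level supervision).
Details
Motivation: Existing Flow Matching text-to-image models face two key bottlenecks in multi-task alignment: reward sparsity caused by scalar rewards, and gradient interference from jointly optimizing heterogeneous objectives. Together these produce a "seesaw effect" among competing metrics and pervasive reward hacking.
Result: Built on Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94, an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent teacher-surpassing effect.
Insight: The main contributions are: 1) bringing on-policy distillation (OPD), which has been successful in the large language model community, to Flow Matching models for the first time; 2) a two-stage alignment strategy that trains single-task experts before consolidating them; 3) a flow-based cold-start scheme and a task-routing labeling mechanism; 4) Manifold Anchor Regularization (MAR), which uses a task-agnostic teacher to provide full-data supervision and mitigates the aesthetic degradation common in purely RL-driven alignment.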
Abstract: Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a ‘seesaw effect’ of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent ‘teacher-surpassing’ effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
[136] Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment cs.CVPDF
Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno
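TL;DR: This paper proposes Proxy3D, a compact yet comprehensive 3D proxy representation for vision-language models (VLMs). Taking only video frames as input, it extracts scene features with semantic and geometric encoders and applies semantic-aware clustering to produce a set of proxies in 3D space. The 3D proxy representation is aligned with the VLM by curating the SpaceSpan dataset and applying multi-stage training. With shorter visual sequences, the method achieves competitive or state-of-the-art performance on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.
Details
Motivation: Most existing VLM methods follow the 2D pipeline and use pixel-aligned representations. As a result, correspondence-based models fall short on spatial consistency, while representation-based models with 3D geometric priors are inefficient in vision sequence serialization. This paper aims to design an efficient and comprehensive 3D representation for the vision modality of VLMs.
Result: On 3D visual question answering, visual grounding, and general spatial intelligence benchmarks, the method achieves competitive or state-of-the-art (SOTA) performance while using shorter visual sequences.
Insight: The novelty lies in generating 3D proxy representations through semantic-aware clustering, which compresses complex 3D scene information into a compact set of proxies and improves the efficiency with which a VLM processes visual information while retaining spatial understanding. Viewed objectively, combining 3D geometric priors with semantic clustering gives VLMs a novel and efficient representation learning paradigm for 3D spatial reasoning.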
Abstract: Spatial intelligence in vision-language models (VLMs) attracts research interest, driven by the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose Proxy3D, a method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then cluster them in a semantic-aware manner to obtain a set of proxies in 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training so that the VLM adopts the proposed 3D proxy representations. While using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.
[137] EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction cs.CV | cs.AIPDF
Wei Yu, Yunhang Qian
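TL;DR: This paper proposes EmambaIR, an image reconstruction method built on an efficient visual state space model and designed for spatially sparse, temporally continuous event streams. It uses a cross-modal Top-k Sparse Attention Module (TSAM) and a Gated State-Space Module (GSSM) to overcome the limits of CNNs and ViTs in global feature capture and computational complexity, achieving SOTA performance on motion deblurring, deraining, and HDR enhancement while significantly reducing memory and compute cost.
Details
Motivation: In existing event-based image reconstruction, CNNs struggle to capture global feature correlations and ViTs incur quadratic computational complexity (O(n^2)), which makes them hard to apply at high resolution.
Result: Extensive experiments on six datasets across three image reconstruction tasks (motion deblurring, deraining, and HDR enhancement) show that EmambaIR significantly outperforms existing SOTA methods while substantially reducing memory consumption and computational cost.
Insight: The contributions are: 1) the cross-modal Top-k Sparse Attention Module (TSAM), which uses pixel-level top-k sparse attention for efficient cross-modal interaction and yields rich yet sparse fusion features; 2) the Gated State-Space Module (GSSM), which uses a nonlinear gated unit to enhance the temporal representation of linear-complexity (O(n)) state space models and captures global contextual dependencies without the typical computational overhead. Viewed objectively, the method brings state space models (SSMs) to event-guided image reconstruction and combines them with sparse attention, improving performance while keeping linear complexity, which makes it an efficient and novel architecture.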
Abstract: Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity ($O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks, namely motion deblurring, deraining, and High Dynamic Range (HDR) enhancement, demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR
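The top-k sparse attention idea in the abstract above can be shown for a single query: score all keys, keep only the k highest, and apply softmax over those alone. This is a generic single-query sketch with scaled dot-product scores, not the cross-modal, pixel-level TSAM module itself; the function name and toy inputs are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def topk_sparse_attention(query: Sequence[float],
                          keys: Sequence[Sequence[float]],
                          values: Sequence[Sequence[float]],
                          k: int) -> Tuple[List[float], List[int]]:
    """Attend from one query to only its k highest-scoring keys."""
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores; every other key is dropped entirely.
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the kept keys (shifted by the max for stability).
    top = max(scores[i] for i in kept)
    weights = [math.exp(scores[i] - top) for i in kept]
    total = sum(weights)
    output = [0.0] * len(values[0])
    for i, w in zip(kept, weights):
        for d, v in enumerate(values[i]):
            output[d] += (w / total) * v
    return output, sorted(kept)


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_1, kept_1 = topk_sparse_attention([1.0, 0.0], keys, values, k=1)
out_2, kept_2 = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
```

With k=1 the output is exactly the value of the best-matching key; raising k blends in the next-best matches while the discarded keys contribute nothing, which is where the sparsity of the fused features comes from.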
cs.RO [Back]
[138] Weather-Robust Scene Semantics with Vision-Aligned 4D Radar cs.RO | cs.CVPDF
Kali Hamilton, Christoffer Heckman
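TL;DR: This paper proposes a method for weather-robust scene semantic understanding using vision-aligned 4D radar. It aligns a radar encoder to frozen SigLIP vision embeddings and decodes structured scene captions through a frozen vision-language model (VLM), with approximately 7M trainable parameters. On held-out fog, light snow, and heavy snow sequences from K-RADAR, it outperforms a camera baseline that collapses to over 90% hallucination.
Details
Motivation: Cameras and LiDAR degrade severely in rain, fog, and snow, while millimeter-wave radar is largely unaffected. The aim is robust scene semantic understanding from radar.
Result: On K-RADAR, all radar configurations outperform the camera baseline (whose hallucination rate exceeds 90%) on held-out fog, light snow, and heavy snow test sequences.
Insight: The main contributions are: 1) aligning a radar encoder to frozen vision embeddings and decoding with a frozen VLM, which gives an efficient radar-VLM fusion; 2) identifying token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM, and resolving it with LayerNorm on the projector output; 3) an analysis of encoder complexity, caption format, and pooling strategy that exposes tradeoffs for future radar-VLM pipeline design.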
Abstract: Cameras and LiDAR degrade in rain, fog, and snow, while millimeter-wave radar remains largely unaffected. We align a radar encoder to frozen SigLIP vision embeddings and decode structured scene captions through a frozen vision-language model (VLM) with approximately 7M trainable parameters. On K-RADAR with held-out fog, light snow, and heavy snow sequences, all radar configurations outperform a camera baseline that collapses to over 90% hallucination. We identify a token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM and show that projector-output LayerNorm resolves it. Analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.
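The token-norm mismatch described in the abstract above is easy to see numerically: projected radar tokens can arrive at arbitrary scales, while LayerNorm rescales every token to zero mean and unit variance, so each normalized token has an L2 norm close to the square root of its dimension. The sketch below omits LayerNorm's learned gain and bias, and the two toy tokens are made-up values rather than real projector outputs.

```python
import math
from typing import List, Sequence


def layer_norm(token: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


# Two hypothetical projector outputs whose scales differ by a factor of ten.
large_token = [100.0, -300.0, 250.0, 50.0]
small_token = [10.0, -30.0, 25.0, 5.0]

raw_norms = (l2_norm(large_token), l2_norm(small_token))
normed_norms = (l2_norm(layer_norm(large_token)), l2_norm(layer_norm(small_token)))
```

Before normalization the two norms differ tenfold; afterwards both are close to 2, the square root of the four-dimensional token size, so the frozen model sees inputs at a consistent scale regardless of how the projector scaled them.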
[139] TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning cs.RO | cs.AI | cs.CV | cs.LGPDF
Giacomo Spigler
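TL;DR: This paper introduces TAVIS, a benchmark for evaluating active vision in imitation learning. It comprises two task suites, TAVIS-Head and TAVIS-Hands, and introduces evaluation primitives including a paired headcam-vs-fixedcam protocol, the GALT metric, and procedural distribution-shift splits.
Details
Motivation: Active vision in imitation learning lacks a shared benchmark for quantifying what it contributes across tasks and conditions.
Result: Baseline experiments with Diffusion Policy and π_0 show that active vision generally helps but its benefit varies by task, that multi-task policies degrade sharply under distribution shift, and that imitation learning yields anticipatory gaze comparable to that of the human operator.
Insight: The novelty lies in building the first comprehensive benchmark for active-vision imitation learning and in introducing GALT, a metric grounded in cognitive science that quantifies anticipatory gaze in learned policies.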
Abstract: Active vision – where a policy controls its own gaze during manipulation – has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites – TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) – on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $π_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.
[140] 123D: Unifying Multi-Modal Autonomous Driving Data at Scale cs.RO | cs.CVPDF
Daniel Dauner, Valentin Charraut, Bastian Berle, Tianyu Li, Long Nguyen
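TL;DR: This paper presents 123D, an open-source framework for unifying large-scale multi-modal autonomous driving data. A single API provides synchronous and asynchronous access to sensor modalities (cameras, lidar, ego states, annotations, traffic lights, and HD maps). The framework consolidates eight real-world datasets (3,300 hours and 90,000 km in total) plus a configurable synthetic dataset, and ships tools for data analysis and visualization.
Details
Motivation: Autonomous driving data is heterogeneous and fragmented: datasets differ in modalities, rates, synchronization, and annotation conventions, which makes it hard to use them jointly or to evaluate generalization across datasets.
Result: Using 123D, the authors conduct a systematic study of the eight real-world datasets, comparing annotation statistics and assessing pose and calibration accuracy, and they demonstrate two applications: cross-dataset 3D object detection transfer and reinforcement learning for planning.
Insight: The key idea is a unified data representation based on timestamped event streams, which supports synchronous or asynchronous access across arbitrary datasets. This addresses multi-modal data integration and cross-dataset evaluation and gives large-scale driving research a standardized toolset.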
Abstract: The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset’s pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.
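The event-stream idea in the abstract (each modality stored as an independent timestamped stream with no prescribed rate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the py123d API: `EventStream`, `latest_at`, and `synchronized_frame` are hypothetical names, and "synchronous access" is assumed here to mean "most recent event at or before the query time".

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EventStream:
    """One modality stored as (timestamp, payload) events with no fixed rate."""
    timestamps: List[float] = field(default_factory=list)
    payloads: List[Any] = field(default_factory=list)

    def append(self, t: float, payload: Any) -> None:
        # Assumption: events are appended in non-decreasing time order.
        if self.timestamps and t < self.timestamps[-1]:
            raise ValueError("events must be appended in time order")
        self.timestamps.append(t)
        self.payloads.append(payload)

    def latest_at(self, t: float) -> Optional[Any]:
        """Return the most recent payload with timestamp <= t, or None."""
        i = bisect_right(self.timestamps, t)
        return self.payloads[i - 1] if i > 0 else None


def synchronized_frame(streams: Dict[str, EventStream], t: float) -> Dict[str, Any]:
    """Synchronous view: sample every stream at the same query time."""
    return {name: stream.latest_at(t) for name, stream in streams.items()}


# Two hypothetical sensors running at different rates.
camera, lidar = EventStream(), EventStream()
for k in range(5):
    camera.append(k * 0.1, f"img{k}")    # roughly 10 Hz
for k in range(3):
    lidar.append(k * 0.2, f"scan{k}")    # roughly 5 Hz

frame = synchronized_frame({"camera": camera, "lidar": lidar}, t=0.25)
```

Asynchronous access in this sketch is simply iterating one stream's `timestamps` and `payloads` directly, without resampling the others.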
cs.AI [Back]
[141] More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models cs.AI | cs.CL | cs.LGPDF
Xiao Wang
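TL;DR: In multiple-choice QA, the answer-position bias of reasoning-capable models (such as DeepSeek-R1) grows with the length of the reasoning trajectory, contrary to the common assumption that careful thinking reduces heuristic biases.
Details
Motivation: To test whether chain-of-thought reasoning and reasoning-tuned models really reduce shallow heuristic biases by thinking carefully, using position bias in multiple-choice questions as the test case.
Result: Twelve of thirteen reasoning-mode configurations show a positive partial correlation (0.11 to 0.41) between trajectory length and Position Bias Score (PBS) after controlling for accuracy. For DeepSeek-R1 at 671B, aggregate position bias is small (PBS = 0.019), but the effect persists in the longest trajectory-length quartile (PBS = 0.071).
Insight: The paper identifies length-driven position bias in reasoning models: longer reasoning can accumulate and amplify a preference for particular answer positions rather than remove it. This challenges the assumption that reasoning models are order-robust by default, and the paper provides a diagnostic toolkit for auditing such bias.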
Abstract: Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles. A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism. We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.
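The headline statistic above, a partial correlation between trajectory length and PBS after controlling for accuracy, can be computed from three pairwise Pearson correlations. The sketch below uses the standard first-order partial correlation formula; the per-question values are synthetic and only illustrate the call, and how the paper computes PBS per question is not reproduced here.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation of x and y with the linear effect of z removed from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic, illustrative values only (not data from the paper).
trajectory_length = [120, 340, 560, 800, 1500, 2100]   # tokens per question
position_bias = [0.02, 0.05, 0.04, 0.09, 0.12, 0.15]   # assumed per-question PBS
accuracy = [1, 1, 0, 1, 0, 0]                          # 1 = answered correctly

r_partial = partial_corr(trajectory_length, position_bias, accuracy)
```

Controlling for accuracy matters because hard questions tend to produce both longer trajectories and more errors; without the control, a raw length-bias correlation could simply reflect difficulty.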
[142] State Representation and Termination for Recursive Reasoning Systems cs.AI | cs.CL | cs.LGPDF
Debashis Guha, Amritendu Mukherjee, Sanjay Kukreja, Tarun Kumar
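TL;DR: This paper proposes a framework for state representation and termination in recursive reasoning systems. The reasoning state is represented as an epistemic state graph encoding extracted claims, evidential relations, open questions, and confidence weights, and the "order-gap" is defined as a convergence measure that indicates when to stop iterating.
Details
Motivation: Two design choices in recursive reasoning systems are usually left implicit: how to represent the evolving reasoning state, and when to stop iterating.
Result: The main theoretical result gives a necessary and sufficient condition for the linearised order-gap to be non-degenerate near the fixed point, showing when the criterion is informative rather than algebraically vacuous. It is a local condition, not a global convergence guarantee.
Insight: The contribution is formalizing the reasoning state as an epistemic state graph and introducing the order-gap as an operational, local convergence criterion. The framework is sketched for agent loops, tree-of-thought reasoning, theorem proving, and continual learning.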
Abstract: Recursive reasoning systems alternate between acquiring new evidence and refining an accumulated understanding. Two design choices are typically left implicit: how to represent the evolving reasoning state, and when to stop iterating. This paper addresses both. We represent the reasoning state as an epistemic state graph encoding extracted claims, evidential relations, open questions, and confidence weights. We define the order-gap as the distance between the states reached by expand-then-consolidate versus consolidate-then-expand; a small order-gap suggests that the two orderings agree and further iteration is unlikely to help. Our main result gives a necessary and sufficient condition for the linearised order-gap to be non-degenerate near the fixed point, showing when the criterion is informative rather than algebraically vacuous. This is a local condition, not a global convergence guarantee. We apply the framework to recursive reasoning systems and sketch its application to agent loops, tree-of-thought reasoning, theorem proving, and continual learning.
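The order-gap definition (distance between expand-then-consolidate and consolidate-then-expand) can be illustrated on a toy state. Everything below is an assumption made for illustration: the state is a flat list of confidence weights rather than an epistemic state graph, and the `expand` and `consolidate` operators are hypothetical stand-ins, not the paper's operators.

```python
import math
from typing import Callable, List

State = List[float]  # toy stand-in for an epistemic state graph


def expand(state: State) -> State:
    # Hypothetical "acquire evidence" step: nudge every confidence toward 1.
    return [0.9 * w + 0.1 for w in state]


def consolidate(state: State) -> State:
    # Hypothetical "refine" step: discard claims with confidence below 0.5.
    return [w if w >= 0.5 else 0.0 for w in state]


def order_gap(
    state: State,
    expand_fn: Callable[[State], State] = expand,
    consolidate_fn: Callable[[State], State] = consolidate,
) -> float:
    """Euclidean distance between the two orderings applied to the same state."""
    expand_then_consolidate = consolidate_fn(expand_fn(state))
    consolidate_then_expand = expand_fn(consolidate_fn(state))
    return math.dist(expand_then_consolidate, consolidate_then_expand)


# A borderline claim (0.45) makes the two orderings disagree.
gap_unsettled = order_gap([0.45, 0.8])
# At a state both operators leave unchanged, the orderings agree.
gap_settled = order_gap([1.0, 1.0])
```

In this toy, the gap is non-zero only when a claim sits near the pruning threshold, which matches the intuition in the abstract that a small gap suggests further iteration is unlikely to help.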
[143] When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment cs.AI | cs.CL | cs.LGPDF
Long Zhang, Wei-neng Chen, Feng-feng Wei, Zi-bo Qin
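TL;DR: This paper studies when a language model’s answer preference stabilizes before it produces the final answer. The authors propose a computable framework based on finite-answer preference stabilization: projecting the model’s own continuation probabilities onto a finite answer set defines answer onset, retrospective stabilization time, and lead. Experiments use Qwen3-4B-Instruct.
Details
Motivation: Language models usually generate reasoning before the final answer, but the visible answer does not reveal when the model’s answer preference became stable. The paper studies this through a computable object, finite-answer preference stabilization.
Result: In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with a mean lead of 17–31 tokens in the main templates and a shorter but still positive lead in a parser-clean replication. The signal tracks the model’s eventual output rather than the truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information.
Insight: The contribution is a parser-based framework for measuring answer-preference stabilization that does not rely on greedy rollouts or learned probes. It quantifies when the model’s internal answer preference forms, offering a new perspective and computable tools for understanding how language models reach decisions.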
Abstract: Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model’s answer preference became stable. We study this question through a narrow computable object: finite-answer preference stabilization. For a model state and specified answer verbalizers, we project the model’s own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, $\delta(\xi)=S_\theta(\mathrm{yes}\mid\xi)-S_\theta(\mathrm{no}\mid\xi)$. This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes. In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with 17–31 token mean lead in the main templates and positive, shorter lead in a parser-clean replication. The signal tracks the model’s eventual output rather than truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information without a single invariant coordinate. Diagnostics separate the measurement from online stopping, verbalizer-free belief, and causal answer control; exact steering shows local sensitivity of $\delta$ but not reliable generation control.
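A rough sketch of the binary log-odds projection and a stabilization time follows. Several pieces are assumptions made for illustration and are not taken from the paper: the score S is taken to be the log of the summed probability over each answer's verbalizers, stabilization is defined as the earliest step from which the sign of the log-odds matches its final sign, and the per-step values and parse step are made up.

```python
import math
from typing import Dict, List, Sequence


def log_odds(probs: Dict[str, float], yes: Sequence[str], no: Sequence[str]) -> float:
    """Log-odds of 'yes' vs 'no' after projecting onto the two verbalizer sets.

    Assumes both sets receive non-zero probability mass.
    """
    p_yes = sum(probs.get(token, 0.0) for token in yes)
    p_no = sum(probs.get(token, 0.0) for token in no)
    return math.log(p_yes) - math.log(p_no)


def stabilization_step(deltas: List[float]) -> int:
    """Earliest index from which the sign of delta equals its final sign."""
    final_positive = deltas[-1] > 0
    step = len(deltas) - 1
    while step > 0 and (deltas[step - 1] > 0) == final_positive:
        step -= 1
    return step


# Hypothetical continuation probabilities at one model state.
delta_now = log_odds(
    {"Yes": 0.6, "yes": 0.1, "No": 0.2, "no": 0.1},
    yes=["Yes", "yes"],
    no=["No", "no"],
)

# Hypothetical per-token log-odds along one generation.
deltas = [-0.2, 0.3, -0.1, 0.4, 1.2, 2.0]
answer_onset = 5                      # assumed step where a parser first reads the answer
stable_from = stabilization_step(deltas)
lead = answer_onset - stable_from     # tokens by which the preference precedes the parse
```

The stabilization step is computed retrospectively, from the end of the trace backwards, so it is a measurement of a finished generation rather than an online stopping rule.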
[144] Uneven Evolution of Cognition Across Generations of Generative AI Models cs.AI | cs.CVPDF
Isaac Galatzer-Levy, Daniel McDuff, Xin Liu, Jed McGiffin
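TL;DR: This paper introduces a psychometric framework for assessing the cognitive abilities of generative AI models and finds a markedly uneven profile: verbal comprehension and working memory are near the top of the human range, while perceptual reasoning is near the bottom. Tracking successive model generations with the AIQ Benchmark shows asymmetric gains, and abstract quantitative reasoning is far stronger in a linguistic format than in a visual one, which points to an architectural bias toward language.
Details
Motivation: To evaluate the cognitive capabilities of generative AI models beyond narrow task performance and to track their evolution across generations, exposing imbalances and developmental bottlenecks.
Result: On tasks adapted from the Wechsler Adult Intelligence Scale, leading multimodal models score above the 98th human percentile in verbal comprehension and working memory but below the 1st percentile in perceptual reasoning. The AIQ Benchmark shows significant but uneven gains across six model generations, with abstract quantitative reasoning far stronger in the linguistic format than in the visual one.
Insight: The contribution is a psychometric framework and the AIQ Benchmark for quantifying AI cognitive abilities. The results reveal uneven cognitive development and a language bias in generative models, suggesting that scaling and optimization alone may not overcome architectural limitations on the way to balanced, human-like general intelligence.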
Abstract: The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.
stat.ML [Back]
[145] Reliable Chain-of-Thought via Prefix Consistency stat.ML | cs.CL | cs.LGPDF
Naoto Iwase, Yuki Ichihara, Mohammad Atif Quamar, Junpei Komiyama
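TL;DR: This paper proposes prefix consistency, a reliability signal for improving LLM accuracy on reasoning tasks. A chain of thought is truncated partway through and the remainder regenerated; how often the original answer reappears is used to weight candidate answers, which cuts compute while matching or improving on majority-voting accuracy.
Details
Motivation: Self-consistency methods such as majority voting are costly because they aggregate many chain-of-thought samples. The paper seeks a more efficient reliability signal that needs neither token log-probabilities nor self-rating prompts.
Result: Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings. Reweighting votes by it reaches standard majority-voting plateau accuracy with up to 21x fewer tokens (median 4.6x).
Insight: The contribution is the observation that traces with correct answers reproduce their answer more often than traces with wrong answers when truncated and regenerated. This difference is turned into a lightweight reliability signal that needs no access to model-internal probabilities and enables more efficient answer aggregation.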
Abstract: Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
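The reweighting step described above can be sketched in a few lines. This is an illustration under assumptions, not the released code: regeneration itself is not shown (the regenerated answers are given as data), and the vote weight is taken to be the plain fraction of regenerations that reproduce the trace's original answer.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Each trace: (original answer, answers obtained by truncating and regenerating it).
Trace = Tuple[str, List[str]]


def prefix_consistency(original: str, regenerated: List[str]) -> float:
    """Fraction of regenerations that reproduce the original answer."""
    if not regenerated:
        return 0.0
    return sum(answer == original for answer in regenerated) / len(regenerated)


def majority_vote(traces: List[Trace]) -> str:
    """Standard self-consistency: every trace counts as one vote."""
    return Counter(answer for answer, _ in traces).most_common(1)[0][0]


def weighted_vote(traces: List[Trace]) -> str:
    """Each trace votes for its answer with weight equal to its prefix consistency."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, regenerated in traces:
        scores[answer] += prefix_consistency(answer, regenerated)
    return max(scores, key=scores.get)


# Made-up example: the wrong answer "17" has more votes but is unstable under regeneration.
traces = [
    ("42", ["42", "42", "42", "41"]),
    ("17", ["17", "42", "9", "42"]),
    ("17", ["3", "17", "42", "42"]),
]
plain = majority_vote(traces)
reweighted = weighted_vote(traces)
```

In this toy case the two rules disagree: plain voting picks the more frequent answer, while the reweighted vote favors the answer that survives regeneration.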
[146] Consistency Regularised Gradient Flows for Inverse Problems stat.ML | cs.CV | cs.LGPDF
Alessio Spagnoletti, Tim Y. J. Wang, Marcelo Pereyra, O. Deniz Akyildiz
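TL;DR: This paper proposes a unified Euclidean-Wasserstein-2 gradient-flow framework for solving inverse problems with vision-language latent diffusion models (LDMs). It jointly performs posterior sampling and prompt optimization in the latent space, aligning the prior and posterior with the observed data, and achieves efficient, high-quality image reconstruction with fewer neural function evaluations (NFEs) and no backpropagation through autoencoders.
Details
Motivation: Existing LDM-based inverse solvers typically need many NFEs and backpropagation through large pretrained components, which is computationally expensive and sometimes degrades reconstruction quality. The paper aims for a more efficient method.
Result: Experiments on several canonical imaging inverse problems show state-of-the-art performance at significantly reduced computational cost.
Insight: The main contribution is a single gradient flow that combines posterior sampling and prompt optimization, aligning prior, posterior, and data directly in latent space. It avoids backpropagation through autoencoders and, combined with few-step latent text-to-image models, enables low-NFE inference.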
Abstract: Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.
physics.optics [Back]
[147] Pre-training Enables Extraordinary All-optical Image Denoising physics.optics | cs.CVPDF
Xudong Lv, Yuxiang Sun, Shuo Wang, Nanxing Chen, Jun Guan
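TL;DR: This paper proposes a pre-training-based approach for optical neural networks that enables efficient all-optical image denoising. A two-step optimization (pre-training on a large dataset, then fine-tuning on task-specific datasets) substantially improves denoising under severe noise, raising PSNR from below 8 dB to above 18 dB, and works across diverse image styles.
Details
Motivation: Optical neural networks offer advantages in speed and energy efficiency, but their training methods remain underexplored, leading to suboptimal performance. The paper uses pre-training to improve the quality and generalization of optical image denoising.
Result: Under severe noise (PSNR below 8 dB), the method shows clear advantages over conventional Fourier-domain filtering and directly trained diffractive networks, raising PSNR to above 18 dB. It is validated on MNIST, ChestMNIST, CIFAR-10, and CelebA.
Insight: The contribution is bringing large-scale pre-training plus task-specific fine-tuning to optical neural networks. This yields denoising that generalizes across image domains and supports practical deployment of optical computing in vision applications such as face detection and plate recognition.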
Abstract: Optical neural networks are emerging as powerful machine learning and information processing tools because of their potential advantages in speed and energy efficiency. The training methods of these physical models, however, remain underexplored compared to their digital counterparts and are leading to suboptimal performance. This paper reports a pre-training-driven approach that leads to snapshot image denoising with substantially improved quality. We demonstrated effective free-space optical denoising by a diffractive network optimized by a two-step process including (1) pre-training using a massive dataset of 3.45 million diverse but simple images and (2) fine-tuning with the corresponding task-specific datasets. Compared to conventional Fourier-domain filtering and directly trained diffractive networks, such a transfer learning process exhibited prominent advantages for denoising images degraded by severe noise, peak signal-to-noise ratio (PSNR) below 8 dB, while preserving fine image features and improving the PSNR to above 18 dB. Importantly, the same pre-trained optical network could be consistently fine-tuned to process degraded images from highly diverse styles ranging from handwritten digits (MNIST) and chest X-rays (ChestMNIST) to CIFAR-10 images and human faces (CelebA). We further demonstrated the critical role of our optical denoisers in vision-based applications, including face detection, plate recognition, and localization of UAVs in noisy conditions.
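For readers less familiar with the metric, PSNR is defined as 10·log10(MAX² / MSE), so the reported move from below 8 dB to above 18 dB corresponds to at least a tenfold reduction in mean squared error. A minimal sketch, assuming pixel values normalized to [0, 1] and images flattened to lists:

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], degraded: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio(psnr_before_db: float, psnr_after_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises from one value to the other."""
    return 10 ** ((psnr_after_db - psnr_before_db) / 10)


clean = [0.0, 0.0, 0.0, 0.0]
noisy = [0.1, 0.1, 0.1, 0.1]          # uniform error of 0.1 gives MSE = 0.01
example_db = psnr(clean, noisy)
improvement = mse_ratio(8.0, 18.0)    # the 8 dB to 18 dB range quoted in the abstract
```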
eess.IV [Back]
[148] Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention eess.IV | cs.AI | cs.CV | cs.LGPDF
Daniel Mensing, Jan Kapar, Jochen G. Hirsch, Matthias Günther, Horst Hahn
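TL;DR: This paper proposes a multimodal latent diffusion model that jointly synthesizes volumetric (3D) MRI and tabular clinical data in a shared latent space via cross-attention, enabling coherent joint representation learning of the two modalities. A variational autoencoder fuses the modalities before diffusion-based synthesis, and separate decoders reconstruct MRI and tabular data. Evaluated on data from more than 10,000 participants of the German National Cohort (NAKO Gesundheitsstudie), the generated MRI volumes are anatomically plausible and consistent with the synthesized tabular attributes (such as age, sex, body measurements, and ethnicity).
Details
Motivation: Jointly modeling MRI and mixed-type tabular data in a single generative model is challenging. The paper aims to generate coherent synthetic multimodal patient data in support of digital twins in healthcare.
Result: For images, Fréchet distance and precision-recall metrics confirm high-fidelity generation. For tabular data, the model outperforms CTGAN on standard evaluation metrics and is comparable to TVAE, making it competitive with established unimodal baselines.
Insight: To the authors’ knowledge, this is the first work to jointly model MRI and mixed-type tabular data in a single latent diffusion framework. Cross-attention and a shared latent space give coherent multimodal synthesis, providing a proof of concept for generating synthetic multimodal patient data.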
Abstract: We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fréchet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.
cs.MM [Back]
[149] Anisotropic Modality Align cs.MM | cs.CVPDF
Xiaomin Yu, Yijiang Li, Yuhui Zhang, Hanzhen Zhao, Yue Yang
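TL;DR: High-quality paired data for training multimodal large language models (MLLMs) is scarce. This paper revisits the idea of using the shared representation space of pretrained multimodal contrastive models as a bridge, so that multimodal training can use unimodal data. It finds that what blocks modality interchangeability is not a simple global shift but an anisotropic residual structure concentrated along a few dominant directions. Based on this, the authors propose the principle of anisotropic modality gap alignment and a geometric correction framework, AnisoAlign, which applies a bounded correction to unpaired modality representations to build substitute representations in the target modality. Experiments validate the method in geometric diagnostics and text-only MLLM training.
Details
Motivation: Paired multimodal data is scarce, and training on unimodal data through a shared contrastive space rests on a poorly understood premise: that representations from different modalities can be reliably interchanged. The core obstacle is the persistent modality gap in the shared space.
Result: Experiments confirm the benefits of AnisoAlign in geometric diagnostics and text-only MLLM training, indicating that the method aligns the modality gap effectively and improves model performance.
Insight: The paper recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and proposes the principle of anisotropic modality gap alignment. By analyzing the geometry of modality representations, it shows that the key obstacle to interchangeability is an anisotropic residual structure rather than a global shift, and it offers both a new alignment perspective and a concrete correction framework for training multimodal models with unimodal data.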
Abstract: Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
eess.SP [Back]
[150] Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference eess.SP | cs.CVPDF
Jingyi Liu, Cheng Yuan, Lijun He, Jun Zhang, Jiawei Shao
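TL;DR: This paper proposes a task-oriented communication framework for human action understanding (TOAU) that uses edge-cloud co-inference to avoid the high bandwidth use, latency, and privacy leakage of conventional video transmission. On the edge, a pose estimator and a VQ-VAE compress raw video into a compact sequence of motion tokens for transmission. In the cloud, a lightweight projector aligns the tokens with the embedding space of a large vision-language model (VLM) for complex action understanding.
Details
Motivation: In smart sensing applications, conventional approaches transmit massive video data from resource-constrained edge devices to cloud servers, which consumes large uplink bandwidth, incurs unacceptable latency, and raises privacy concerns.
Result: Evaluations on three benchmarks show that, compared with video-codec-based solutions, TOAU reduces the transmission payload to approximately 1% and system latency to around 20% while keeping comparable action understanding accuracy.
Insight: The contribution is applying task-oriented communication to human action understanding: extracting and quantizing key pose information (motion tokens) instead of sending raw pixels compresses the data drastically, and an efficient instruction-tuning paradigm lets a large VLM do the understanding, balancing communication efficiency against task performance. The edge-cloud split (compact representation at the edge, complex reasoning in the cloud) and the use of a VLM for cross-modal alignment are ideas worth borrowing.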
Abstract: The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1% and the system latency to around 20% compared to video codec-based solutions, while delivering comparable action understanding accuracy.
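The "9 bits per frame" figure is what a single index into a 512-entry codebook would cost, since 2^9 = 512. The sketch below illustrates vector quantization and the resulting bit budget; the one-index-per-frame layout, the codebook size, the 2-D pose features, and the raw-pose comparison are all assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def quantize(vector: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the nearest codebook entry (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(vector, codebook[k]))


def bits_per_index(codebook_size: int) -> int:
    """Bits needed to transmit one codebook index."""
    return math.ceil(math.log2(codebook_size))


# Tiny hypothetical codebook over 2-D pose features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
token = quantize([0.9, 0.1], codebook)   # only this integer would be transmitted

# Hypothetical bit budget: one index per frame from a 512-entry codebook,
# compared with sending 17 joints x 3 coordinates as 32-bit floats.
token_bits = bits_per_index(512)
raw_pose_bits = 17 * 3 * 32
```

Because only indices cross the network, the receiver needs its own copy of the codebook to map each index back to a motion embedding.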
cs.SE [Back]
[151] SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair cs.SE | cs.CLPDF
Ion George Dinu, Marian Cristian Mihăescu, Traian Rebedea
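TL;DR: This paper presents SmellBench, a task orchestration framework and evaluation benchmark for assessing LLM agents on architectural code smell repair. It reports the first empirical evaluation of 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 high-severity architectural smells in the Python project scikit-learn.
Details
Motivation: Architectural code smells erode maintainability and are costly to repair manually. LLM agents are good at fixing localized bugs and code-level refactoring, but their ability to repair architectural smells, which requires cross-module reasoning about design intent, has not been explored.
Result: On the expert-validated dataset (in which 63.1% of detections are false positives), the best agent achieves a 47.7% resolution rate, and agreement with experts on identifying false positives reaches up to κ = 0.94. However, repair aggressiveness is inversely related to net codebase quality: the most aggressive agent introduces 140 new smells.
Insight: The paper is the first systematic evaluation of LLM agents on architectural code smell repair and exposes a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure: smell-type-specific optimized prompts, support for iterative multi-step execution, and a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact.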
Abstract: Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $κ= 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.
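The agreement figure above is a Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard two-rater formula follows; the label names and the example judgments are made up, and the paper's exact labeling protocol is not reproduced.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    Assumes chance agreement is below 1 (the raters do not both use a single label).
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n ** 2
    return (observed - expected) / (1 - expected)


# Made-up judgments: "FP" = detection is a false positive, "TP" = real smell.
expert = ["FP", "FP", "TP", "TP"]
agent = ["FP", "FP", "TP", "FP"]
kappa = cohen_kappa(expert, agent)
```

Kappa is more informative than raw agreement here because most detections are false positives, so an agent that always answered "FP" would already agree with the experts on a majority of items.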
cs.IR [Back]
[152] Bridging Textual Profiles and Latent User Embeddings for Personalization cs.IR | cs.CLPDF
Zhaoxuan Tan, Xiang Zhai, Yan Zhu, Meng Jiang, Mohamed Hammad
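TL;DR: This paper proposes BLUE, a reinforcement learning framework that aligns text-based user profiles with embedding-based recommendation objectives, unifying interpretable textual representations with effective latent user embeddings.
Details
Motivation: Supervised latent user embeddings are effective for retrieval but hard to interpret, while textual user profiles are interpretable but hard to optimize for downstream utility because they lack direct supervision. BLUE aims to bridge this gap.
Result: In zero-shot sequential recommendation experiments on Amazon Reviews 2023 and Google Local Reviews, BLUE outperforms strong baselines under both frozen and trainable embedding conditions and generalizes better in cross-domain transfer.
Insight: The contribution is a reinforcement learning setup in which an LLM generates textual profiles and an embedding model provides the reward signal, combined with text-space supervision based on next-item prediction, so that profiles stay semantically interpretable while remaining effective for downstream retrieval.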
Abstract: Personalized systems rely on user representations to connect behavioral history with downstream recommendation applications. Existing methods typically employ either supervised latent user embeddings, which are effective for retrieval but difficult to interpret, or textual user profiles, which are interpretable but challenging to optimize for downstream utility due to lack of direct supervision. To bridge this gap, we present BLUE, a reinforcement learning framework that unifies these two forms of user representation by aligning language-based user profiles with embedding-based recommendation objectives. Given a user interaction history, BLUE leverages a profiler Large Language Model (LLM) to generate textual profiles, while an embedding model provides reward signals. This encourages the resulting textual representations to move closer to positive items and farther from negative ones in the embedding space. We further introduce a text-space supervision signal based on next-item prediction, ensuring the learned profiles remain both semantically meaningful and highly effective for downstream retrieval. Experiments on Amazon Reviews 2023 and Google Local Reviews in zero-shot sequential recommendation settings demonstrate that BLUE consistently outperforms strong baselines under both frozen and trainable embedding conditions. Notably, BLUE achieves clear gains in cross-domain transfer, highlighting the strong generalization ability of the learned user profiles. Furthermore, these generated profiles provide superior personalized context for question answering compared to raw user histories or alternative profile optimization methods. Overall, these results show that BLUE provides an effective way to unify interpretable textual profiling with discriminative latent embeddings for personalization.
cs.SD [Back]
[153] Do Joint Audio-Video Generation Models Understand Physics? cs.SD | cs.AI | cs.CV | cs.MMPDF
Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu
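TL;DR: This paper introduces AV-Phys Bench, a benchmark for evaluating how well joint audio-video generation models understand physical commonsense. It finds that existing models fall well short on physical consistency, especially on event-driven and environment-driven scene transitions and on anti-physics prompts.
Details
Motivation: To determine whether current joint audio-video generation models actually understand audio-visual physics or merely generate plausible-looking content that violates real-world consistency.
Result: Three proprietary and four open-source models are evaluated on AV-Phys Bench. Seedance 2.0 performs best overall, but all models remain far from robust physical understanding, with performance dropping sharply on event transitions, environment transitions, and anti-physics prompts.
Insight: The contributions are AV-Phys Bench, the first benchmark for audio-visual physical commonsense, and the AV-Phys Agent evaluator. The key open challenges are cross-modal physical consistency and modeling transition-driven scene dynamics.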
Abstract: Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
cs.LG [Back]
[154] ProtSent: Protein Sentence Transformers cs.LG | cs.CLPDF
Dan Ofer, Oriel Perets, Michal Linial, Nadav Rappoport
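TL;DR: This paper presents ProtSent, a contrastive fine-tuning framework for adapting protein language models (pLMs) into general-purpose protein embedding models. It trains contrastively on five protein-pair datasets so that the embedding space better reflects functional, evolutionary, and structural similarity, and it is evaluated on 23 downstream tasks.
Details
Motivation: Mean-pooled sequence embeddings from existing protein language models are not explicitly trained to reflect functional, evolutionary, or structural similarity between proteins, so a method is needed to turn these models into general-purpose embedding models that capture protein similarity.
Result: On ESM-2 150M, ProtSent improves 15 of 23 downstream tasks, including +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The smaller 35M variant improves 16 tasks, with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40.
Insight: The contribution is a contrastive fine-tuning framework with no task-specific supervision that combines multi-source protein-pair data (such as Pfam families, structural hard negatives, and protein-protein interactions) to restructure the embedding space so that it better captures protein function and structure, improving downstream performance.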
Abstract: Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein–protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, and training recipe and code.
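The in-batch-negatives objective named in the abstract can be written out directly: each anchor is scored against every positive in the batch, and the loss is the cross-entropy of picking its own positive. The sketch below is a dependency-free reimplementation for illustration, not the sentence-transformers code; the 2-D embeddings are made up, and the scale of 20 is an assumed temperature.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def multiple_negatives_ranking_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    scale: float = 20.0,  # assumed temperature, not taken from the paper
) -> float:
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, positive) for positive in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Made-up embeddings for two protein pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each anchor is closest to its own positive and large when the pairing is scrambled, which is the pressure that pulls related proteins together in the embedding space.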
[155] Theoretical Limits of Language Model Alignment cs.LG | cs.CL | cs.CY | cs.ITPDF
Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli
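TL;DR: This paper analyzes the fundamental limits of language model alignment from an information-theoretic perspective, deriving the maximum achievable expected reward gain for a given KL-divergence budget. The optimal reward improvement is governed by a Jeffreys divergence term, and the authors propose a practical estimator based on samples from the base model. The paper also analyzes reward hacking in the proxy-reward setting and proves that reward ensembling mitigates it. Empirically, best-of-N comes close to the theoretical limit, while PPO and GRPO are substantially suboptimal.
Details
Motivation: Although alignment methods such as reinforcement learning and best-of-N are widely used, the fundamental limits of reward improvement under a KL budget are poorly understood. The paper characterizes the theoretical limits of KL-regularized alignment to clarify the basic constraints of the alignment process.
Result: The KL-reward Pareto frontier is computed for two tasks, safety and summarization. Best-of-N closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal.
Insight: The contributions are a closed-form expression for the optimal reward improvement under KL regularization (governed by a Jeffreys divergence rather than √KL), a practical estimator based on a covariance under the base model, and a proof that reward ensembling mitigates reward hacking. The work gives alignment methods a theoretical yardstick and exposes the gap between current algorithms such as PPO and the theoretical limit.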
Abstract: Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
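A small numerical check can make the role of the Jeffreys divergence concrete. For the standard KL-regularized solution, the exponentially tilted policy $\pi^*(y) \propto \pi_0(y)\exp(r(y)/\beta)$, the expected reward gain over $\pi_0$ equals $\beta$ times the symmetrized KL (Jeffreys) divergence between $\pi^*$ and $\pi_0$. This is a textbook identity and not necessarily the exact expression derived in the paper; the three-outcome base policy, rewards, and `beta` below are made up for illustration.

```python
import math

def tilted_policy(base, rewards, beta):
    """Exponentially tilt the base policy: pi*(y) proportional to pi0(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expectation(dist, values):
    return sum(p * v for p, v in zip(dist, values))

base = [0.5, 0.3, 0.2]       # hypothetical base-model probabilities
rewards = [0.0, 1.0, 2.0]    # hypothetical rewards per outcome
beta = 0.5                   # assumed KL penalty factor

aligned = tilted_policy(base, rewards, beta)
gain = expectation(aligned, rewards) - expectation(base, rewards)
jeffreys = kl(aligned, base) + kl(base, aligned)
kl_budget_used = kl(aligned, base)
```

Here `gain` and `beta * jeffreys` agree up to floating-point error, while `kl_budget_used` is only one of the two terms in the Jeffreys sum.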
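[156] Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models cs.LG | cs.AI | cs.CL
Xiaoze Liu, Dhananjay Ram, Yuting Zhang, Zhaoyang Zhang, Wei Xia
TL;DR: This paper proposes Mutual Reinforcement Learning, a framework for concurrent RL post-training of heterogeneous LLMs. Through a Shared Experience Exchange, Multi-Worker Resource Allocation, and a Tokenizer Heterogeneity Layer, different models exchange experience while keeping their own parameters, objectives, and tokenizers. The authors instantiate three experience-sharing methods on top of GRPO, use a contextual-bandit analysis to locate them on a stability-support trade-off, and find that outcome-level sharing performs best in the evaluated regime.
Details
Motivation: Heterogeneous LLMs struggle to share experience effectively during RL post-training. The paper aims for a unified framework that lets different model families exchange experience, improving training efficiency and performance.
Result: In the evaluated regime, outcome-level sharing (SGT) occupies the favorable point of the stability-support trade-off, supplying a rescue-set score direction toward verified peer successes.
Insight: The novelty is a general framework that supports experience sharing among heterogeneous LLMs, together with three sharing mechanisms at different levels (data, value, and outcome). The theoretical analysis characterizes where each sits on the stability-support trade-off and offers new design ideas for collaborative learning among heterogeneous models.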
Abstract: We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.
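[157] When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG | cs.CL
Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok
TL;DR: This paper studies how well routers work in Mixture-of-Experts language models. A counterfactual analysis shows that the standard top-k router performs poorly on the fragile tokens that drive reasoning, where lower-loss alternative routes exist but are not selected.
Details
Motivation: The goal is to assess whether the routes chosen by the router of a trained MoE model are good ones, especially on the fragile tokens that drive hard reasoning, where existing routers may fail to exploit better routes already present inside the model.
Result: Across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, the standard router is shown to have low utility on fragile tokens. Updating only the final-layer router is enough to shift pass@K on AIME 2024+2025 and HMMT 2025.
Insight: The novelty is a counterfactual routing analysis that exposes a limitation of standard router training (the loss scores only the executed route, and load balancing depends only on aggregate statistics). It indicates that routing misallocation is one of the performance bottlenecks, not expert capacity alone.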
Abstract: Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.
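The counterfactual comparison can be sketched on a toy example. The code below picks the standard top-k route from router logits and then checks every other route with the same number of experts for a higher score. It is an illustration only: `route_log_prob` is a hypothetical stand-in for a forward pass of the frozen model that returns the log-probability of the realized next token, the score table is invented, and the sketch enumerates all alternatives where the paper samples them.

```python
from itertools import combinations

def top_k_route(router_logits, k):
    """Indices of the k experts with the largest router logits, sorted."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return tuple(sorted(ranked[:k]))

def better_equal_compute_routes(router_logits, k, route_log_prob):
    """Return the standard route and every k-expert route that scores higher."""
    standard = top_k_route(router_logits, k)
    baseline = route_log_prob(standard)
    better = []
    for route in combinations(range(len(router_logits)), k):
        if route != standard:
            score = route_log_prob(route)
            if score > baseline:
                better.append((route, score))
    return standard, better

# Hypothetical log-probabilities of the realized next token under each route.
toy_scores = {(0, 1): -2.0, (0, 2): -1.0, (0, 3): -3.0,
              (1, 2): -2.5, (1, 3): -4.0, (2, 3): -5.0}
standard, better = better_equal_compute_routes(
    [2.0, 1.0, 0.5, -1.0], k=2, route_log_prob=toy_scores.__getitem__)
```

In this toy table the router picks experts 0 and 1, yet the route through experts 0 and 2 assigns the realized token a higher probability at the same compute, which is the kind of misallocation the abstract describes.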
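[158] ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression cs.LG | cs.CL
Tingcheng Bian, Yuzhe Zhang, Jing Jin, Jinchang Luo, MingQuan Cheng
TL;DR: This paper proposes ExpThink, a reinforcement learning framework for adaptive chain-of-thought (CoT) compression. Using experience-guided reward shaping and a difficulty-adaptive advantage, it shortens responses substantially (by up to 77%) on multiple mathematical reasoning benchmarks while also improving accuracy, achieving a higher accuracy-efficiency ratio than the baseline and other RL-based compression methods.
Details
Motivation: Existing RL methods for CoT compression rely on uniform, static length penalties that ignore both the changing capability of the model during training and differences in problem difficulty, which leads to poor compression. The paper targets both dimensions.
Result: On multiple mathematical reasoning benchmarks, ExpThink reduces average response length by up to 77% while improving accuracy. Its accuracy-efficiency ratio (accuracy divided by average token count) is up to 3× higher than that of the vanilla baseline, and it outperforms existing RL-based compression methods on both length and accuracy.
Insight: There are two contributions. (1) Experience-guided reward shaping tracks the shortest correct answer for each problem and applies a three-tier reward, forming a self-evolving curriculum that needs no manual scheduling. (2) A difficulty-adaptive advantage replaces standard-deviation normalization with correct-count normalization, producing gradients that scale monotonically with difficulty: learning is amplified on hard problems to preserve accuracy and suppressed on easy ones to encourage brevity. Together these mechanisms enforce an accuracy-first, compression-second training objective.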
Abstract: Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to $3\times$ higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
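The three-tier reward can be sketched as follows. This is a guess at the mechanics, not the paper's implementation: the abstract does not say how "concise" is thresholded or how large the discount is, so the sketch assumes a response is concise when it is no longer than the shortest correct response seen so far for that problem, and uses a hypothetical `discount=0.5`.

```python
def three_tier_reward(is_correct, length, shortest_correct, discount=0.5):
    """Full credit if correct and concise, discounted if correct but verbose, else zero.

    shortest_correct is the best correct length seen so far (None if there is none).
    discount is a hypothetical value; the source does not specify it.
    """
    if not is_correct:
        return 0.0
    if shortest_correct is None or length <= shortest_correct:
        return 1.0
    return discount

class ExperienceTracker:
    """Remembers the shortest correct response length per problem."""

    def __init__(self):
        self.shortest = {}

    def score(self, problem_id, is_correct, length):
        best = self.shortest.get(problem_id)
        reward = three_tier_reward(is_correct, length, best)
        # The threshold tightens whenever a shorter correct response appears.
        if is_correct and (best is None or length < best):
            self.shortest[problem_id] = length
        return reward

tracker = ExperienceTracker()
rewards = [
    tracker.score("p1", True, 100),   # first correct answer sets the threshold
    tracker.score("p1", True, 150),   # correct but longer than the best so far
    tracker.score("p1", True, 80),    # shorter correct answer tightens the threshold
    tracker.score("p1", False, 10),   # incorrect answers earn nothing
    tracker.score("p1", True, 100),   # now verbose relative to the new threshold
]
```

Because the threshold only moves when the model itself finds a shorter correct answer, the curriculum advances with the model and needs no hand-tuned schedule, which matches the "self-evolving" description in the abstract.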
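[159] Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor’s Internal States cs.LG | cs.AI | cs.CL
Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim
TL;DR: This paper proposes POISE (Policy Optimization with Internal State Value Estimation), a new reinforcement learning method for large reasoning models. It builds a lightweight value estimator from internal states already computed during the policy forward pass (hidden states and token-entropy statistics), obtaining a baseline for variance reduction at negligible cost. POISE uses a cross-rollout construction to keep the gradient unbiased and needs only a single rollout to estimate prompt value, which allows higher prompt diversity under a fixed compute budget, more stable learning, and no compute overhead for detecting zero-advantage prompts.
Details
Motivation: Existing methods for reinforcement learning with verifiable rewards on large reasoning models pay a high compute cost for baseline estimation: PPO needs a value model at the scale of the policy model, and GRPO needs multiple rollouts per prompt to stabilize its empirical group mean. The paper aims to obtain an effective baseline cheaply from the policy model's own internal signals.
Result: On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Its value estimator performs similarly to a separate LLM-scale value model and generalizes to various verifiable tasks.
Insight: The core innovation is to build a lightweight, online-trained value estimator from internal states that already exist in the policy forward pass (hidden states and token entropy), and to use a cross-rollout construction to remove the gradient bias that trajectory-conditioned features would otherwise introduce. This yields more efficient and stable policy optimization, reduces reliance on an external value model or multiple rollouts, and offers a low-cost, high-performance alternative for baseline estimation in RL.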
Abstract: Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model’s internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout’s value from an independent rollout’s internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the sampling overhead of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model’s own internal representations, POISE enables more stable and efficient policy optimization.
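The unbiasedness requirement rests on a standard fact: subtracting a baseline that does not depend on the sampled action leaves the expected policy gradient unchanged and can reduce its variance. The toy check below verifies this exactly for a two-action softmax policy by enumerating both actions. It is a generic REINFORCE illustration, not POISE's probe; the logits, rewards, and the constant baseline of 0.5 are assumptions for the example.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_mean_and_variance(logits, rewards, baseline):
    """Exact mean and total variance of the estimator
    (reward - baseline) * d log pi(action) / d logits, over all actions."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for action, p_action in enumerate(probs):
        # Score function of a softmax policy: one-hot(action) - probs.
        score = [(1.0 if j == action else 0.0) - probs[j] for j in range(n)]
        sample = [(rewards[action] - baseline) * s for s in score]
        mean = [m + p_action * g for m, g in zip(mean, sample)]
        second_moment += p_action * sum(g * g for g in sample)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

logits, rewards = [0.0, 0.0], [1.0, 0.0]   # toy two-action problem
mean_plain, var_plain = gradient_mean_and_variance(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_mean_and_variance(logits, rewards, baseline=0.5)
```

Both calls give the same mean gradient, while the variance is lower with the baseline. A baseline computed from the same trajectory it is applied to would break the independence this argument needs, which is the problem the cross-rollout construction is described as solving.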
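[160] Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators cs.LG | cs.AI | cs.CL
Tsuyoshi Okita
TL;DR: This paper proposes the CIKA framework, which uses a large language model as an interventional simulator and applies causal interventions to activate knowledge the model has already mastered but does not use, thereby improving mathematical reasoning. The method outperforms existing models on several mathematical reasoning benchmarks.
Details
Motivation: Existing methods, such as MCTS-based test-time search or causal-graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, because observed associations may be spurious and driven by confounders such as problem difficulty.
Result: CIKA reaches 69.7% accuracy on the Omni-MATH-Rule benchmark and 64.0% on Omni-MATH overall, compared with 60.5% for o1-mini. It reaches 97.2% on GSM8K, 46–50% on AIME 2024–2026, and 46.2% on MathArena. The Causal Knowledge Activation component contributes 33.8% of correct answers on problems where the base model alone fails.
Insight: The core innovation is the Interventional Capability Probe (ICP), which sets the concept-mastery state through an exogenous intervention in order to estimate causal effects. This separates causally relevant concepts from irrelevant ones and activates knowledge the model possesses but has not activated.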
Abstract: Recent methods for improving LLM mathematical reasoning, whether through MCTS-based test-time search or causal graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, as the observed association may be spurious, driven by confounders such as problem difficulty. We propose CIKA (Causal Intervention for Knowledge Activation), a framework that uses the LLM itself as an interventional simulator: a prompt sets the concept state to “mastered” and the correctness change estimates the causal effect. We formalize this quantity as an Interventional Capability Probe (ICP), which diagnoses whether the LLM can use a given concept, as distinct from merely possessing knowledge. Because the intervention exogenously sets the concept state independently of problem difficulty, ICP separates out confounding that observational methods cannot. On 67 screened problems, the ICP of the top-ranked concept (+0.219) is significantly larger than that of the negative control (+0.039; paired $t$-test, $p < 10^{-6}$, Cohen’s $d = 0.86$), confirming that the probe discriminates causally relevant concepts from irrelevant ones. Analysis of 601 Omni-MATH problems further shows that solved problems have 6.1$\times$ higher average treatment effect (ATE) than unsolved ones (0.338 vs. 0.055), confirming that ICP is predictive of problem-solving success. With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7% on the contamination-free Omni-MATH-Rule benchmark and 64.0% overall, compared to 60.5% for o1-mini, and 97.2% on GSM8K, 46–50% on AIME 2024–2026, and 46.2% on MathArena. The Causal Knowledge Activation component contributes 33.8% of correct answers on problems where the base model alone fails, demonstrating that the LLM already possessed but had not activated the requisite knowledge.
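The probe described above reduces to a paired difference in correctness. The sketch below computes it for a candidate concept and for a negative-control concept. It assumes the simplest possible estimator (mean correctness with the "mastered" prompt minus mean correctness without it, on the same problems); the function name and the 0/1 outcomes are hypothetical and the paper's estimator may differ.

```python
def interventional_capability_probe(correct_with_prompt, correct_without_prompt):
    """Mean change in correctness when the prompt declares the concept mastered.

    Both arguments are equal-length lists of 0/1 outcomes on the same problems.
    """
    if len(correct_with_prompt) != len(correct_without_prompt):
        raise ValueError("outcomes must be paired problem by problem")
    n = len(correct_with_prompt)
    return sum(w - b for w, b in zip(correct_with_prompt, correct_without_prompt)) / n

# Hypothetical outcomes on five problems.
baseline        = [0, 0, 1, 0, 1]
relevant_prompt = [1, 0, 1, 1, 1]   # intervention on a concept that plausibly matters
control_prompt  = [0, 0, 1, 0, 1]   # intervention on a negative-control concept
icp_relevant = interventional_capability_probe(relevant_prompt, baseline)
icp_control = interventional_capability_probe(control_prompt, baseline)
```

A concept whose probe value is clearly above the control's is a candidate for activation; the abstract reports this contrast as +0.219 against +0.039 on 67 screened problems.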
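[161] Tracing Uncertainty in Language Model “Reasoning” cs.LG | cs.AI | cs.CL
Nils Grünefeld, Bertram Højer, Philipp Mondorf, Barbara Plank, Anna Rogers
TL;DR: This paper studies the dynamics of language model (LM) reasoning through uncertainty quantification. It treats the intermediate token sequences an LM generates (its reasoning traces) as evolving model states and summarizes each trace with an uncertainty trace profile, a set of features such as slope and linearity. These profiles predict whether a trace yields a correct answer, with AUROC up to 0.807 on GSM8K and ProntoQA, and reach AUROC 0.801 using only the first few hundred tokens of a trace, indicating that errors can be detected early. Correct and incorrect traces have qualitatively different uncertainty profiles: in correct traces uncertainty declines more steeply and less linearly.
Details
Motivation: LM reasoning (for example Chain-of-Thought or test-time scaling) improves benchmark performance, but the dynamics underlying it are poorly understood. The paper uses uncertainty quantification to study these dynamics, treating reasoning traces as an evolving model state.
Result: Across five language models evaluated on GSM8K and ProntoQA, uncertainty profiles predict trace correctness with AUROC up to 0.807, improving on recent related work. Using only the first few hundred tokens of a trace still gives AUROC 0.801, showing that errors can be detected early.
Insight: The novelty is to apply an uncertainty quantification framework to the analysis of LM reasoning traces and to propose the uncertainty trace profile as a compact feature set. The analysis reveals qualitative differences in the shape of the uncertainty signal between correct and incorrect traces (for example in slope and linearity), and it offers a principled lens, grounded in decision-making under uncertainty, for studying the generative process.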
Abstract: Language model (LM) “reasoning”, commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treating the “reasoning” traces, the intermediate token sequences generated by LMs, as evolving model states. We summarize each trace by an uncertainty trace profile: a small set of features describing the shape of the uncertainty signal over its trace, such as its slope and linearity. We find that across five LMs evaluated on GSM8K and ProntoQA, these profiles predict whether a trace yields a correct final answer with AUROC up to 0.807, improving markedly on recent related work. We reach AUROC 0.801 using only the first few hundred tokens of full traces, suggesting that errors can be detected early in the generation. A detailed comparison of correct and incorrect traces further reveals qualitatively distinct uncertainty profiles, with correct traces showing a steeper and less linear decline in uncertainty. Together, the results suggest that our method, grounded in decision-making under uncertainty, provides a principled lens for studying the generative process underlying LM “reasoning”.
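As a rough illustration of what an uncertainty trace profile might contain, the sketch below computes two shape features from a per-token uncertainty signal: the least-squares slope against token position, and the coefficient of determination $R^2$ of that linear fit as a stand-in for "linearity". The choice of token entropy as the uncertainty measure, of $R^2$ as the linearity feature, and the toy signals are all assumptions; the paper's exact feature definitions are not given in the abstract.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def trace_profile(uncertainties):
    """Least-squares slope and R^2 of uncertainty against token position.

    Expects at least two values.
    """
    n = len(uncertainties)
    mean_x = (n - 1) / 2
    mean_y = sum(uncertainties) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(uncertainties))
    syy = sum((y - mean_y) ** 2 for y in uncertainties)
    slope = sxy / sxx
    linearity = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return {"slope": slope, "linearity": linearity}

steady = trace_profile([3.0, 2.0, 1.0, 0.0])         # straight-line decline
front_loaded = trace_profile([4.0, 2.0, 1.0, 0.5])   # steep early drop, then flat
```

The second signal has both a steeper fitted slope and a lower linearity score than the first, which is the qualitative pattern the abstract associates with correct traces. In practice the inputs would come from something like `token_entropy` applied at each generated position.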
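[162] KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG | cs.AI | cs.CL
Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo
TL;DR: This paper proposes vOPD (On-Policy Distillation with a control variate baseline) to stabilize on-policy distillation (OPD) training for large language models. It treats OPD as a policy-gradient RL problem and introduces a control variate baseline (a value function) to reduce gradient variance. The value function has a closed form, the per-token negative reverse KL divergence between the student and the teacher, and requires no extra computation. vOPD keeps the lightweight single-sample estimator unbiased while lowering its variance, outperforms vanilla OPD on mathematical and scientific reasoning benchmarks, and matches the expensive full-vocabulary baseline.
Details
Motivation: On-policy distillation has become a dominant post-training paradigm for large language models, but in practice it is unstable because its single-sample Monte Carlo estimator has high gradient variance, and recipes for stable training are still immature. The paper aims to stabilize OPD training by importing the control variate baseline from reinforcement learning.
Result: On mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, stabilizing on-policy distillation efficiently through principled RL variance reduction.
Insight: The novelty is to formalize OPD as policy-gradient RL and to derive a closed form for its value function (the per-token negative reverse KL divergence), which gives unbiased variance reduction with no extra inference or critic network. A top-k approximation of the baseline lowers cost further without hurting performance, giving an efficient and theoretically grounded way to stabilize distillation.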
Abstract: On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline (canonically a value function) from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
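The closed-form baseline can be checked on a single token position. The sketch below assumes the usual reverse-KL distillation setup in which the per-token reward for a token sampled from the student is the log-ratio $\log p_T(a) - \log p_S(a)$; under that assumption its expectation over the student is exactly $-\mathrm{KL}(p_S \,\|\, p_T)$, so subtracting that quantity gives a zero-mean advantage. The distributions are invented, and the paper's value function may account for terms this one-step sketch ignores.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) at one token position."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def single_sample_reward(token, student, teacher):
    """Log-ratio reward for one token sampled from the student."""
    return math.log(teacher[token]) - math.log(student[token])

def advantage(token, student, teacher):
    """Single-sample reward minus the closed-form baseline -KL(student || teacher)."""
    baseline = -reverse_kl(student, teacher)
    return single_sample_reward(token, student, teacher) - baseline

student = [0.7, 0.2, 0.1]   # hypothetical next-token distributions
teacher = [0.5, 0.3, 0.2]
expected_reward = sum(p * single_sample_reward(a, student, teacher)
                      for a, p in enumerate(student))
mean_advantage = sum(p * advantage(a, student, teacher)
                     for a, p in enumerate(student))
```

Both distributions are already available from the forward passes used to compute the reward, which is why the abstract describes the baseline as requiring no additional critic or inference.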
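[163] Predictive but Not Plannable: RC-aux for Latent World Models cs.LG | cs.AI | cs.CV
Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
TL;DR: This paper proposes RC-aux, an auxiliary objective that corrects the spatiotemporal mismatch between prediction and planning in latent world models. It improves the planning alignment of the latent space through multi-horizon prediction and budget-conditioned reachability supervision, and is validated on LeWorldModel in goal-directed pixel-control tasks.
Details
Motivation: Latent world models can predict accurately over short horizons while their latent space remains poorly matched to long-horizon planning. The core issue is a spatiotemporal mismatch: training uses local predictive supervision, but deployment requires long-horizon goal-directed search in a latent space where Euclidean distance may not reflect reachability under a finite action budget.
Result: RC-aux is instantiated on LeWorldModel and evaluated under continuation-training and matched-from-scratch settings. On goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning at modest additional cost.
Insight: The novelty is a lightweight auxiliary objective that adds multi-horizon open-loop prediction along the time axis and budget-conditioned reachability supervision (with temporal hard negatives) along the space axis, so that the latent space encodes the temporal and geometric structure that downstream search needs. Viewed objectively, the work stresses that planning performance depends not only on predictive accuracy but also on whether the learned representation is aligned with what planning requires.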
Abstract: A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.
cs.GR
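[164] PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation cs.GR | cs.CV | cs.MM
Junchuan Zhao, Qifan Liang, Ye Wang
TL;DR: This paper proposes PersonaGest, a two-stage framework for personalized co-speech gesture generation. The first stage uses a semantic-guided RVQ-VAE to disentangle gesture content from style within the residual quantization structure. The second stage generates content tokens with a Masked Generative Transformer and then applies Style Residual Transformers for style control. The method reaches state-of-the-art results on objective metrics and perceptual user studies and shows strong style consistency with the reference prompt.
Details
Motivation: Existing VQ-VAE-based co-speech gesture generation methods improve generation quality but neither encode semantic structure into the motion representation nor explicitly disentangle content from style, which limits semantic coherence and personalization fidelity. The paper addresses both limitations together.
Result: The method demonstrates state-of-the-art performance on objective metrics and perceptual user studies, with strong style consistency to the reference motion prompt.
Insight: There are two contributions. (1) A semantic-guided RVQ-VAE explicitly disentangles content from style using a semantic-aware motion codebook and contrastive learning. (2) A two-stage generation framework combines a semantic-aware re-masking strategy with a cascade of Style Residual Transformers to produce gestures that are high quality, semantically coherent, and controllable in style.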
Abstract: Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE based co-speech gesture generation methods improve generation quality but fail to encode semantic structure into the motion representation or explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework addressing both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure, where a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and perceptual user studies, with strong style consistency to the reference prompt. Our project page with demo videos is available at https://danny-nus.github.io/PersonaGest/
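For readers unfamiliar with the residual quantization structure that the first stage builds on, here is a minimal sketch of generic residual vector quantization (RVQ): each stage picks the nearest code for whatever the previous stages left unexplained. It illustrates only the generic mechanism, not PersonaGest's semantic codebook or its content/style assignment; the two tiny codebooks and the input vector are hypothetical.

```python
def nearest_code(codebook, vector):
    """Index of the codebook entry closest to vector in squared distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage codes the previous stage's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index = nearest_code(codebook, residual)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected code vectors from every stage."""
    total = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        total = [t + c for t, c in zip(total, codebook[index])]
    return total

coarse = [[0.0, 0.0], [1.0, 1.0]]     # hypothetical first-stage codebook
fine = [[0.1, -0.1], [-0.1, 0.1]]     # hypothetical residual codebook
codes = rvq_encode([1.1, 0.9], [coarse, fine])
approx_one_stage = rvq_decode(codes[:1], [coarse])
approx_two_stage = rvq_decode(codes, [coarse, fine])
```

Decoding with only the first index gives a coarse reconstruction, and adding the second stage recovers the detail. That layered structure is what makes it natural to assign different kinds of information to different stages.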
cond-mat.mtrl-sci
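[165] Fine-tuning a vision-language model for fracture-surface morphology recognition cond-mat.mtrl-sci | cs.CV
Quanliang Liu, Jungtaek Kim, Kangwook Lee, Hyunseok Oh
TL;DR: This study fine-tunes the open-source vision-language model Qwen3-VL-32B-Instruct to build a specialist model for recognizing fracture-surface morphology. The authors assemble a dataset of 13,168 images mined from open-source literature, generate morphology annotations with GPT-5.2-Reasoning, and add manual collection and image-rotation augmentation. On a benchmark of 100 manually annotated images, the model outperforms several flagship proprietary multimodal models.
Details
Motivation: General-purpose vision-language models lack the domain-specific visual knowledge that materials characterization requires, so they cannot interpret scientific images reliably. The study aims to improve performance on fracture-surface image analysis through domain-specific fine-tuning.
Result: On the benchmark of 100 manually annotated images, the fine-tuned specialist model reaches a precision of 0.92, well above the base Qwen3-VL-32B-Instruct (0.35), GPT-5.5-Reasoning (0.58), and Gemini 3.1 Pro-Reasoning (0.78), achieving state-of-the-art performance. Ablations show that manual collection of rare-feature images and image-rotation augmentation both help recognition of less common fracture morphology features.
Insight: The novelty is to turn a general-purpose VLM into a domain specialist through targeted data collection (manual collection plus rotation-based augmentation) and fine-tuning. This offers a workable template for applying VLMs to other scientific image analysis domains, such as autonomous microscopy workflows, by combining domain-specific visual accuracy with broader multimodal reasoning.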
Abstract: Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.
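Rotation-based augmentation is cheap for fracture-surface images because a morphology label usually does not depend on image orientation. The sketch below produces four rotated copies of one labeled image. It assumes rotations in 90-degree steps, which the abstract does not specify, and uses a nested list as a stand-in for pixel data; the label string is hypothetical.

```python
def rotate_clockwise(image):
    """Rotate a 2-D grid (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_augment(image, label):
    """Return the image at 0, 90, 180 and 270 degrees, each with the same label."""
    samples = []
    current = [list(row) for row in image]
    for _ in range(4):
        samples.append((current, label))
        current = rotate_clockwise(current)
    return samples

tile = [[1, 2],
        [3, 4]]                                  # stand-in for pixel data
augmented = rotation_augment(tile, "dimples")    # hypothetical morphology label
```

Applying this to the rare-feature images gives the model more views of the uncommon morphologies without new annotation effort, in line with the ablation result reported above.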