Table of Contents
- cs.CL [Total: 26]
- cs.CV [Total: 43]
- cs.SE [Total: 1]
- eess.IV [Total: 3]
- cs.MM [Total: 1]
- cs.LG [Total: 5]
- cs.AI [Total: 6]
- cs.RO [Total: 3]
cs.CL
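[1] Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering cs.CL | cs.CV
Jongha Kim, Byungoh Ko, Jeehye Na, Jinsung Yoon, Hyunwoo J. Kim
TL;DR: This paper proposes Relevance-aware Multi-context Contrastive Decoding (RMCD), a new decoding method for retrieval-augmented visual question answering. RMCD combines the outputs predicted with each context, weighting each by its relevance to the question, so that useful information from multiple relevant contexts is aggregated while the negative effects of irrelevant contexts are counteracted.
Details
Motivation: Despite their remarkable capabilities, Large Vision Language Models (LVLMs) still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) enhances LVLMs with additional contexts from an external knowledge base, but existing decoding methods neither fully leverage multiple relevant contexts nor suppress the negative effects of irrelevant ones.
Result: Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs and achieves the best performance on three knowledge-intensive visual question-answering benchmarks, with no additional training required. Analyses also show that RMCD is robust to retrieval quality, performing best from the weakest to the strongest retrieval results.
Insight: The novelty is a relevance-weighted multi-context contrastive decoding strategy that dynamically integrates and filters retrieved contexts, improving performance on knowledge-intensive tasks. The method is simple to apply and requires no retraining.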
Abstract: Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at https://github.com/mlvlab/RMCD.
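Illustration (not from the paper): the relevance-weighted combination described in the abstract can be sketched roughly as follows. The weighting rule (relevance minus a threshold `tau`, so that low-relevance contexts receive a negative weight and are contrasted away), the function names, and the toy logits are all assumptions made for this sketch; the linked repository holds the authors' actual formulation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def combine_logits(base_logits, context_logits, relevance, tau=0.5):
    """Hypothetical relevance-weighted contrastive combination.

    base_logits:    next-token logits with no retrieved context.
    context_logits: one logit vector per retrieved context.
    relevance:      assumed relevance score in [0, 1] per context.
    Contexts scoring above tau pull the prediction toward their output;
    contexts scoring below tau push it away (the contrastive part).
    """
    combined = list(base_logits)
    for logits, rel in zip(context_logits, relevance):
        weight = rel - tau
        for i, (ctx, base) in enumerate(zip(logits, base_logits)):
            combined[i] += weight * (ctx - base)
    return softmax(combined)

# Toy vocabulary of three tokens; the first context is relevant, the second is not.
base = [1.0, 1.0, 1.0]
contexts = [[3.0, 1.0, 1.0], [1.0, 3.0, 1.0]]
probs = combine_logits(base, contexts, relevance=[0.9, 0.1])
```

In this toy setup the token favored by the relevant context gains probability, while the token favored by the irrelevant context ends up below the untouched third token.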
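[2] Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding cs.CL | cs.AI
Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan
TL;DR: This paper proposes COVER (Cache Override Verification for Efficient Revision) to address the "flip-flop" oscillations caused by verification in revocable diffusion decoding. COVER performs leave-one-out verification and stable drafting in a single forward pass, accelerating decoding while preserving output quality.
Details
Motivation: Existing verification schemes for parallel diffusion decoding often trigger flip-flop oscillations in which tokens are repeatedly remasked and restored. This weakens the conditioning context for parallel drafting and slows inference because repeated revisions consume the revision budget.
Result: Across multiple benchmarks, COVER markedly reduces unnecessary revisions and achieves faster decoding while preserving output quality.
Insight: The novelty lies in building two attention views via KV cache override for leave-one-out verification, with a closed-form diagonal correction to prevent self-leakage. COVER also prioritizes verification seeds with a stability-aware score that balances uncertainty, downstream influence, and cache drift, and adapts the number of seeds verified per step.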
Abstract: Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.
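[3] Is my model “mind blurting”? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA) cs.CL
Quoc Tuan Pham, Mehdi Jafari, Flora Salim
TL;DR: This paper proposes Recurrence Quantification Analysis (RQA) as a non-textual alternative for analyzing the reasoning chains of large reasoning models at test time. Treating token generation as a dynamical system, it applies RQA metrics such as Determinism and Laminarity to hidden-embedding trajectories to quantify repetition and stalling in latent representations. An analysis of 3,600 generation traces from DeepSeek-R1-Distill shows that RQA captures signals that response length does not reflect and improves prediction of task complexity by 8%.
Details
Motivation: Test-time compute is central to large reasoning models, but analyzing their reasoning behavior through generated text is increasingly impractical and unreliable. Response length is often used as a crude proxy for reasoning effort, yet it fails to capture the dynamics and effectiveness of the chain of thought (CoT) or the generated tokens.
Result: On 3,600 generation traces from DeepSeek-R1-Distill, RQA captures signals not reflected by response length and improves prediction of task complexity by 8%.
Insight: The novelty is bringing RQA, a nonlinear time-series analysis method, to the study of LLM reasoning dynamics: repetition and stalling are quantified from the model's latent embedding trajectories rather than its output text, giving a principled tool for studying the latent token-generation dynamics of test-time scaling.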
Abstract: Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a crude proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thoughts (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing a model’s reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model’s latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA not only captures signals not reflected by response length but also improves prediction of task complexity by 8%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.
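Illustration (not from the paper): Determinism and Laminarity are standard RQA quantities computed from a thresholded recurrence matrix. The sketch below uses the textbook definitions on a toy one-dimensional trajectory. The threshold `eps`, the minimum line lengths, and the toy data are assumptions, and the paper's embedding extraction and parameter choices are not reproduced here.

```python
import math

def recurrence_matrix(traj, eps):
    # R[i][j] = 1 when states i and j lie within eps of each other.
    n = len(traj)
    return [[1 if math.dist(traj[i], traj[j]) <= eps else 0
             for j in range(n)] for i in range(n)]

def run_lengths(seq):
    # Lengths of maximal runs of 1s in a 0/1 sequence.
    runs, count = [], 0
    for v in seq:
        if v:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def determinism(R, l_min=2):
    # Share of off-diagonal recurrence points lying on diagonal lines of
    # length >= l_min. Only upper diagonals are scanned (R is symmetric);
    # the main diagonal is excluded.
    n = len(R)
    runs = []
    for k in range(1, n):
        runs.extend(run_lengths([R[i][i + k] for i in range(n - k)]))
    total = sum(runs)
    return sum(r for r in runs if r >= l_min) / total if total else 0.0

def laminarity(R, v_min=2):
    # Share of recurrence points lying on vertical lines of length >= v_min.
    # Simplification: main-diagonal points are counted here.
    n = len(R)
    runs = []
    for j in range(n):
        runs.extend(run_lengths([R[i][j] for i in range(n)]))
    total = sum(runs)
    return sum(r for r in runs if r >= v_min) / total if total else 0.0

# Toy trajectory that stalls for three steps and then moves on.
trajectory = [(0.0,), (0.0,), (0.0,), (5.0,), (10.0,)]
R = recurrence_matrix(trajectory, eps=0.5)
det = determinism(R)
lam = laminarity(R)
```

The stalled prefix produces long vertical lines, which is the pattern Laminarity is meant to pick up; a trajectory that revisits the same sequence of states would instead raise Determinism.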
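[4] VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation cs.CL
Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen
TL;DR: This paper proposes VowelPrompt, a framework that extracts prosodic features such as pitch, energy, and duration from vowel segments of speech and converts them into natural language descriptions to strengthen LLM-based speech emotion recognition. With a two-stage adaptation strategy (supervised fine-tuning followed by reinforcement learning), it achieves state-of-the-art performance across benchmark datasets in zero-shot, fine-tuned, cross-domain, and cross-lingual settings, and produces interpretable reasoning.
Details
Motivation: Existing LLM-based speech emotion recognition methods typically neglect fine-grained prosodic information, which limits their effectiveness and interpretability. The paper introduces interpretable vowel-level prosodic cues so that the model can reason jointly over lexical semantics and fine-grained prosodic variation.
Result: Extensive evaluation across multiple benchmark datasets shows that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-lingual conditions.
Insight: The novelty is converting prosodic features (pitch, energy, and duration of vowel segments) into natural language descriptions, so that an LLM can process the multimodal information directly. A two-stage adaptation strategy (SFT followed by RLVR implemented via GRPO) improves reasoning and generalization, and yields interpretable outputs grounded in contextual semantics and fine-grained prosodic structure.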
Abstract: Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
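[5] RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution cs.CL
Isaac Picov, Ritesh Goru
TL;DR: RoPE-LIME is an efficient attribution method for explaining the outputs of closed-source LLMs. Given a fixed output from the closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives; a locality kernel in RoPE embedding space and a Sparse-K sampling strategy improve stability and efficiency.
Details
Motivation: API-only access to closed-source LLMs rules out gradient-based attribution, and existing perturbation methods are costly and noisy.
Result: Experiments on HotpotQA and an MMLU subset show that RoPE-LIME produces more informative attributions than leave-one-out sampling, improves over gSMILE, and substantially reduces closed-model API calls.
Insight: The novelty lies in decoupling reasoning from explanation, using a RoPE-space locality kernel for stable similarity under masking, and using Sparse-K sampling to improve interaction coverage under a limited budget. The idea of using a surrogate model and efficient perturbation to reduce dependence on the closed model is reusable elsewhere.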
Abstract: Explaining closed-source LLM outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce RoPE-LIME, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.
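[6] Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math cs.CL
Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal
TL;DR: This paper proposes Consequence-Based Utility, an oracle-free method for evaluating the quality of solutions to research-level math problems. It tests each candidate solution's value as an in-context exemplar for solving related, verifiable questions, which avoids relying on expert verification. Experiments show it outperforms reward models, generative reward models, and LLM judges on ranking quality.
Details
Motivation: Reasoning models are making progress at generating solutions to research-level math problems, but verification still depends on scarce expert time and has become the bottleneck. The paper aims to develop an oracle-free evaluation method that assesses solution quality efficiently without an expert in the loop.
Result: On an original dataset of research-level math problems, Consequence-Based Utility consistently outperforms the baselines on ranking quality. For GPT-OSS-120B it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6; for GPT-OSS-20B it improves AUC from 69.0 to 79.2. Compared with LLM judges, it also shows a larger solver-evaluator gap, maintaining stronger correct-wrong separation even on instances where the solver often fails.
Insight: The novelty is an oracle-free evaluation framework built on the hypothesis that a meaningful solution should contain enough method-level information to improve downstream performance on related questions. Testing candidates as in-context exemplars sidesteps the difficulty of direct verification and may transfer to other domains that need high-quality evaluation but lack labeled data.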
Abstract: Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances where the underlying solver often fails to solve.
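[7] Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production cs.CL
Alberto Andres Valdes Gonzalez
TL;DR: This paper systematically compares prompt-based large language models (LLMs) with fine-tuned encoders on text classification in terms of performance, latency, and cost. Fine-tuned encoders retain competitive classification performance at far lower cost and latency than LLMs, and the paper recommends selecting models for production systems through multi-objective trade-offs.
Details
Motivation: Model selection for text classification in production systems often considers predictive performance alone and overlooks practical constraints such as latency and cost. The paper aims to provide a cost-aware model selection framework for structured text classification.
Result: On four benchmarks (IMDB, SST-2, AG News, and DBPedia), fine-tuned BERT-family encoders achieve competitive or superior macro F1 at one to two orders of magnitude lower cost and latency than zero- and few-shot LLM prompting.
Insight: The novelty is framing model evaluation as a multi-objective decision problem and analyzing the trade-offs with Pareto frontiers and a parameterized utility function. More broadly, the study argues for cost-aware model selection in production: fine-tuned encoders serve as efficient core components, while LLMs fit better as complementary elements in hybrid architectures.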
Abstract: Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
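Illustration (not from the paper): the multi-objective framing in the abstract can be sketched as a Pareto filter plus a weighted utility. The candidate names, all numeric values, and the linear utility form are invented placeholders for this sketch; they are not the paper's measurements or its utility function.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    macro_f1: float    # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # assumed cost per 1k requests, lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    # a dominates b if it is no worse on every objective and better on one.
    no_worse = (a.macro_f1 >= b.macro_f1 and a.latency_ms <= b.latency_ms
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.macro_f1 > b.macro_f1 or a.latency_ms < b.latency_ms
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better

def pareto_front(cands):
    # Keep only candidates that no other candidate dominates.
    return [c for c in cands if not any(dominates(o, c) for o in cands)]

def utility(c: Candidate, w_latency: float, w_cost: float) -> float:
    # Hypothetical linear utility; the weights encode a deployment regime.
    return c.macro_f1 - w_latency * c.latency_ms / 1000 - w_cost * c.cost_usd

# Placeholder values only, chosen to make the trade-off visible.
candidates = [
    Candidate("encoder-small", 0.91, 10, 0.01),
    Candidate("encoder-large", 0.94, 25, 0.03),
    Candidate("llm-zero-shot", 0.90, 800, 1.50),
    Candidate("llm-few-shot", 0.95, 1200, 2.50),
]
front = [c.name for c in pareto_front(candidates)]
best = max(candidates, key=lambda c: utility(c, 0.05, 0.05)).name
```

Raising or lowering `w_latency` and `w_cost` moves the preferred model along the Pareto front, which is the point of parameterizing the utility by deployment regime.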
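[8] FMBench: Adaptive Large Language Model Output Formatting cs.CL
Yaoting Wang, Yun Zhou, Henghui Ding
TL;DR: This paper presents FMBench, a benchmark that evaluates how well LLMs produce outputs satisfying Markdown format constraints across diverse instruction-following scenarios. To address subtle errors common in Markdown, such as in lists, tables, headings, and code blocks, the authors design a lightweight alignment pipeline that combines supervised fine-tuning with reinforcement learning fine-tuning, improving structural correctness while preserving semantic fidelity.
Details
Motivation: In user-facing and system-integrated workflows, LLMs must produce outputs that satisfy both semantic intent and format constraints. Markdown is ubiquitous in assistants, documentation, and tool-augmented pipelines, yet models remain prone to subtle, hard-to-detect errors (broken lists, malformed tables, inconsistent headings, invalid code blocks) that can significantly degrade downstream usability.
Result: Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning fine-tuning initialized from a strong SFT policy further improves robustness to challenging Markdown instructions. The results reveal an inherent trade-off between semantic and structural objectives.
Insight: The novelty is a benchmark focused on adaptive Markdown output formatting that emphasizes real-world behaviors: multi-level organization, mixed content (natural language interleaved with lists, tables, and code), and strict adherence to user-specified layout constraints. Methodologically, the lightweight alignment pipeline avoids hard decoding constraints and instead combines SFT with reinforcement learning fine-tuning to optimize a composite objective balancing semantic fidelity and structural correctness.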
Abstract: Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
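[9] On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation cs.CL
Wenbo Shang, Yuxi Sun, Jing Ma, Xin Huang
TL;DR: This paper proposes HOMER, a multi-role LLM collaboration framework grounded in the humor theory GTVH, for generating humorous image captions. Its three roles (a conflicting-script extractor, a retrieval-augmented hierarchical imaginator, and a caption generator) combine visual understanding, humor reasoning, and creative imagination to address the limited creativity and interpretability of existing LLM approaches.
Details
Motivation: LLM methods for multimodal humor generation, such as humorous image captioning, suffer from limited creativity and interpretability. The paper uses the humor theory GTVH to improve generation quality.
Result: Experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and strong LLM reasoning strategies on multimodal humor captioning.
Insight: The novelty is integrating GTVH into a multi-role LLM collaboration framework; conflicting-script extraction and retrieval-augmented imagination trees expand the creative space, improving the diversity and theoretical grounding of generated humor.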
Abstract: Humor is a common and intricate part of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
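[10] TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking cs.CL | cs.AI | cs.CR
Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang
TL;DR: This paper proposes TrailBlazer, a history-guided reinforcement learning method for black-box jailbreaking of LLMs. It analyzes and reweights vulnerability signals exposed in earlier interaction steps to guide later decisions, improving attack efficiency and success rate.
Details
Motivation: Existing LLM jailbreaking techniques fail to exploit vulnerabilities revealed in earlier interaction turns, which makes attacks inefficient and unstable. Because jailbreaking involves sequential interaction, reinforcement learning is a natural framework for it.
Result: Extensive experiments on AdvBench and HarmBench show that the method achieves state-of-the-art jailbreak performance while significantly improving query efficiency.
Insight: The core novelty is an attention-based reweighting mechanism that highlights critical vulnerabilities in the interaction history, enabling more efficient exploration with fewer queries. This underscores the importance of historical vulnerability signals in RL-driven jailbreak strategies.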
Abstract: Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
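[11] CORE: Comprehensive Ontological Relation Evaluation for Large Language Models cs.CL | cs.AI | cs.LG
Satyam Dwivedi, Sanjukta Ghosh, Shivam Dwivedi, Nishi Kumari, Anil Thakur
TL;DR: This paper introduces CORE (Comprehensive Ontological Relation Evaluation), a dataset and benchmark for evaluating whether LLMs can distinguish meaningful semantic relations from genuine unrelatedness. Current LLMs perform well on related pairs but degrade severely on unrelated pairs, revealing a systematic deficiency in semantic reasoning.
Details
Motivation: Existing evaluations rarely measure the ability of LLMs to distinguish meaningful semantic relations from genuine unrelatedness, so a new benchmark is needed to fill this gap and assess semantic reasoning more comprehensively.
Result: On the CORE benchmark, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy: 86.5-100% on related pairs but only 0-41.35% on unrelated pairs. On the larger 225K MCQ dataset, accuracy drops further to approximately 2%. The human baseline is 92.6%.
Insight: The novelty is a benchmark focused on unrelatedness reasoning. It reveals a systematic bias in how LLMs separate related from unrelated pairs (for example, semantic collapse), pointing to a critical new direction for LLM evaluation and safety.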
Abstract: Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen’s Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
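Illustration (not from the paper): Expected Calibration Error, cited in the abstract, is commonly computed by binning predictions by confidence and averaging the gap between confidence and accuracy per bin. The sketch below uses that standard equal-width-bin definition. The bin count and the toy inputs are assumptions, and the paper may compute ECE differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width-bin ECE.

    confidences: predicted confidence in [0, 1] for each answer.
    correct:     True/False for each answer.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Toy inputs: the same stated confidence, very different accuracy.
ece_related = expected_calibration_error([0.93] * 10, [True] * 9 + [False])
ece_unrelated = expected_calibration_error([0.93] * 10, [True] * 2 + [False] * 8)
```

With confidence held fixed, ECE grows as accuracy falls, which is the kind of mismatch the abstract reports for unrelated pairs.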
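[12] Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning cs.CL
Xinxin Lin, Guangxin Dai, Yi Zhong, Xiang Li, Xue Xiao
TL;DR: This paper proposes ClinMPO, a reinforcement learning framework that aligns the internal reasoning of light-parameter large language models (LLMs) with professional psychiatric practice, targeting the hallucinations and superficial reasoning that limit their use in clinical decision support. The framework uses a specialized reward model trained independently on a dataset built from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. On a benchmark subset designed to isolate reasoning from rote memorization, the ClinMPO-tuned Qwen3-8B model surpasses the diagnostic accuracy of a cohort of 300 medical students.
Details
Motivation: LLMs hold transformative potential for medical decision support, but their use in psychiatry is constrained by hallucinations and superficial reasoning, especially in light-parameter models, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic, leaving models fundamentally misaligned with professional diagnostic cognition.
Result: On a benchmark subset designed to isolate reasoning from memorization, on which leading large-parameter LLMs consistently fail, the ClinMPO-tuned Qwen3-8B model reaches 31.4% diagnostic accuracy, above the 30.8% human benchmark from 300 medical students.
Insight: The claimed novelty is an evidence-guided reinforcement learning framework (ClinMPO) that aligns LLM reasoning with professional clinical practice through an independently trained, specialized reward model. Viewed objectively, the core contribution is building evidence-based medicine principles into the reward mechanism, offering a scalable path for light-parameter models to exceed a human baseline on complex professional reasoning and emphasizing cognitive alignment over pure language modeling.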
Abstract: Large language models (LLMs) hold transformative potential for medical decision support, yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
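[13] RelayGen: Intra-Generation Model Switching for Efficient Reasoning cs.CL
Jiwon Song, Yoongon Kim, Jae-Joon Kim
TL;DR: This paper proposes RelayGen, a training-free framework that improves the efficiency of large reasoning models through segment-level model switching at runtime. It uses an analysis of generation uncertainty to identify difficulty changes within a reasoning trajectory and delegates lower-difficulty segments to a smaller model, significantly reducing inference latency while keeping high-difficulty reasoning on the large model.
Details
Motivation: Large reasoning models excel at complex reasoning but are costly to deploy. Existing efficiency methods either ignore difficulty variation within a single generation or rely on supervised token-level routing with high system complexity. The paper aims for a lightweight approach that exploits difficulty variation in long-form reasoning.
Result: Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the large model's accuracy. Combined with speculative decoding, it achieves up to 2.2× end-to-end speedup with less than 2% accuracy degradation, without additional training or learned routing components.
Insight: The novelty is an offline analysis of generation uncertainty (token probability margins) that identifies cues for segment-level difficulty transitions, enabling training-free, coarse-grained model switching. The underlying observation is that difficulty changes along a reasoning trajectory are coherent at the segment level, so efficient delegation does not require complex token-level control.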
Abstract: Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present RelayGen, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2× end-to-end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.
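Illustration (not from the paper): a token probability margin is usually taken as the gap between the two most likely next tokens, and averaging it over a segment gives a coarse difficulty signal. The sketch below shows that offline-style analysis only. The segment boundaries, the 0.5 threshold, and the routing rule are assumptions for this sketch; RelayGen itself switches on model-specific cues at runtime, which are not reproduced here.

```python
def probability_margin(probs):
    # Gap between the most likely and second most likely token.
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

def segment_margins(token_probs, boundaries):
    """Mean margin per segment.

    token_probs: one next-token distribution per generated position.
    boundaries:  (start, end) index pairs, end exclusive (assumed given).
    """
    margins = [probability_margin(p) for p in token_probs]
    return [sum(margins[s:e]) / (e - s) for s, e in boundaries]

# Toy distributions: two uncertain steps followed by two confident ones.
token_probs = [
    [0.50, 0.40, 0.10],
    [0.45, 0.45, 0.10],
    [0.90, 0.05, 0.05],
    [0.95, 0.03, 0.02],
]
means = segment_margins(token_probs, [(0, 2), (2, 4)])

# Hypothetical routing rule: confident (high-margin) segments go to the small model.
routes = ["small" if m >= 0.5 else "large" for m in means]
```

A low mean margin marks a segment where the model is torn between continuations, so it stays on the large model; a high margin marks a segment that is a candidate for delegation.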
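[14] Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making cs.CL
Baichuan-M3 Team: Chengfeng Dou, Fan Yang, Fei Li
TL;DR: Baichuan-M3 is a medical-enhanced large language model that aims to shift the paradigm from passive question-answering to active, clinical-grade decision support. A specialized training pipeline models a physician's systematic workflow, giving the model proactive information acquisition, long-horizon reasoning that integrates evidence, and adaptive hallucination suppression.
Details
Motivation: Existing systems are limited in open-ended medical consultations; in particular, passive question-answering cannot model the real clinical decision process. The paper aims to provide more reliable, proactive medical decision support.
Result: The model achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu, and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory, and safety.
Insight: The core novelty is shifting LLM use from passive response to active modeling of the clinical workflow, improving reliability and practicality by simulating a physician's systematic information gathering, reasoning, and decision-making. The adaptive hallucination suppression mechanism is a useful reference for ensuring the reliability of medical facts.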
Abstract: We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety. The models are publicly available at https://huggingface.co/collections/baichuan-inc/baichuan-m3.
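[15] Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning cs.CL | cs.LG | stat.ML
Deqian Kong, Minglu Zhao, Aoyang Qin, Bo Pang, Chenxin Tao
TL;DR: This paper proposes Inference-Time Rethinking, a generative framework for math reasoning. It factorizes reasoning into a continuous latent thought vector (what to reason about) and a decoder that generates the text trace conditioned on that vector (how to reason). By alternating between generating a candidate trace and optimizing the latent vector, the model iteratively self-corrects at inference time, improving performance.
Details
Motivation: Standard chain-of-thought reasoning generates a solution in a single forward pass; each token is irrevocable once generated, and there is no mechanism to recover from early errors. The paper addresses this by enabling iterative self-correction at inference time.
Result: On GSM8K, a 0.2B-parameter model trained from scratch, with 30 rethinking iterations, surpasses baselines with 10 to 15 times more parameters, including a 3B-parameter model. This indicates that effective mathematical reasoning can come from sophisticated inference-time computation rather than solely from a massive parameter count.
Insight: The core novelty is decoupling reasoning into declarative latent thought vectors and procedural text generation, and using gradient-based optimization to iteratively refine the latent vector. This gives reasoning an optimizable continuous representation, so the model can improve through inference-time computation rather than simply adding parameters.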
Abstract: Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.
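[16] Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning cs.CL
Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang
TL;DR: This paper exploits the tendency of large reasoning models (LRMs) to spontaneously restate the question at the start of their reasoning chains, termed the Echo of Prompt (EOP), as a compute-allocation mechanism. It formalizes the probabilistic cost of EOP by defining the Echo Likelihood Gap ΔL, and develops two methods: ED-SFT, which instills an "echo-then-reason" pattern through supervised fine-tuning, and Echoic Prompting (EP), which re-grounds the model mid-trace without training. The aim is to improve reasoning efficiency and accuracy; experiments on several math reasoning benchmarks validate the methods.
Details
Motivation: Existing methods (such as scaling self-consistency or adding generic "thinking tokens") neither explain nor exploit the spontaneous restating of the question at the head of an LRM's internal reasoning chain: they either inject task-agnostic tokens or adopt heuristics that cannot account for the phenomenon. The paper analyzes and harnesses EOP as a front-loaded, compute-shaping mechanism to allocate test-time compute more effectively.
Result: On the math reasoning benchmarks GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500, under identical decoding settings and compute budgets, the proposed methods achieve consistent gains over baselines.
Insight: The novelty is formalizing spontaneous early repetition (EOP) as a computable probabilistic cost (via the Echo Likelihood Gap ΔL) and linking it theoretically to likelihood gains and downstream accuracy. Methodologically, the paper proposes ED-SFT to systematically instill the "echo-then-reason" pattern and the training-free Echoic Prompting to re-ground the model mid-trace. Attention analysis shows that EOP increases attention to the answer prefix in middle layers, consistent with an "attention refocusing" mechanism and offering a new view of the model's internal reasoning.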
Abstract: Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic "thinking tokens" and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain, and often ignore, the spontaneous repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the Echo of Prompt (EOP), as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the Echo Likelihood Gap $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical link connecting early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop Echo-Distilled SFT (ED-SFT) to instill an "echo-then-reason" pattern through supervised finetuning, and Echoic Prompting (EP) to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer to answer-prefix attention in middle layers, consistent with an attention refocusing mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at https://github.com/hhh2210/echoes-as-anchors.
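[17] FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge cs.CL
Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Xiao Xu
TL;DR: This paper proposes FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge system that addresses the limitations of existing LLM evaluators in adaptivity, systematic bias, and evaluation consistency. By modeling judging behavior as a learnable, regularized policy, constructing a high-information-density judging dataset, and adopting a curriculum-style SFT-DPO-GRPO training paradigm, FairJudge substantially improves evaluation performance on multiple benchmarks.
Details
Motivation: Existing LLM-as-a-Judge systems have three fundamental problems: limited adaptivity to task- and domain-specific evaluation criteria; systematic biases driven by non-semantic cues such as position, length, format, and model provenance; and evaluation inconsistency that produces contradictory judgments across evaluation modes (e.g., pointwise versus pairwise).
Result: Experiments on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs.
Insight: The novelty lies in modeling judging behavior as a learnable, regularized policy, building a high-information-density dataset from a data-centric perspective to inject supervision signals, and using a curriculum-style training paradigm that progressively aligns rubric adherence, mitigates bias, and ensures cross-mode consistency while avoiding catastrophic forgetting. This offers a new approach to building more reliable and adaptable LLM evaluation systems.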
Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
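[18] Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features cs.CL | eess.AS
Steffen Freisinger, Philipp Seeberger, Tobias Bocklet, Korbinian Riedhammer
TL;DR: This paper proposes a multimodal approach to topic segmentation of spoken content that fine-tunes a text encoder and a Siamese audio encoder to capture acoustic cues around sentence boundaries, improving segmentation performance.
Details
Motivation: Current topic segmentation methods do not fully exploit acoustic features; the work aims to improve segmentation of multi-topic spoken content such as online videos and podcasts.
Result: On a large-scale dataset of YouTube videos, the method yields substantial gains over text-only and multimodal baselines. On three additional datasets in Portuguese, German, and English, it is more robust and outperforms a larger text-only baseline.
Insight: The novelty lies in multimodal learning that combines textual and acoustic features, in particular a Siamese audio encoder that captures acoustic cues at sentence boundaries, which improves robustness to ASR noise and cross-lingual generalization.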
Abstract: Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
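[19] Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought cs.CL
Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang
TL;DR: This paper proposes PACT, a framework that achieves dynamic safety control through risk-aware chain-of-thought, addressing the lack of runtime controllability in static LLM safety policies. PACT uses a hierarchical policy architecture, comprising a non-overridable global safety policy and user-defined policies, and decomposes safety decisions into transparent Classify→Act paths.
Details
Motivation: Static, one-size-fits-all safety policies force a trade-off between helpfulness and safety in LLMs; they lack runtime controllability and are hard to adapt to diverse application needs.
Result: PACT achieves near state-of-the-art safety performance under global policy evaluation and the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off.
Insight: The novelty lies in the hierarchical policy architecture (combining global and user-defined policies) and the transparent decision path based on risk-aware chain-of-thought, which together enable dynamic, configurable safety control.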
Abstract: Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify→Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
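The hierarchical routing described in this abstract can be illustrated with a minimal Python sketch. It is not the PACT implementation: the risk labels, the `UserPolicy` class, and the default action are hypothetical, and the classification step (which PACT performs through chain-of-thought reasoning) is assumed to have already produced a label.

```python
from dataclasses import dataclass, field

# Hypothetical global policy: labels and actions are illustrative only.
GLOBAL_POLICY = {"child_safety": "reject", "violent_extremism": "reject"}
DEFAULT_ACTION = "comply"  # assumed fallback for unlisted labels


@dataclass
class UserPolicy:
    # Maps a domain-specific risk label to one of: comply, guide, reject.
    label_to_action: dict = field(default_factory=dict)


def act(label: str, user_policy: UserPolicy) -> str:
    """Route an already-classified query to an action (the Act half)."""
    # The global policy is consulted first, so a user policy cannot override it.
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]
    # User-defined policies apply only to non-global risk categories.
    return user_policy.label_to_action.get(label, DEFAULT_ACTION)


# The user policy tries (and fails) to relax a global category.
policy = UserPolicy({"medical_advice": "guide", "child_safety": "comply"})
for label in ("medical_advice", "child_safety", "benign"):
    print(label, "->", act(label, policy))
```

Checking the global table before the user table is what makes the global boundary non-overridable in this sketch.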
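[20] Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity cs.CL | cs.AI
Bowen Zhang, Meiyi Wang, Harold Soh
TL;DR: This paper proposes Selective Layer Restoration (SLR), a method applied after post-training that addresses mode collapse in large language models (LLMs) following instruction tuning, i.e., reduced generation diversity and repetitive outputs. SLR restores a specific range of layers in the fine-tuned model to their pre-trained weights, producing a hybrid model that substantially improves generation diversity while maintaining high output quality.
Details
Motivation: Post-training (e.g., instruction tuning) improves instruction following and helpfulness but often reduces generation diversity (mode collapse). Based on evidence that different LLM layers play distinct functional roles, the authors hypothesize that mode collapse can be localized to specific layers and that restoring those layers to their pre-trained weights can recover diversity.
Result: On the designed proxy task (Constrained Random Character, CRC), experiments reveal a diversity-validity trade-off and identify layer-restoration configurations that increase diversity with minimal quality loss. Evaluations across three tasks (creative writing, open-ended question answering, multi-step reasoning) and three model families (Llama, Qwen, Gemma) show that SLR consistently and substantially improves output diversity while maintaining high output quality.
Insight: The claimed contribution is SLR, a training-free method with no additional inference cost that builds a hybrid model by selectively restoring specific layers, effectively balancing diversity and quality. Viewed objectively, the core insight is to localize mode collapse to specific layers inside the model and to use a carefully designed proxy task (CRC) to guide the choice of layers to restore, which provides a new, interpretable dimension for tuning models after fine-tuning.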
Abstract: Post-training improves instruction-following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open-ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre-trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task – Constrained Random Character (CRC) – with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training-free method that restores selected layers in a post-trained model to their pre-trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open-ended question answering, and multi-step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.
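The weight-restoration step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: checkpoints are modeled as plain dictionaries from parameter names to weights, the `layers.<index>.` naming pattern is assumed, and the function name `restore_layers` is hypothetical. The abstract does not say how the range is chosen beyond the CRC proxy task, so the range here is simply an input.

```python
import re


def restore_layers(post_trained, pre_trained, start, end,
                   pattern=r"layers\.(\d+)\."):
    """Return a hybrid state dict in which layers start..end (inclusive)
    take their weights from pre_trained and all else from post_trained."""
    layer_re = re.compile(pattern)
    hybrid = {}
    for name, weight in post_trained.items():
        match = layer_re.search(name)
        if match and start <= int(match.group(1)) <= end:
            hybrid[name] = pre_trained[name]  # restored to pre-trained weights
        else:
            hybrid[name] = weight             # kept from the post-trained model
    return hybrid


# Toy checkpoints: strings stand in for weight tensors.
names = ["embed.weight"] + [f"model.layers.{i}.mlp.weight" for i in range(4)]
pre = {n: f"pre:{n}" for n in names}
post = {n: f"post:{n}" for n in names}

hybrid = restore_layers(post, pre, start=1, end=2)
for n in names:
    print(n, "<-", hybrid[n].split(":")[0])
```

Because the hybrid dictionary has the same keys as the post-trained one, the architecture and parameter count are unchanged, which matches the no-extra-inference-cost property stated in the abstract.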
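[21] compar:IA: The French Government’s LLM arena to collect French-language human prompts and preference data cs.CL | cs.AI
Lucie Termignon, Simonas Zilinskas, Hadrien Pélissier, Aurélien Barrot, Nicolas Chesnais
TL;DR: This paper introduces compar:IA, an open-source digital public service developed by the French government to collect large-scale human preference data from French speakers, addressing the shortcomings of LLMs in performance, cultural alignment, and safety robustness for non-English languages.
Details
Motivation: LLMs show reduced performance, weaker cultural alignment, and weaker safety robustness in non-English languages (French in particular) because pre-training data and human preference alignment data are scarce for those languages.
Result: As of 2026-02-07, the platform has collected over 600,000 free-form prompts and 250,000 preference votes, approximately 89% of which are in French, and has released three complementary datasets: conversations, votes, and reactions.
Insight: The novelty lies in a government-led open-source platform that collects real-world human preference data with low participation friction and privacy protection, provides infrastructure for multilingual model training and evaluation, and is planned to expand into an international digital public good.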
Abstract: Large Language Models (LLMs) often show reduced performance, cultural alignment, and safety robustness in non-English languages, partly because English dominates both pre-training data and human preference alignment datasets. Training methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) require human preference data, which remains scarce and largely non-public for many languages beyond English. To address this gap, we introduce compar:IA, an open-source digital public service developed inside the French government and designed to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across a diverse set of language models, while maintaining low participation friction and privacy-preserving automated filtering. As of 2026-02-07, compar:IA has collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of the data in French. We release three complementary datasets – conversations, votes, and reactions – under open licenses, and present initial analyses, including a French-language model leaderboard and user interaction patterns. Beyond the French context, compar:IA is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and the study of human-AI interaction.
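[22] R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging cs.CL
Yanlin Lai, Mitt Huang, Hangyu Guo, Xiangfeng Wang, Haodong Li
TL;DR: This paper proposes R-Align, a rationale-centric alignment strategy that improves the reasoning fidelity of Generative Reward Models (GenRMs), thereby improving how well Reinforcement Learning from Human Feedback (RLHF) aligns large language models in subjective domains.
Details
Motivation: Existing GenRMs are trained and evaluated only on the label accuracy of preference predictions, ignoring the quality of the rationales they generate. This can leave model decisions inconsistent with human reference judgments and degrade downstream RLHF policy performance.
Result: On multiple reward-model benchmarks, R-Align significantly reduces Spurious Correctness (S-Corr) and yields consistent gains in policy performance on STEM, coding, instruction following, and general tasks.
Insight: The paper's contribution is to highlight the importance of reasoning fidelity for downstream RLHF performance and to propose R-Align, a training method that explicitly supervises the alignment of rationales with gold judgments, offering a new way to improve the robustness and effectiveness of GenRMs.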
Abstract: Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome-label-only, leaving reasoning quality unchecked. We show that reasoning fidelity (the consistency between a GenRM’s preference decision and reference decision rationales) is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr), the fraction of label-correct decisions with rationales misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
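As a worked illustration of the S-Corr definition in this abstract, the following Python sketch computes the fraction of label-correct decisions whose rationale is misaligned. Two points are assumptions, not details from the paper: the denominator is taken to be the label-correct decisions only, and the per-record `rationale_aligned` flag is assumed to come from some external meta-judge. The record fields and function name are hypothetical.

```python
def spurious_correctness(records):
    """Fraction of label-correct decisions whose rationale is misaligned
    with the gold judgment (denominator: label-correct decisions only)."""
    correct = [r for r in records if r["label_correct"]]
    if not correct:
        return 0.0  # no label-correct decisions, so nothing can be spurious
    spurious = sum(1 for r in correct if not r["rationale_aligned"])
    return spurious / len(correct)


# Toy evaluation log: three label-correct decisions, one with a bad rationale.
records = [
    {"label_correct": True, "rationale_aligned": True},
    {"label_correct": True, "rationale_aligned": False},
    {"label_correct": True, "rationale_aligned": True},
    {"label_correct": False, "rationale_aligned": False},
]

s_corr = spurious_correctness(records)
accuracy = sum(r["label_correct"] for r in records) / len(records)
print(f"label accuracy = {accuracy:.2f}, S-Corr = {s_corr:.2f}")
```

In this toy log the label accuracy is 0.75 while one third of the correct labels rest on a misaligned rationale, which is the kind of gap that label accuracy alone would hide.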
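[23] Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling cs.CL | cs.AI
Kate Sanders, Nathaniel Weir, Sapana Chaudhary, Kaj Bostrom, Huzefa Rangwala
TL;DR: This paper proposes a data-driven method that automatically constructs fine-grained reasoning error taxonomies (i.e., rubrics) to strengthen LLM error detection on unseen reasoning traces. In technical domains such as coding, math, and chemical engineering, the method identifies errors better than baseline methods and can be used to build stronger LLM-as-judge reward functions for training reasoning models via reinforcement learning. Experiments show that models trained with these rewards improve task accuracy on difficult domains by up to 45% over models trained with general LLM-as-judge rewards, and approach the performance of models trained with verifiable rewards while using only 20% of the gold labels.
Details
Motivation: LLMs struggle to verify reasoning outputs: they cannot reliably identify errors in thinking traces, particularly in long outputs, in domains requiring expert knowledge, and in problems without verifiable rewards.
Result: On benchmarks in technical domains such as coding, math, and chemical engineering, classification approaches that use the constructed error taxonomies outperform baseline methods at error identification. Models trained with these rewards improve task accuracy by up to 45% over models trained by general LLMs-as-judges, and approach the performance of models trained with verifiable rewards while using only 20% of the gold labels.
Insight: The novelty lies in extending data-driven, fine-grained reasoning error taxonomies (rubrics) from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards, which makes it possible to train models on complex technical problems without a full, costly dataset of gold labels.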
Abstract: An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or “rubrics”, demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models’ task accuracy on difficult domains by as much as 45% over models trained by general LLMs-as-judges, and to approach the performance of models trained with verifiable rewards while using as few as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
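[24] Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations cs.CL
Shamik Bhattacharya, Daniel Perkins, Yaren Dogan, Vineeth Konjeti, Sudarshan Srinivasan
TL;DR: This paper proposes an interpretable Visual Word Sense Disambiguation (VWSD) framework that uses CLIP to map ambiguous text and candidate images into a shared multimodal space. It enriches text embeddings with dual-channel text prompts (semantic and photo-based) combined with WordNet synonyms, refines image embeddings through robust test-time augmentation, and uses cosine similarity to select the image that best matches the ambiguous text.
Details
Motivation: Lexical ambiguity poses a persistent challenge for natural language understanding in large language models; the work explores how it can be resolved through the visual domain.
Result: On the SemEval-2023 VWSD dataset, the enriched embeddings raise MRR from 0.7227 to 0.7590 and Hit Rate from 0.5810 to 0.6220. Ablation studies show that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains.
Insight: The novelty lies in combining dual-channel text prompts (semantic and photo-based) with WordNet synonyms to enrich CLIP's text representations and in refining visual representations through test-time image augmentation. Viewed objectively, precise CLIP-aligned prompts are more effective than noisy external signals (such as WordNet definitions or multilingual prompt ensembles), which underscores the importance of preserving semantic specificity for visual word sense disambiguation.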
Abstract: Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
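The ranking and scoring steps mentioned in this abstract (cosine similarity, MRR, Hit Rate) can be illustrated with a short Python sketch. The embeddings below are toy two-dimensional vectors rather than CLIP outputs, the helper names are hypothetical, and Hit Rate is assumed here to mean the fraction of queries whose gold image is ranked first.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_of_gold(text_emb, image_embs, gold_index):
    """1-based rank of the gold image when candidates are sorted by similarity."""
    scores = [cosine(text_emb, img) for img in image_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(gold_index) + 1


def mrr_and_hit_rate(ranks):
    """Mean reciprocal rank and the fraction of queries ranked first."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hit = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, hit


# Toy embeddings standing in for the text and candidate-image embeddings.
text = [1.0, 0.0]
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ranks = [rank_of_gold(text, images, gold_index=0),   # gold is the best match
         rank_of_gold(text, images, gold_index=2)]   # gold is second best
mrr, hit = mrr_and_hit_rate(ranks)
print(ranks, mrr, hit)
```

With ranks of 1 and 2, the sketch gives an MRR of 0.75 and a Hit Rate of 0.5, which shows why MRR is always at least as large as Hit Rate: it also credits near misses.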
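[25] SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks cs.CL
Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu
TL;DR: This paper proposes SEMA, a simple yet effective framework for learning multi-turn jailbreak attacks. It trains the attacker with prefilling self-tuning and reinforcement learning with an intent-drift-aware reward, achieving state-of-the-art attack success rates across multiple datasets and victim models.
Details
Motivation: Existing approaches break down under exploration complexity and intent drift, and cannot effectively address multi-turn jailbreaks, which represent the realistic threat model.
Result: On AdvBench, SEMA achieves an average 80.1% ASR@1 across closed-source and open-source victim models, 33.9% higher than the previous best method, reaching SOTA.
Insight: Contributions include prefilling self-tuning, which stabilizes learning; an intent-drift-aware reward that keeps the harmful objective consistent; and an open-loop attack regime that reduces exploration complexity and unifies single- and multi-turn settings.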
Abstract: Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average 80.1% ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% above the previous SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
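[26] InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning cs.CL | cs.AI
Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen
TL;DR: This paper proposes InftyThink+, an end-to-end reinforcement learning framework for optimizing infinite-horizon iterative reasoning. Using model-controlled iteration boundaries and explicit summarization, the model learns when to summarize, what to preserve, and how to resume reasoning, addressing the quadratic cost, context length limits, and lost-in-the-middle degradation of conventional chain-of-thought.
Details
Motivation: Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this incurs quadratic cost, hits context length limits, and degrades reasoning quality due to lost-in-the-middle effects. Existing iterative reasoning methods rely on supervised learning or fixed heuristics and fail to optimize key decisions such as when to summarize, what to preserve, and how to resume reasoning.
Result: Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy on AIME24 by 21%, clearly outperforms conventional long chain-of-thought reinforcement learning, and generalizes better to out-of-distribution benchmarks. It also significantly reduces inference latency and accelerates reinforcement learning training.
Insight: The core innovation is the use of end-to-end reinforcement learning to optimize the entire iterative reasoning trajectory, in particular the strategic decision of when to summarize. The two-stage training scheme (supervised cold start followed by trajectory-level reinforcement learning) lets the model learn strategic summarization and continuation decisions, improving reasoning efficiency along with performance.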
Abstract: Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
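The control flow of iterative reasoning with explicit summarization can be sketched in Python as below. This shows only the outer loop under loud assumptions: `iterative_reason` and `toy_step` are hypothetical names, the toy step function stands in for a language model, and its fixed two-item chunk is a hard-coded heuristic, whereas InftyThink+ learns the iteration boundaries through reinforcement learning.

```python
def iterative_reason(question, step_fn, max_iters=8):
    """Run reasoning in bounded iterations, carrying only a summary between
    them, so each call sees the question plus a short summary rather than
    the full history."""
    summary = ""
    for i in range(max_iters):
        out = step_fn(question, summary)
        if out["answer"] is not None:
            return out["answer"], i + 1   # answer and number of iterations used
        summary = out["summary"]           # discard the trace, keep the summary
    return None, max_iters


def toy_step(question, summary):
    """Stand-in for a model call: sums a list two items per iteration.
    The summary string encodes 'next_index,partial_sum'."""
    if summary:
        idx, total = (int(x) for x in summary.split(","))
    else:
        idx, total = 0, 0
    chunk = question[idx:idx + 2]
    total += sum(chunk)
    idx += len(chunk)
    if idx >= len(question):
        return {"answer": total, "summary": None}
    return {"answer": None, "summary": f"{idx},{total}"}


answer, iters = iterative_reason([1, 2, 3, 4, 5], toy_step)
print(answer, iters)
```

Each iteration works from the question and a constant-size summary, which is the property that avoids the growing context of a single long chain-of-thought.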
cs.CV
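[27] From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors cs.CV
Ding-Jiun Huang, Yuanhao Wang, Shao-Ji Yuan, Albert Mosella-Montoro, Francisco Vicente Carrasco
TL;DR: This paper proposes SuperHead, a framework for generating high-fidelity, animatable 3D talking heads from low-quality image or video sources. It leverages priors from pre-trained 3D generative models through a novel dynamics-aware 3D inversion scheme that optimizes the generative model's latent representation, producing a super-resolved 3D Gaussian Splatting (3DGS) head model that can be rigged to a parametric head model (e.g., FLAME) for animation.
Details
Motivation: Low-quality image or video sources yield poor 3D reconstructions when creating high-fidelity, animatable 3D talking heads. The core challenge is to synthesize high-quality geometry and textures while ensuring 3D and temporal consistency during animation and preserving subject identity.
Result: Experiments show that SuperHead generates avatars with fine-grained facial details under dynamic facial motions and significantly outperforms baseline methods in visual quality.
Insight: The main novelty is a dynamics-aware 3D inversion scheme that applies the rich priors of pre-trained 3D generative models to super-resolution of dynamic 3D inputs, with joint supervision from upscaled 2D face renderings and depth maps to ensure realism under dynamic facial motions. Viewed objectively, the method combines 3D generative priors with the 3DGS representation and parametric-model animation, providing a novel end-to-end framework for enhancing low-quality dynamic 3D heads.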
Abstract: Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
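[28] EgoAVU: Egocentric Audio-Visual Understanding cs.CV
Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang
TL;DR: This paper proposes EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers, addressing the poor performance of multimodal large language models (MLLMs) on egocentric video caused by the lack of text labels with joint-modality information. By building the large-scale training dataset EgoAVU-Instruct and the evaluation benchmark EgoAVU-Bench and fine-tuning existing models, the work substantially improves joint audio-visual understanding.
Details
Motivation: Existing MLLMs accept both visual and audio inputs, but because text labels with coherent joint-modality information are lacking, their ability to jointly understand both modalities in egocentric video remains under-explored.
Result: Fine-tuned models improve by up to 113% on EgoAVU-Bench and achieve relative gains of up to 28% on other benchmarks such as EgoTempo and EgoIllusion.
Insight: The novelty lies in generating audio-visual narrations automatically through cross-modal correlation modeling and in using token-based video filtering and modular, graph-based curation to ensure data diversity and quality. The resulting large-scale, high-quality instruction-tuning dataset and evaluation benchmark effectively mitigate existing models' heavy bias toward visual signals.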
Abstract: Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.
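[29] MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes cs.CV
Luoxi Zhang, Chun Xie, Itaru Kitahara
TL;DR: MGP-KAD is a novel multimodal feature fusion framework for single-view 3D reconstruction in complex scenes. It integrates RGB and geometric priors to improve reconstruction accuracy and uses a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to handle complex multimodal inputs.
Details
Motivation: Single-view 3D reconstruction in complex real-world scenes is challenging because of noise, object diversity, and limited dataset availability.
Result: Extensive experiments on the Pix3D dataset show that MGP-KAD achieves state-of-the-art performance, significantly improving geometric integrity, smoothness, and detail preservation.
Insight: Contributions include class-level geometric prior features, generated by sampling and clustering ground-truth object data and adjusted dynamically during training to improve geometric understanding, and a KAN-based hybrid decoder that overcomes the limitations of traditional linear decoders on complex multimodal inputs.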
Abstract: Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. To address these challenges, we propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB and geometric prior to enhance reconstruction accuracy. The geometric prior is generated by sampling and clustering ground-truth object data, producing class-level features that dynamically adjust during training to improve geometric understanding. Additionally, we introduce a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limitations of traditional linear decoders in processing complex multimodal inputs. Extensive experiments on the Pix3D dataset demonstrate that MGP-KAD achieves state-of-the-art (SOTA) performance, significantly improving geometric integrity, smoothness, and detail preservation. Our work provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes.
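[30] Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving cs.CV
Xuyang Chen, Conglang Zhang, Chuanheng Fu, Zihao Yang, Kaixuan Zhou
TL;DR: This paper proposes Driving with DINO (DwD), a framework that uses features from vision foundation models (such as DINOv3) as a unified bridge to resolve the “Consistency-Realism Dilemma” in Sim2Real (simulation-to-real) video generation for autonomous driving. Techniques such as Principal Subspace Projection and Random Channel Tail Drop remove synthetic artifacts while preserving structural detail; combined with a learnable Spatial Alignment Module and a Causal Temporal Aggregator, they enable photorealistic video generation with precise control and temporal stability.
Details
Motivation: Existing Sim2Real methods for autonomous driving based on controllable video diffusion rely on explicit intermediate representations (such as edges, depth, and semantic maps) to bridge the domain gap, but these face a “Consistency-Realism Dilemma”: low-level signals give precise control but introduce synthetic artifacts that harm realism, whereas high-level priors promote realism but lack the detail needed to maintain structural consistency.
Result: The paper validates DwD on autonomous driving video generation, where it substantially improves the visual realism of generated videos while maintaining temporal stability and control consistency.
Insight: The core innovation is to use vision foundation model (VFM) features as a unified, information-rich intermediate representation that naturally encodes a spectrum of information from high-level semantics to fine-grained structure. Principal Subspace Projection, Random Channel Tail Drop, a learnable Spatial Alignment Module, and a Causal Temporal Aggregator respectively address texture “baking”, structural loss, high-resolution adaptation, and temporal consistency, striking a better balance between control and realism.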
Abstract: Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by “baking in” synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for “texture baking,” while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3’s high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/
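[31] M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning cs.CV
Bangji Yang, Ruihan Guo, Jiajun Fan, Chaoran Cheng, Ge Liu
TL;DR: This paper proposes M3, a training-free text-to-image generation framework based on multi-modal, multi-agent, multi-round visual reasoning, designed to address the difficulty existing generative models have with complex compositional prompts. M3 orchestrates off-the-shelf foundation models and refines outputs iteratively at inference time, substantially improving the fidelity and compositional ability of generated images.
Details
Motivation: Existing generative models achieve high fidelity in text-to-image synthesis but still struggle with complex compositional prompts involving multiple constraints; this work aims to resolve these failures systematically through inference-time refinement.
Result: On the challenging OneIG-EN benchmark, Qwen-Image with M3 surpasses commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets.
Insight: The novelty lies in a training-free, plug-and-play multi-agent reasoning framework: a Planner decomposes the prompt, Checker, Refiner, and Editor agents correct constraints one at a time, and a Verifier ensures monotonic improvement. This collaborative loop improves compositional generation for any pre-trained text-to-image model and establishes a new paradigm for compositional generation without costly retraining.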
Abstract: Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
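The check, edit, and verify cycle described in this abstract can be sketched as a small Python loop. This is a toy under stated assumptions, not the M3 system: the function `refine` and its callbacks are hypothetical, the "image" is modeled as the set of constraints it satisfies, and the agents that would be foundation models are replaced by one-line stand-ins. The Planner's checklist is taken as given.

```python
def refine(image, checklist, check, edit, score, max_rounds=3):
    """Multi-round refinement: fix one failed constraint per round and keep
    the edit only if the overall score improves (monotonic improvement)."""
    best, best_score = image, score(image, checklist)
    for _ in range(max_rounds):
        failed = [c for c in checklist if not check(best, c)]  # Checker
        if not failed:
            break                                   # every constraint passes
        candidate = edit(best, failed[0])           # Editor: one constraint
        cand_score = score(candidate, checklist)
        if cand_score > best_score:                 # Verifier: accept only gains
            best, best_score = candidate, cand_score
    return best, best_score


# Toy stand-ins: an "image" is the frozenset of constraints it satisfies.
checklist = ["red cube", "blue sphere", "cube left of sphere"]
check = lambda img, c: c in img
edit = lambda img, c: img | {c}
score = lambda img, cl: sum(1 for c in cl if c in img)

best, best_score = refine(frozenset({"red cube"}), checklist, check, edit, score)
print(sorted(best), best_score)
```

Because a candidate replaces the current best only when its score is strictly higher, the returned score can never fall below the starting score, which is one simple way to realize the monotonic-improvement guarantee.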
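[32] PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining cs.CV | cs.CL
Cheng Liang, Chaoyi Wu, Weike Zhao, Ya Zhang, Yanfeng Wang
TL;DR: This paper proposes PhenoLIP, a new framework that integrates phenotype ontology knowledge into medical vision-language pretraining. To address the failure of existing medical VLMs to systematically capture medical phenotype ontology knowledge, the authors construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph, and design a two-stage pretraining method on top of it that injects structured knowledge into the model through knowledge distillation. They also introduce PhenoBench, an expert-verified benchmark for evaluation. Experiments show that PhenoLIP significantly outperforms existing SOTA methods on phenotype classification and cross-modal retrieval.
Details
Motivation: Existing large-scale medical vision-language models rely mainly on coarse image-text contrastive objectives and fail to exploit the systematic visual knowledge encoded in well-defined medical phenotype ontologies, which limits structured understanding of medical images.
Result: On the authors' PhenoBench benchmark, PhenoLIP improves upon BiomedCLIP in phenotype classification accuracy by 8.85% and upon BIOMEDICA in cross-modal retrieval by 15.03%, setting a new SOTA.
Insight: The core contribution is the first large-scale, phenotype-centric multimodal knowledge graph (PhenoKG) together with a two-stage knowledge-enhanced pretraining framework: a knowledge-enhanced phenotype embedding space is first learned from textual ontology data, and this structured knowledge is then distilled into multimodal pretraining through a teacher-guided knowledge distillation objective. This explicitly integrates domain priors into the VLM and improves structured, interpretable medical image understanding.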
Abstract: Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image–caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
[33] DeDPO: Debiased Direct Preference Optimization for Diffusion Models cs.CVPDF
Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina
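TL;DR: This paper proposes DeDPO (Debiased Direct Preference Optimization), a semi-supervised framework that addresses the scarcity and high cost of high-quality human preference data for aligning diffusion models. By integrating a debiased estimation technique from causal inference, the method uses large-scale, low-cost synthetic AI feedback (such as labels from self-training and vision-language models) to augment limited human data, enabling scalable alignment training while maintaining performance.
Details
Motivation: Direct Preference Optimization (DPO) has become a predominant alignment method for diffusion models, but its reliance on large-scale, high-quality human preference labels creates a severe cost and scalability bottleneck. This paper aims to overcome that limitation by augmenting the data with cost-effective synthetic AI feedback.
Result: Experiments show that DeDPO is robust to variations in synthetic labeling methods, matching and occasionally exceeding the theoretical upper bound of models trained on fully human-labeled data. This establishes it as a scalable solution for human-AI alignment with inexpensive synthetic supervision.
Insight: The core contribution is integrating a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, the method learns robustly from imperfect feedback sources, including self-training and vision-language models. This offers a new route to efficient model alignment with low-cost synthetic data.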
Abstract: Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to the variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
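The abstract does not spell out the estimator, so the following is only a rough illustration of what "debiasing" synthetic labels with a small human-labeled set can look like. It uses a generic plug-in-plus-correction form: average a per-pair loss computed with synthetic labels on the large unlabeled pool, then add the mean difference between human-label and synthetic-label losses on the small labeled set. The function name `debiased_mean` and the toy numbers are hypothetical and are not taken from the paper.

```python
def debiased_mean(synth_loss_unlabeled, synth_loss_labeled, human_loss_labeled):
    """Generic debiased estimate of an average loss (illustrative assumption).

    synth_loss_unlabeled: losses computed with synthetic labels on the large pool.
    synth_loss_labeled:   losses computed with synthetic labels on the small human-labeled set.
    human_loss_labeled:   losses computed with human labels on the same small set.
    """
    # Plug-in term: cheap but possibly biased by the synthetic annotator.
    plug_in = sum(synth_loss_unlabeled) / len(synth_loss_unlabeled)
    # Correction term: estimates the annotator's bias where human labels exist.
    correction = sum(
        h - s for h, s in zip(human_loss_labeled, synth_loss_labeled)
    ) / len(human_loss_labeled)
    return plug_in + correction


# Toy example: the synthetic annotator underestimates the loss by about 0.1.
estimate = debiased_mean([0.2, 0.4], [0.3, 0.5], [0.4, 0.6])
print(round(estimate, 6))
```

If the synthetic annotator were unbiased, the correction term would average to zero and the estimate would reduce to the plug-in term.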
[34] DroneKey++: A Size Prior-free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images cs.CVPDF
Seo-Bin Hwang, Yeong-Jun Cho
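TL;DR: This paper proposes DroneKey++, a drone 3D pose estimation framework that requires no prior size information and jointly performs keypoint detection, drone classification, and pose estimation. To overcome the small scale and constrained environments of existing datasets, the authors also build 6DroneSyn, a large-scale synthetic dataset for reliable validation of generalization.
Details
Motivation: Existing drone 3D pose estimation methods usually depend on priors such as physical size or 3D meshes, and existing datasets are small, limited to single models, and collected in constrained environments, which makes it hard to validate generalization reliably.
Result: On the 6DroneSyn dataset, DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, and MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS on CPU and 414.07 FPS on GPU, showing strong generalization across drone models and potential for real-time use.
Insight: The contributions are an end-to-end framework that does not rely on size priors, combining a keypoint encoder with a pose decoder that uses ray-based geometric reasoning, and a large-scale, multi-model, multi-background synthetic dataset built with 360-degree panoramic synthesis, which provides a new benchmark for the field.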
Abstract: Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.
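The reported metrics are mean absolute error (MAE) and median absolute error (MedAE). A minimal sketch of how these two statistics are computed from per-frame errors is shown below; the error values are made up for illustration and are not from the paper.

```python
from statistics import mean, median


def mae(errors):
    """Mean absolute error over a sequence of signed errors."""
    return mean(abs(e) for e in errors)


def medae(errors):
    """Median absolute error; less sensitive to a few large outliers than MAE."""
    return median(abs(e) for e in errors)


# Hypothetical per-frame rotation errors in degrees.
rotation_errors_deg = [5.0, -10.0, 20.0, 40.0]
print(mae(rotation_errors_deg), medae(rotation_errors_deg))
```

Reporting both is useful because a gap between the two indicates a skewed error distribution.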
[35] Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings cs.CV | cs.LGPDF
Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard
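TL;DR: This paper proposes the Iso-Energy Assumption, which is based on cross-modal redundancy, and uses an Aligned Sparse Autoencoder (SAE) to probe the geometry of the shared embedding space of vision-language models (VLMs). Encouraging energy consistency during training yields a more interpretable representation without harming reconstruction, revealing key structural properties: bimodal atoms carry the alignment signal, and unimodal atoms explain the modality gap.
Details
Motivation: Although vision-language models align images and text with remarkable success, the geometry of their shared embedding space remains poorly understood. This paper probes that geometry through a cross-modal redundancy assumption to improve model interpretability and controllability.
Result: Controlled experiments with known ground truth validate the Iso-Energy Assumption: alignment improves when the assumption holds and remains neutral when it does not. Applied to foundational VLMs, the method reveals a clear structure: sparse bimodal atoms carry the entire cross-modal alignment signal; unimodal atoms fully explain the modality gap; removing unimodal atoms collapses the gap without harming performance; and vector arithmetic restricted to the bimodal subspace yields in-distribution edits and improved retrieval.
Insight: The contribution is to propose the Iso-Energy Assumption as a geometric constraint derived from cross-modal redundancy and to operationalize it with an Aligned Sparse Autoencoder, making the latent geometry interpretable and actionable without sacrificing model fidelity. Viewed objectively, the method offers a new analysis tool for understanding representation learning in VLMs and shows that a suitable inductive bias can balance performance and interpretability.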
Abstract: Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
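The abstract states the Iso-Energy Assumption only in words: a truly shared concept should have the same average energy across modalities. As a hedged sketch, the code below takes "energy" to mean the mean squared activation of each SAE atom within a modality (an assumption; the paper may define it differently) and reports the per-atom gap between image and text codes. All names and numbers are hypothetical.

```python
def atom_energy(codes):
    """Mean squared activation of each atom over a batch of sparse codes.

    codes: list of samples, each a list with one activation per atom.
    """
    n_samples = len(codes)
    n_atoms = len(codes[0])
    return [sum(row[k] ** 2 for row in codes) / n_samples for k in range(n_atoms)]


def iso_energy_gap(image_codes, text_codes):
    """Absolute per-atom energy difference between the two modalities."""
    e_img = atom_energy(image_codes)
    e_txt = atom_energy(text_codes)
    return [abs(a - b) for a, b in zip(e_img, e_txt)]


# Atom 0 fires equally in both modalities; atom 1 fires only on images.
image_codes = [[1.0, 0.0], [1.0, 2.0]]
text_codes = [[1.0, 0.0], [-1.0, 0.0]]
print(iso_energy_gap(image_codes, text_codes))
```

Under this reading, a training penalty on the gap would push shared atoms toward equal energy, and atoms with a persistently large gap would behave like the unimodal atoms the abstract describes.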
[36] ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos cs.CVPDF
Yuantao Chen, Jiahao Chang, Chongjie Ye, Chaoran Zhang, Zhaojie Fang
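TL;DR: ForeHOI is a novel feed-forward model that reconstructs 3D object geometry directly from monocular hand-object interaction videos, without any pre-processing, within one minute of inference time. By jointly predicting 2D mask inpainting and 3D shape completion, it handles severe hand occlusion effectively. Trained on a synthetic dataset, it achieves SOTA performance with a speedup of about 100x.
Details
Motivation: To address the challenge of reconstructing 3D objects from daily hand-object interaction videos, which involve severe occlusion and the complex, coupled motion of the camera, hands, and object.
Result: The model achieves state-of-the-art (SOTA) performance in object reconstruction, significantly outperforming previous methods with a speedup of about 100x.
Insight: The contribution is the joint prediction of 2D mask inpainting and 3D shape completion within a feed-forward framework, where information exchange between the 2D and 3D completion improves reconstruction quality. The authors also contribute the first large-scale, high-fidelity synthetic hand-object interaction dataset to support training.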
Abstract: The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby outperforming optimization-based methods. The information exchange between the 2D and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup. Code and data are available at: https://github.com/Tao-11-chen/ForeHOI.
[37] An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes cs.CV | q-bio.QMPDF
Marilyn Lionts, Arnhildur Tomasdottir, Viktor I. Agustsson, Yuankai Huo, Hans T. Bjornsson
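TL;DR: This study presents an interpretable vision transformer-based deep learning model that uses fingerprint images to distinguish individuals with Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) from healthy controls and from one another. The model performs well on three binary classification tasks, and attention visualizations add interpretability. The results suggest that fingerprint features are syndrome-specific, offering a noninvasive, interpretable, and accessible diagnostic aid for rare genetic syndromes.
Details
Motivation: Kabuki syndrome and Wiedemann-Steiner syndrome are rare developmental disorders with overlapping clinical features. Genetic testing is the diagnostic gold standard, but many patients remain undiagnosed because of limited access to testing and expert interpretation. Dermatoglyphic (fingerprint) anomalies are established hallmarks of several genetic syndromes yet remain underused in the era of molecular testing. This study explores the feasibility of fingerprint images as a noninvasive diagnostic signal.
Result: Across the three binary classification tasks, the model achieved an AUC of 0.80 (F1 0.71) for control vs. KS, 0.73 (F1 0.72) for control vs. WSS, and 0.85 (F1 0.83) for KS vs. WSS, indicating that it can separate the classes.
Insight: The contribution is applying a vision transformer to fingerprint-based diagnosis of rare genetic syndromes and using attention visualization for interpretation, which improves trust and clinical acceptability. Viewed objectively, this shows the potential of combining deep learning with a biometric signal (fingerprints) for diagnostic support, and suggests a low-cost, noninvasive screening tool for settings where resources are limited or genetic testing is hard to obtain.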
Abstract: Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance across three binary classification tasks. Across the three classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.
[38] MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training cs.CVPDF
Lucia Gordon, Serge Belongie, Christian Igel, Nico Lang
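TL;DR: This paper introduces MMEarth-Bench, a dataset of five new multimodal environmental tasks for evaluating multimodal pretrained models at global scale. The study finds that although multimodal pretraining improves robustness in limited-data settings, geographic generalization remains poor. The paper therefore proposes TTT-MMR, a model-agnostic test-time training method that uses all modalities available at test time as auxiliary tasks to improve adaptation to new downstream tasks and geographic domains.
Details
Motivation: Existing geospatial benchmark datasets usually have few modalities and poor global representation, which limits the evaluation of multimodal pretrained models at global scale. A multimodal, globally distributed dataset is needed to fill this gap.
Result: Benchmarking a range of pretrained models on MMEarth-Bench shows that TTT-MMR improves performance on both the random and geographic test splits, and that geographic batching gives a good trade-off between regularization and specialization during TTT.
Insight: The contributions are MMEarth-Bench, the first multimodal, globally distributed geospatial benchmark dataset, and TTT-MMR, a model-agnostic test-time training method that flexibly uses all available modalities as auxiliary tasks to improve geographic generalization and adaptation.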
Abstract: Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at lgordon99.github.io/mmearth-bench.
[39] Unsupervised MRI-US Multimodal Image Registration with Multilevel Correlation Pyramidal Optimization cs.CVPDF
Jiazheng Wang, Zeyu Liu, Min Liu, Xiang Chen, Hang Zhang
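TL;DR: This paper proposes an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO). It targets the difficulty of registering preoperative and intraoperative multimodal images (such as MRI and ultrasound) in surgical navigation, which arises from modality differences and tissue deformation. The method first extracts features with the modality independent neighborhood descriptor to map the images into a feature space, then applies a multilevel pyramidal fusion optimization mechanism that combines dense correlation analysis with weight-balanced coupled convex optimization at different scales, achieving global optimization of the displacement field complemented by local detail.
Details
Motivation: In surgical navigation, the inherent differences between multimodal images (such as MRI and ultrasound) and the image deformation caused by intraoperative tissue displacement and removal make effective registration of preoperative and intraoperative multimodal images very challenging.
Result: The method took first place in both the validation and test phases of the ReMIND2Reg task in the Learn2Reg 2025 challenge. It was also validated on the Resect dataset, reaching an average target registration error (TRE) of 1.798 mm, which supports its broad applicability to preoperative-to-intraoperative registration.
Insight: The contribution is combining feature extraction via the modality independent neighborhood descriptor with a new multilevel pyramidal fusion optimization mechanism. Through dense correlation analysis and weight-balanced coupled convex optimization, the mechanism jointly optimizes the displacement field across scales, balancing global consistency and local detail, and provides an effective optimization framework for unsupervised multimodal registration.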
Abstract: Surgical navigation based on multimodal image registration has played a significant role in providing intraoperative guidance to surgeons by showing the relative position of the target area to critical anatomical structures during surgery. However, due to the differences between multimodal images and intraoperative image deformation caused by tissue displacement and removal during the surgery, effective registration of preoperative and intraoperative multimodal images faces significant challenges. To address the multimodal image registration challenges in Learn2Reg 2025, we design an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO). First, the features of each modality are extracted based on the modality independent neighborhood descriptor, and the multimodal images are mapped to the feature space. Second, a multilevel pyramidal fusion optimization mechanism is designed to achieve global optimization and local detail complementation of the displacement field through dense correlation analysis and weight-balanced coupled convex optimization for input features at different scales. Our method focuses on the ReMIND2Reg task in Learn2Reg 2025, where it achieved first place in both the validation phase and the test phase. MCPO is also validated on the Resect dataset, achieving an average TRE of 1.798 mm. This demonstrates the broad applicability of our method in preoperative-to-intraoperative image registration. The code is available at https://github.com/wjiazheng/MCPO.
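The headline number is the target registration error (TRE), which is conventionally the Euclidean distance between corresponding landmarks after registration, averaged over landmark pairs. A minimal sketch follows; the landmark coordinates and the `spacing` parameter (voxel size in mm per axis) are illustrative assumptions, not details from the paper.

```python
import math


def mean_tre(warped_landmarks, fixed_landmarks, spacing=(1.0, 1.0, 1.0)):
    """Mean Euclidean distance (in mm) between paired 3D landmarks.

    warped_landmarks: moving-image landmarks after applying the estimated transform.
    fixed_landmarks:  corresponding landmarks in the fixed image.
    spacing:          voxel size per axis, used to convert voxel offsets to mm.
    """
    distances = [
        math.sqrt(sum(((a - b) * s) ** 2 for a, b, s in zip(p, q, spacing)))
        for p, q in zip(warped_landmarks, fixed_landmarks)
    ]
    return sum(distances) / len(distances)


# Two hypothetical landmark pairs: one off by (3, 4, 0) voxels, one perfectly aligned.
print(mean_tre([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)]))
```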
[40] Accelerating Vision Transformers on Brain Processing Unit cs.CV | cs.AIPDF
Jinchi Tang, Yan Guo
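TL;DR: This paper proposes replacing the linear layers and layer normalization operations in Vision Transformer (ViT) models such as DeiT with convolutional operations. This resolves the mismatch between CNN-optimized hardware (such as the BPU) and the computation characteristics of ViTs, enabling accelerated ViT deployment on BPUs.
Details
Motivation: CNN-optimized hardware such as the BPU is designed for four-dimensional convolution operations, whereas the linear layers in ViTs operate on three-dimensional data, so ViTs struggle to benefit from BPU acceleration. This paper addresses that architectural mismatch so that ViTs can run efficiently on BPUs.
Result: On ImageNet, the quantized DeiT-Base model reaches 80.4% accuracy (versus 81.8% for the original model) with up to a 3.8× inference speedup. On a flower classification dataset, the fine-tuned DeiT-Base model loses only 0.5% accuracy, demonstrating the method's effectiveness.
Insight: The contribution is restructuring the ViT's linear layers and layer normalization as convolutional operations so that the model inherits the original weights directly and can be accelerated on BPUs without retraining or fine-tuning. According to the authors, this is the first successful BPU-accelerated ViT deployment, and it offers a new approach to hardware adaptation.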
Abstract: With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) have emerged as dedicated platforms for CNN acceleration, offering optimized INT8 computation capabilities for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, due to the architectural mismatch between CNN-optimized hardware and Vision Transformer computation characteristics (namely, that linear layers in Transformers operate on three-dimensional data while BPU acceleration is designed for four-dimensional convolution operations), it is difficult or even impossible to leverage BPU’s advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration. Experiments on classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8× inference speedup. Our finetuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.
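The weight-inheritance claim rests on a standard identity: a linear layer applied independently to each token computes the same values as a 1×1 convolution over a feature map whose spatial positions are the tokens. The pure-Python sketch below checks that identity on a toy example. It illustrates the general equivalence only; the paper's actual operators (including its treatment of layer normalization and quantization) are not described in the abstract, and all names here are hypothetical.

```python
def linear(tokens, weight, bias):
    """Per-token linear layer. tokens: [N][C_in], weight: [C_out][C_in], bias: [C_out]."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def conv1x1(fmap, weight, bias):
    """1x1 convolution. fmap: [C_in][H][W] -> [C_out][H][W], same weight layout as linear."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(weight[o][c] * fmap[c][i][j] for c in range(c_in)) + bias[o] for j in range(w)]
            for i in range(h)
        ]
        for o in range(len(weight))
    ]


def tokens_to_fmap(tokens, h, w):
    """Reshape [H*W][C] tokens into a channel-first [C][H][W] feature map."""
    return [[[tokens[i * w + j][c] for j in range(w)] for i in range(h)] for c in range(len(tokens[0]))]


def fmap_to_tokens(fmap):
    """Inverse of tokens_to_fmap."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[k][i][j] for k in range(c)] for i in range(h) for j in range(w)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 tokens, 2 channels
weight = [[1.0, 0.0], [0.5, -1.0], [2.0, 2.0]]              # 3 output channels
bias = [0.0, 1.0, -1.0]

via_linear = linear(tokens, weight, bias)
via_conv = fmap_to_tokens(conv1x1(tokens_to_fmap(tokens, 2, 2), weight, bias))
print(via_linear == via_conv)
```

Because the same weight matrix is reused as the convolution kernel, no retraining is needed for this step, which is consistent with the abstract's claim about inherited parameters.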
[41] Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering cs.CVPDF
Weiquan Wang, Feifei Shao, Lin Li, Zhen Wang, Jun Xiao
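TL;DR: This paper proposes U-4DGS, an uncertainty-aware 4D Gaussian splatting framework for rendering occluded dynamic humans from monocular video. It reformulates the task as a Maximum A Posteriori estimation problem under heteroscedastic observation noise. A Probabilistic Deformation Network and a Double Rasterization pipeline produce pixel-aligned uncertainty maps that adaptively modulate gradients and suppress artifacts from unreliable observations, while Confidence-Aware Regularizations prevent geometric drift.
Details
Motivation: When rendering dynamic humans under occlusion, existing methods either rely on generative models to fill in missing content, which causes temporal flickering, or use rigid geometric heuristics that cannot capture diverse appearances, so rendering quality degrades severely. This paper targets high-fidelity rendering of occluded dynamic humans from monocular video.
Result: Extensive experiments on the ZJU-MoCap and OcMotion datasets show that U-4DGS achieves SOTA rendering fidelity and robustness.
Insight: The contributions are recasting occluded rendering as Maximum A Posteriori estimation under heteroscedastic noise, introducing a Probabilistic Deformation Network and Double Rasterization to model uncertainty explicitly, and using the learned uncertainty for confidence-aware regularization that preserves spatio-temporal consistency. This gives a reusable framework for handling uncertainty and occlusion in dynamic scene rendering.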
Abstract: High-fidelity rendering of dynamic humans from monocular videos typically degrades catastrophically under occlusions. Existing solutions incorporate external priors, either hallucinating missing content via generative models, which induces severe temporal flickering, or imposing rigid geometric heuristics that fail to capture diverse appearances. To this end, we reformulate the task as a Maximum A Posteriori estimation problem under heteroscedastic observation noise. In this paper, we propose U-4DGS, a framework integrating a Probabilistic Deformation Network and a Double Rasterization pipeline. This architecture renders pixel-aligned uncertainty maps that act as an adaptive gradient modulator, automatically attenuating artifacts from unreliable observations. Furthermore, to prevent geometric drift in regions lacking reliable visual cues, we enforce Confidence-Aware Regularizations, which leverage the learned uncertainty to selectively propagate spatial-temporal validity. Extensive experiments on ZJU-MoCap and OcMotion demonstrate that U-4DGS achieves SOTA rendering fidelity and robustness.
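To see why a per-pixel uncertainty map can act as a gradient modulator, consider the standard Gaussian negative log-likelihood with pixel-dependent variance, which is the usual loss under heteroscedastic noise. The sketch below is that textbook form, not necessarily the exact objective used in U-4DGS; the function names and values are hypothetical.

```python
import math


def heteroscedastic_nll(pred, target, log_var):
    """Average Gaussian NLL (up to a constant) with a per-pixel log-variance.

    Each pixel contributes 0.5 * exp(-s) * (pred - target)^2 + 0.5 * s,
    where s is the predicted log-variance for that pixel.
    """
    total = 0.0
    for p, t, s in zip(pred, target, log_var):
        total += 0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
    return total / len(pred)


def residual_weight(log_var):
    """Factor that scales the gradient of the squared residual for one pixel."""
    return math.exp(-log_var)


# A confident pixel (log-variance 0) versus an uncertain, possibly occluded one (log-variance 2).
print(residual_weight(0.0), residual_weight(2.0))
print(heteroscedastic_nll([1.0, 1.0], [0.0, 0.0], [0.0, 2.0]))
```

Pixels with high predicted variance contribute a down-weighted residual, while the `0.5 * s` term keeps the model from declaring every pixel uncertain.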
[42] Revisiting Salient Object Detection from an Observer-Centric Perspective cs.CV | cs.AIPDF
Fuxi Zhang, Yifan Wang, Hengrun Zhao, Zhuohan Sun, Changxing Xia
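TL;DR: This paper proposes Observer-Centric Salient Object Detection (OC-SOD), which accounts for subjective factors such as observer-specific preferences or intents to address the inherent ambiguity and diversity of conventional salient object detection. Using multimodal large language models, the authors build OC-SODBench, the first OC-SOD dataset, and design OC-SODAgent, an agentic baseline based on a "Perceive-Reflect-Adjust" process.
Details
Motivation: Existing salient object detection methods usually treat the task as objective prediction with a single ground-truth segmentation map, ignoring the subjective differences in perception that arise from observers' differing priors. This leaves the problem under-determined and ill-posed. The paper introduces an observer-centric perspective to narrow the gap between human perception and computational modeling.
Result: Extensive experiments on the newly built OC-SODBench dataset (33k images and 152k textual prompt-object pairs) validate the effectiveness of the proposed approach.
Insight: The core contribution is redefining salient object detection as a subjective task that considers observer-specific factors, which enables personalized and context-aware saliency prediction. The efficient construction of a large-scale dataset with multimodal large language models and the agentic baseline that imitates human cognition are also worth noting.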
Abstract: Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single groundtruth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also the observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset named OC-SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like “Perceive-Reflect-Adjust” process. Extensive experiments on our proposed OC-SODBench have justified the effectiveness of our contribution. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly “salient.” Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD
[43] POINTS-GUI-G: GUI-Grounding Journey cs.CVPDF
Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian
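TL;DR: This paper introduces POINTS-GUI-G-8B, a vision-language model focused on graphical user interface (GUI) grounding. Starting from the base model POINTS-1.5, it achieves state-of-the-art results on several GUI grounding benchmarks through refined data engineering, improved training strategies, and reinforcement learning with verifiable rewards.
Details
Motivation: Prior GUI agent work typically fine-tunes models that already have strong spatial awareness. This paper instead starts from a model with weak grounding ability in order to master the full GUI grounding pipeline, which is a prerequisite for automating complex digital tasks such as online shopping and flight booking.
Result: The model scores 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision, reaching state-of-the-art (SOTA) performance.
Insight: The contributions are: (1) refined data engineering, which unifies the formats of diverse open-source datasets and applies augmentation, filtering, and difficulty grading; (2) improved training strategies, such as continuous fine-tuning of the vision encoder to improve perceptual accuracy and keeping resolution consistent between training and inference; and (3) applying reinforcement learning to the perception-intensive GUI grounding task, where rewards are easy to verify and highly accurate, which markedly improves grounding precision.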
Abstract: The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model’s success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
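The abstract's point that grounding rewards are "easily verifiable" can be made concrete with a simple rule: a predicted click earns reward 1 if it lands inside the target element's bounding box and 0 otherwise. The sketch below is one plausible form of such a reward, assumed for illustration; the paper's actual reward design is not given in the abstract, and `click_reward` is a hypothetical name.

```python
def click_reward(point, box):
    """Binary verifiable reward for GUI grounding.

    point: predicted click location (x, y) in pixels.
    box:   ground-truth element box (x_min, y_min, x_max, y_max) in pixels.
    Returns 1.0 when the click falls inside the box (edges included), else 0.0.
    """
    x, y = point
    x_min, y_min, x_max, y_max = box
    return 1.0 if x_min <= x <= x_max and y_min <= y <= y_max else 0.0


# Hypothetical "Submit" button occupying a 100x40 pixel region.
submit_button = (200, 300, 300, 340)
print(click_reward((250, 320), submit_button))
print(click_reward((150, 320), submit_button))
```

No learned reward model is needed for a check like this, which is what makes the signal cheap and hard to game compared with rewards for open-ended reasoning.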
[44] MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing cs.CVPDF
Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv
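TL;DR: This paper proposes MeDocVL, a vision-language model for medical document understanding and parsing. It builds high-quality supervision from noisy annotations through Training-driven Label Refinement and combines this with a Noise-aware Hybrid Post-training strategy (integrating reinforcement learning and supervised fine-tuning) for robust and precise field-level information extraction. On medical invoice benchmarks, MeDocVL outperforms conventional OCR systems and existing VLM baselines, reaching state-of-the-art performance under noisy supervision.
Details
Motivation: To address the challenges that complex layouts, domain-specific terminology, and noisy annotations pose for medical document OCR, and the failure of existing OCR systems and general-purpose vision-language models to parse such documents reliably, especially under strict field-level exact-match requirements.
Result: On medical invoice benchmarks, MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
Insight: The contributions are Training-driven Label Refinement, which builds high-quality supervision from noisy annotations, and a Noise-aware Hybrid Post-training strategy that combines reinforcement learning with supervised fine-tuning to improve robustness and precision on complex medical documents. Viewed objectively, this end-to-end training design for noisy annotations and domain-specific challenges is a useful reference.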
Abstract: Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
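"Strict field-level exact matching" means a field counts as correct only when the extracted string is identical to the reference, so even a small formatting difference is an error. A minimal sketch of such a metric is shown below; the field names and values are invented for illustration, and the paper's exact scoring protocol is not described in the abstract.

```python
def field_exact_match(predicted, reference):
    """Fraction of reference fields whose predicted value matches exactly.

    predicted, reference: dicts mapping field name -> extracted string.
    A missing field or any character-level difference counts as a miss.
    """
    if not reference:
        return 0.0
    hits = sum(1 for name, value in reference.items() if predicted.get(name) == value)
    return hits / len(reference)


# Hypothetical invoice: the total differs only by a trailing zero, yet still counts as wrong.
reference = {"patient_name": "Li Hua", "total": "128.50", "date": "2025-01-03"}
predicted = {"patient_name": "Li Hua", "total": "128.5", "date": "2025-01-03"}
print(field_exact_match(predicted, reference))
```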
[45] A neuromorphic model of the insect visual system for natural image processing cs.CV | cs.NEPDF
Adam D. Hines, Karin Nordström, Andrew B. Barron
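TL;DR: This paper proposes a neuromorphic model inspired by the insect visual system that converts dense visual input into sparse, discriminative codes using a fully self-supervised contrastive objective, and validates it on flower recognition and natural image benchmarks.
Details
Motivation: Existing computational models prioritize task performance and neglect biologically grounded processing pathways. This paper aims to build a general bio-inspired model that follows the principles of the insect visual system more closely.
Result: In a simulated localization task, the model outperforms a simple image downsampling baseline, and it produces reliable sparse codes on flower recognition and natural image benchmarks.
Insight: The contribution is combining a neuromorphic visual processing pathway with fully self-supervised contrastive learning, enabling representation learning without labeled data and reuse across tasks. The model is implemented as both an artificial neural network and a spiking neural network, which adds deployment flexibility.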
Abstract: Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.
[46] Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors cs.CVPDF
Soham Pahari, Sandeep C. Kumain
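TL;DR: This paper proposes SemGeo-AttentionNet, a dual-stream architecture for modeling human visual attention on the surfaces of 3D objects. Through asymmetric cross-modal fusion, it combines diffusion-based semantic priors from geometry-conditioned multi-view rendering with point cloud transformers for geometric processing, linking bottom-up geometric processing with top-down semantic recognition. The framework is also extended to temporal scanpath generation via reinforcement learning, with the first formulation that incorporates inhibition-of-return dynamics on 3D mesh topology.
Details
Motivation: Existing 3D saliency methods rely on hand-crafted geometric features or on learning-based approaches that lack semantic awareness, so they cannot explain why humans fixate on regions that are semantically important but geometrically unremarkable. This paper uses a cognitively motivated architecture to model human visual attention as the interplay of geometric processing and semantic recognition.
Result: Evaluation on the SAL3D, NUS3D, and 3DVA datasets shows substantial improvements, validating the architecture's ability to model human visual attention on 3D surfaces.
Insight: The contributions are an explicit formalization of the dichotomy in visual attention, a cross-attention mechanism in which geometric features query semantic content, and the first temporal scanpath generation framework that respects 3D mesh topology and includes inhibition-of-return dynamics. Viewed objectively, the deep fusion of diffusion-generated semantic priors with geometric processing offers a new way to address the lack of semantics in 3D saliency prediction.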
Abstract: Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.
[47] Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters cs.CV | cs.ROPDF
Yuxiang Zhao, Yirong Yang, Yanqing Zhu, Yanfen Shen, Chiyu Wang
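TL;DR: This paper introduces a new task, out-to-in prior-free instruction-driven embodied navigation, to address the limitations of existing methods in seamless outdoor-to-indoor transitions. The authors design a vision-centric navigation framework that uses image-based prompts for decision-making and create the first open-source dataset for the task, generated with trajectory-conditioned video synthesis. Experiments show that the method outperforms state-of-the-art baselines on key metrics such as success rate and path efficiency.
Details
Motivation: Existing embodied navigation methods are usually confined to either indoor or outdoor environments and rely on strong assumptions such as precise coordinate systems. They cannot provide fine-grained navigation from outdoors through a specific building entrance, which limits their practical usefulness. This paper removes the dependence on external priors so that the agent navigates using only egocentric visual observations and instructions.
Result: In extensive experiments, the proposed method consistently outperforms state-of-the-art baselines on key metrics such as success rate and path efficiency, demonstrating its effectiveness on prior-free outdoor-to-indoor navigation.
Insight: The contributions are the new prior-free outdoor-to-indoor instruction-driven navigation task, a vision-centric framework that uses image prompts for decision-making, and the first open-source dataset for the task, built with trajectory-conditioned video synthesis. Together they offer a new approach to seamless transition navigation in real deployments.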
Abstract: Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
[48] ChatUMM: Robust Context Tracking for Conversational Interleaved Generation cs.CVPDF
Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang
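TL;DR: ChatUMM is a conversational unified multimodal model that sustains interleaved text-image generation through robust context tracking, addressing the limitation of existing unified multimodal models to single-turn interaction.
Details
Motivation: Existing unified multimodal models are constrained by a single-turn interaction paradigm and cannot act as assistants in continuous dialogue, so models capable of multi-turn interleaved conversation are needed.
Result: On visual understanding and instruction-guided editing benchmarks, ChatUMM achieves state-of-the-art performance among open-source unified models while maintaining competitive fidelity in text-to-image generation.
Insight: The contributions are an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline that converts single-turn datasets into fluid dialogues in three stages, which improves robustness in complex multi-turn scenarios.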
Abstract: Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
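[49] Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction cs.CV
Zizhan Guo, Yi Feng, Mengtan Zhang, Haoran Zhang, Wei Ye
TL;DR: This paper redesigns the benchmark for unsupervised monocular 3D occupancy prediction. By analyzing the variables in the volume rendering process, it identifies the most physically consistent representation of occupancy probability and revises existing evaluation protocols to align with voxel-wise 3D occupancy ground truth. It also proposes an occlusion-aware polarization mechanism that uses multi-view visual cues to better separate occupied from free space in occluded regions.
Details
Motivation: Existing unsupervised methods train a neural radiance field but treat the network outputs directly as occupancy probabilities at evaluation time, so the training and evaluation protocols are inconsistent. In addition, the widely used 2D ground truth cannot reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints.
Result: Extensive experiments show that the method significantly outperforms existing unsupervised methods and matches the performance of supervised ones.
Insight: The core contribution is a re-examined and formalized benchmark for unsupervised 3D occupancy prediction that keeps training and evaluation consistent. The proposed occlusion-aware polarization mechanism adds explicit geometric constraints in occluded regions through multi-view information, mitigating the ambiguity there.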
Abstract: Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
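[50] DreamHome-Pano: Design-Aware and Conflict-Free Panoramic Interior Generation cs.CV
Lulu Chen, Yijiang Hu, Yuanqing Liu, Yulong Li, Yue Yang
TL;DR: This paper presents DreamHome-Pano, a framework for controllable, high-fidelity panoramic interior design generation. A Prompt-LLM converts layout constraints and style references into professional descriptive prompts, and a Conflict-Free Control architecture prevents style attributes from degrading the geometric precision of the layout.
Details
Motivation: Existing multi-condition generative frameworks struggle to reconcile rigid architectural constraints with specific style preferences in interior design, which leads to "condition conflicts".
Result: Experiments show that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust, professional-grade solution for panoramic interior visualization.
Insight: The contributions are a Prompt-LLM that acts as a semantic bridge for precise cross-modal alignment, and a Conflict-Free Control architecture with structure-aware geometric priors and a multi-condition decoupling strategy that suppresses style interference with the spatial layout. The work also builds a panoramic interior benchmark and a multi-stage training pipeline (SFT and RL).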
Abstract: In modern interior design, the generation of personalized spaces frequently necessitates a delicate balance between rigid architectural structural constraints and specific stylistic preferences. However, existing multi-condition generative frameworks often struggle to harmonize these inputs, leading to “condition conflicts” where stylistic attributes inadvertently compromise the geometric precision of the layout. To address this challenge, we present DreamHome-Pano, a controllable panoramic generation framework designed for high-fidelity interior synthesis. Our approach introduces a Prompt-LLM that serves as a semantic bridge, effectively translating layout constraints and style references into professional descriptive prompts to achieve precise cross-modal alignment. To safeguard architectural integrity during the generative process, we develop a Conflict-Free Control architecture that incorporates structural-aware geometric priors and a multi-condition decoupling strategy, effectively preventing stylistic interference from eroding the spatial layout. Furthermore, we establish a comprehensive panoramic interior benchmark alongside a multi-stage training pipeline, encompassing progressive Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Experimental results demonstrate that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust and professional-grade solution for panoramic interior visualization.
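[51] FloorplanVLM: A Vision-Language Model for Floorplan Vectorization cs.CV
Yuanqing Liu, Ziming Yang, Yulong Li, Yue Yang
TL;DR: FloorplanVLM is a vision-language model framework that converts raster floorplans into engineering-grade vector graphics. It recasts vectorization as image-conditioned sequence modeling and directly outputs JSON sequences representing the global topology. With a large-scale dataset and a progressive training strategy, it achieves precise, holistic constraint satisfaction for complex geometries such as slanted walls and curved arcs.
Details
Motivation: Converting raster floorplans to vector graphics is difficult because of complex topology and strict geometric constraints. Pixel-based methods rely on fragile heuristics, and query-based Transformer methods generate fragmented rooms; neither satisfies holistic geometric constraints well.
Result: On the proposed standardized benchmark FPBench-2K, FloorplanVLM shows exceptional structural validity, reaching 92.52% external-wall IoU and strong generalization to non-Manhattan architectures.
Insight: The contribution is recasting floorplan vectorization as a "pixels-to-sequence" paradigm that captures global topology by generating structured JSON sequences directly. A large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K), combined with a progressive training strategy of Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO), address data hunger and geometric alignment.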
Abstract: Converting raster floorplans into engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. To address this, we present FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task. Unlike pixel-based methods that rely on fragile heuristics or query-based transformers that generate fragmented rooms, our model directly outputs structured JSON sequences representing the global topology. This ‘pixels-to-sequence’ paradigm enables the precise and holistic constraint satisfaction of complex geometries, such as slanted walls and curved arcs. To support this data-hungry approach, we introduce a scalable data engine: we construct a large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K) to balance geometric diversity and pixel-level precision. We then employ a progressive training strategy, using Supervised Fine-Tuning (SFT) for structural grounding and quality annealing, followed by Group Relative Policy Optimization (GRPO) for strict geometric alignment. To standardize evaluation on complex layouts, we establish and open-source FPBench-2K. Evaluated on this rigorous benchmark, FloorplanVLM demonstrates exceptional structural validity, achieving 92.52% external-wall IoU and robust generalization across non-Manhattan architectures.
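The abstract reports an external-wall IoU but does not say how it is computed. As a rough illustration only, and assuming the metric is an ordinary intersection-over-union on rasterized regions, the sketch below computes IoU between two sets of grid cells. The helper names `rectangle_cells` and `region_iou` and the toy rectangles are hypothetical and are not taken from the paper or from FPBench-2K.

```python
def rectangle_cells(x0, y0, x1, y1):
    """Grid cells covered by an axis-aligned rectangle (hypothetical helper)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def region_iou(predicted_cells, reference_cells):
    """Intersection-over-union of two sets of grid cells."""
    predicted, reference = set(predicted_cells), set(reference_cells)
    union = predicted | reference
    if not union:
        return 1.0  # two empty regions are treated as identical
    return len(predicted & reference) / len(union)


# Toy example: the prediction misses one row of a 10 x 10 reference outline.
reference = rectangle_cells(0, 0, 10, 10)   # 100 cells
predicted = rectangle_cells(0, 0, 10, 9)    # 90 cells, all inside the reference
iou = region_iou(predicted, reference)
```

Here the intersection has 90 cells and the union 100, so `iou` is 0.9. A vector-based variant would compute the same ratio on polygon areas instead of cells.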
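[52] DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving cs.CV | cs.RO
Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye
TL;DR: This paper proposes DriveWorld-VLA, a framework addressing the insufficient unification of Vision-Language-Action (VLA) and world models in end-to-end autonomous driving. It tightly integrates VLA and the world model at the latent representation level, unifying scene-evolution modeling and action planning in a single latent space, so the planner benefits directly from holistic scene-evolution modeling and supports action-conditioned, controllable imagination at the feature level.
Details
Motivation: Existing methods fail to unify future scene evolution and action planning effectively within one architecture, mainly because latent states are insufficiently shared, which limits the influence of visual imagination on action decisions. The paper addresses this through unified latent-space world modeling.
Result: The method achieves state-of-the-art 91.3 PDMS on NAVSIMv1 and 86.8 EPDMS on NAVSIMv2, and a 3-second average collision rate of 0.16 on nuScenes.
Insight: The core contribution is using the world model's latent states as the core decision-making states of the VLA planner, so that world modeling happens entirely in the latent space. This supports feature-level, action-conditioned controllable imagination, avoids expensive pixel-level rollouts, and reduces reliance on dense annotated supervision.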
Abstract: End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, enabling the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 3-second average collision rate of 0.16 on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld-VLA.git.
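[53] Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance cs.CV | cs.CR
Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen
TL;DR: This paper proposes ForgeryEraser, a universal anti-forensics attack framework that, without access to the target AIGC detector, erases forgery traces in the feature space of a vision-language model (VLM) through a multi-modal guidance loss, thereby attacking existing image forgery detection systems.
Details
Motivation: Existing evaluation protocols largely overlook anti-forensics attacks and therefore cannot ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. This paper aims to fill that gap.
Result: Extensive experiments show that ForgeryEraser causes substantial performance degradation in advanced AIGC detectors on both global synthesis and local editing benchmarks, and induces explainable forensic models to produce explanations for forged images that are consistent with authentic images.
Insight: The paper reveals an adversarial vulnerability in AIGC detectors built on shared VLM backbones (e.g., CLIP) and designs a multi-modal guidance loss, instead of traditional logit-based optimization, to manipulate the feature space, yielding a universal attack that needs no white-box access.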
Abstract: The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute a universal anti-forensics attack without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
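The abstract describes the guidance loss only qualitatively: embeddings are pulled toward authentic anchors and pushed away from forgery anchors. The sketch below shows one possible attract/repel objective built on cosine similarity. The functional form, the name `guidance_loss`, and the two-dimensional toy anchors are assumptions for illustration and may differ from the loss actually used in ForgeryEraser.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def guidance_loss(image_embedding, authentic_anchors, forgery_anchors):
    """Assumed form: mean similarity to forgery anchors minus mean similarity
    to authentic anchors. Minimizing it attracts and repels as described."""
    attract = sum(cosine(image_embedding, a) for a in authentic_anchors) / len(authentic_anchors)
    repel = sum(cosine(image_embedding, f) for f in forgery_anchors) / len(forgery_anchors)
    return repel - attract


# Toy anchors standing in for text-derived embeddings (hypothetical values).
authentic = [[1.0, 0.0]]
forgery = [[0.0, 1.0]]
loss_forged_like = guidance_loss([0.1, 0.9], authentic, forgery)
loss_authentic_like = guidance_loss([0.9, 0.1], authentic, forgery)
```

An embedding near the authentic anchor receives the lower loss, so gradient steps on the image that reduce this quantity would move its embedding in the direction the abstract describes.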
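[54] LIBERO-X: Robustness Litmus for Vision-Language-Action Models cs.CV | cs.AI | cs.RO
Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai
TL;DR: This paper proposes LIBERO-X, a benchmark for systematically evaluating the robustness and generalization of Vision-Language-Action (VLA) models. It comprises a hierarchical evaluation protocol and a diverse training dataset, aiming to measure VLA performance more reliably under environmental perturbations and increasing task complexity.
Details
Motivation: Existing VLA benchmarks have insufficient evaluation protocols that fail to capture real-world distribution shifts, leading to limited or misleading assessments. A more systematic and reliable benchmark is needed to advance VLA models.
Result: Experiments on representative VLA models show significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding.
Insight: The benchmark is rebuilt from both evaluation and data perspectives. Combining a hierarchical evaluation protocol (targeting spatial generalization, object recognition, and task instruction understanding) with high-diversity training data (collected through human teleoperation) provides finer-grained analysis of performance degradation and a more reliable basis for evaluation.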
Abstract: Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
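[55] SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs cs.CV | cs.AI | cs.CL
Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang
TL;DR: SPARC is a modular vision-language model framework that explicitly decouples visual perception from reasoning in a two-stage pipeline (visual search first localizes relevant regions, then reasoning is conditioned on them). It addresses the error propagation and computational inefficiency that arise when conventional VLMs entangle perception and reasoning during test-time scaling.
Details
Motivation: Conventional vision-language models are brittle under test-time scaling, i.e., dynamically expanding compute at inference. Their unstructured chains-of-thought entangle perception and reasoning, producing long, disorganized contexts in which small perceptual errors can cascade into completely wrong answers, and good performance usually requires expensive reinforcement learning with hand-crafted rewards.
Result: On challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For example, on the V* VQA benchmark it improves the accuracy of Qwen3VL-4B by 6.7 percentage points, and on a challenging OOD task it surpasses "thinking with images" by 4.6 points while requiring a 200× lower token budget.
Insight: Inspired by sequential sensory-to-cognitive processing in the brain, the core idea is to separate perception and reasoning into modules that can be optimized and scaled independently. The separation supports asymmetric compute allocation, selective optimization (e.g., improving only the bottleneck perception stage), and context compression through low-resolution global search plus high-resolution local processing, which reduces total visual tokens and compute and so improves efficiency along with performance.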
Abstract: Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses “thinking with images” by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
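[56] An Integer Linear Programming Approach to Geometrically Consistent Partial-Partial Shape Matching cs.CV
Viktoria Ehm, Paul Roetzer, Florian Bernard, Daniel Cremers
TL;DR: This paper proposes an integer linear programming approach to geometrically consistent partial-partial 3D shape matching, which establishes correspondences between two partially observed shapes while also determining the unknown overlapping region.
Details
Motivation: Partial-partial 3D shape matching is a core challenge in many real-world settings such as 3D scanning, yet it has received little study. The main difficulty is computing accurate correspondences while simultaneously finding the unknown overlapping region.
Result: Experiments show that the method yields high-quality matchings in terms of both matching error and smoothness, and that it is more scalable than previous approaches.
Insight: This is the first integer linear programming approach designed specifically for partial-partial shape matching. It uses geometric consistency as a strong prior to robustly estimate the overlapping region and compute neighbourhood-preserving correspondences at the same time.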
Abstract: The task of establishing correspondences between two 3D shapes is a long-standing challenge in computer vision. While numerous studies address full-full and partial-full 3D shape matching, only a limited number of works have explored the partial-partial setting, very likely due to its unique challenges: we must compute accurate correspondences while at the same time find the unknown overlapping region. Nevertheless, partial-partial 3D shape matching reflects the most realistic setting, as in many real-world cases, such as 3D scanning, shapes are only partially observable. In this work, we introduce the first integer linear programming approach specifically designed to address the distinctive challenges of partial-partial shape matching. Our method leverages geometric consistency as a strong prior, enabling both robust estimation of the overlapping region and computation of neighbourhood-preserving correspondences. We empirically demonstrate that our approach achieves high-quality matching results both in terms of matching error and smoothness. Moreover, we show that our method is more scalable than previous formalisms.
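[57] DAVE: Distribution-aware Attribution via ViT Gradient Decomposition cs.CV | cs.AI | cs.HC | cs.LG
Adam Wróbel, Siddhartha Gairola, Jacek Tabor, Bernt Schiele, Bartosz Zieliński
TL;DR: This paper proposes DAVE, a distribution-aware attribution method based on ViT gradient decomposition, targeting the difficulty of producing stable, high-resolution attribution maps for Vision Transformers (ViTs). By structurally decomposing the input gradient and exploiting architectural properties of ViTs, it isolates the locally equivariant and stable components of the effective input-output mapping, reducing artifacts introduced by structures such as patch embeddings and attention routing.
Details
Motivation: ViTs have become a dominant architecture in computer vision, but generating stable, high-resolution attribution maps for them remains challenging. Architectural components such as patch embeddings and attention routing introduce structured artifacts into pixel-level explanations, so many existing methods rely on coarse patch-level attributions.
Result: The abstract reports no specific quantitative results, benchmarks, or performance levels (such as SOTA comparisons).
Insight: The contribution is a mathematically grounded attribution method for ViTs that uses gradient decomposition to isolate locally equivariant and stable mapping components, reducing architecture-induced artifacts and instability and giving a clearer, more reliable tool for explaining ViT decisions.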
Abstract: Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient Decomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping. It separates these from architecture-induced artifacts and other sources of instability.
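[58] CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling cs.CV
Yuxin He, An Li, Cheng Xue
TL;DR: This paper proposes CauCLIP, a causality-inspired vision-language framework for closing the domain gap between synthetic and real data in surgical video understanding. It uses CLIP to learn domain-invariant representations, perturbs domain-specific attributes with a frequency-based augmentation strategy, and applies a causal suppression loss to mitigate non-causal biases, so the model focuses on the stable causal factors in surgical workflows.
Details
Motivation: Surgical phase recognition suffers from limited annotated clinical videos and a large domain gap between synthetic and real surgical data; the goal is to train more robust models despite both.
Result: On the SurgVisDom hard adaptation benchmark, the method substantially outperforms all competing approaches, demonstrating the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
Insight: The contribution is combining causal reasoning with a vision-language model (CLIP), learning domain-invariant representations through frequency augmentation and a causal suppression loss. The causal framework offers a new approach to domain generalization and a reusable design for reducing reliance on domain-specific biases.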
Abstract: Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
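[59] PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks cs.CV
Junxian Li, Kai Liu, Leyang Chen, Weida Wang, Zhixin Wang
TL;DR: This paper proposes PlanViz, a benchmark for evaluating planning-oriented image generation and editing by unified multimodal models in computer-use tasks. It focuses on three sub-tasks (route planning, work diagramming, and web&UI displaying) and introduces PlanScore for comprehensive evaluation.
Details
Motivation: Unified multimodal models perform well at image generation and reasoning, but their potential in everyday computer-use planning tasks, which require spatial reasoning and procedural understanding, is underexplored, so a dedicated benchmark is needed.
Result: The experiments reveal key limitations of current models on this topic and opportunities for future research; the abstract gives no specific quantitative results or SOTA comparisons.
Insight: The contributions are an image-generation benchmark for computer-use planning tasks (routes, workflow diagrams, UI) and PlanScore, a task-adaptive score that combines correctness, visual quality, and efficiency, addressing the challenges of data quality and comprehensive evaluation.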
Abstract: Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks that frequently arise in daily life and require planning steps. Specifically, three new sub-tasks are designed: route planning, work diagramming, and web&UI displaying. We address the challenge of ensuring data quality by curating human-annotated questions and reference images and by applying a quality control process. To meet the challenge of comprehensive and exact evaluation, a task-adaptive score, PlanScore, is proposed. The score helps in understanding the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.
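[60] CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis cs.CV | cs.HC | cs.LG
Yonghao Si, Xingyuan Zeng, Zhao Chen, Libin Zheng, Caleb Chen Cao
TL;DR: This paper introduces CytoCrowd, a new public benchmark dataset for cytology image analysis. It contains 446 high-resolution images, each with raw, conflicting annotations from four independent pathologists and a separate, high-quality gold-standard ground truth established by a senior expert. The dataset serves both as a benchmark for standard computer vision tasks (such as object detection and classification) and as a realistic testbed for annotation aggregation algorithms that resolve expert disagreement.
Details
Motivation: Most high-quality annotated datasets either provide a single, clean ground truth, which hides real-world expert disagreement, or provide multiple annotations without an independent gold standard for objective evaluation. CytoCrowd aims to fill this gap with a more realistic benchmark for medical image analysis that reflects expert variability.
Result: The paper provides comprehensive baseline results for both of the dataset's main tasks (standard computer vision tasks and annotation aggregation evaluation). The experiments demonstrate the challenges CytoCrowd poses and establish its value as a resource for developing next-generation medical image analysis models.
Insight: The core contribution is the dataset's dual structure: raw, conflicting multi-expert annotations together with an independent gold-standard ground truth. The dataset can therefore be used to evaluate models on standard tasks and as a distinctive testbed for studying how to aggregate disagreeing expert opinions (that is, how to handle annotation uncertainty), which is closer to real medical diagnosis.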
Abstract: High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
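To make the annotation-aggregation use case concrete, here is a minimal sketch of the simplest possible aggregator, majority voting over four annotators, scored against a separate gold standard. The label names and toy records are invented for illustration; CytoCrowd's actual annotation format and the aggregation baselines evaluated in the paper are not described in the abstract.

```python
from collections import Counter


def majority_vote(labels):
    """Most common label, or None when the top two labels are tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # unresolved disagreement
    return counts[0][0]


def accuracy(predictions, gold):
    """Fraction of items matching the gold standard; unresolved ties count as wrong."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical labels from four annotators for three cells.
annotations = [
    ["abnormal", "abnormal", "normal", "abnormal"],
    ["normal", "normal", "normal", "abnormal"],
    ["abnormal", "normal", "abnormal", "normal"],  # 2-2 split
]
gold = ["abnormal", "normal", "abnormal"]

aggregated = [majority_vote(item) for item in annotations]
score = accuracy(aggregated, gold)
```

The 2-2 split on the third item is the kind of case where a plain vote cannot decide, and an independent gold standard is what makes it possible to score more elaborate aggregation schemes on such items.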
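[61] Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening cs.CV
Dong Chen, Zizhuang Wei, Jialei Xu, Xinyang Sun, Zonglin He
TL;DR: This paper proposes a new method for screening adolescent idiopathic scoliosis (AIS) from gait video. It introduces a new benchmark dataset, ScoliGait, and a multi-modal learning framework that integrates a clinical-prior-guided kinematic knowledge map with a latent attention pooling mechanism to fuse video, text, and knowledge map information, aiming at scalable, non-invasive, and clinically interpretable AIS assessment.
Details
Motivation: Conventional AIS screening is subjective, hard to scale, and dependent on specialized clinical expertise. Existing gait-video methods suffer from data leakage (clips from the same subject are reused, which inflates performance) and from oversimplified models that lack clinical interpretability.
Result: On the proposed ScoliGait dataset (1,572 training clips and 300 fully independent test clips), the method achieves a significant performance gain on a realistic, subject-independent benchmark and sets a new state of the art (SOTA).
Insight: The main contributions are: 1) ScoliGait, a new benchmark dataset that avoids data leakage and is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors; 2) a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map, improving the interpretability of the feature representation; 3) a latent attention pooling mechanism that fuses the video, text, and knowledge map modalities.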
Abstract: Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.
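The data-leakage issue raised in the abstract comes from letting clips of the same person appear in both training and test sets. A minimal sketch of a subject-independent split is shown below. The function name `split_by_subject`, the clip and subject identifiers, and the 20% test fraction are illustrative assumptions, not the procedure used to build ScoliGait.

```python
import random


def split_by_subject(clips, test_fraction=0.2, seed=0):
    """Split (clip_id, subject_id) pairs so that no subject is in both sets."""
    subjects = sorted({subject for _, subject in clips})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [clip for clip in clips if clip[1] not in test_subjects]
    test = [clip for clip in clips if clip[1] in test_subjects]
    return train, test


# Hypothetical data: 10 subjects with 3 gait clips each.
clips = [(f"clip_{s}_{k}", f"subject_{s}") for s in range(10) for k in range(3)]
train, test = split_by_subject(clips)

train_subjects = {subject for _, subject in train}
test_subjects = {subject for _, subject in test}
```

Splitting at the subject level, rather than shuffling clips directly, is what guarantees that `train_subjects` and `test_subjects` are disjoint; a clip-level shuffle would almost always place clips of the same person on both sides.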
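[62] Machine Learning for Detection and Severity Estimation of Sweetpotato Weevil Damage in Field and Lab Conditions cs.CV
Doreen M. Chelangat, Sudi Murindanyi, Bruce Mugizi, Paul Musana, Benard Yada
TL;DR: This study proposes a computer vision approach for automatically assessing sweetpotato weevil damage in field and laboratory conditions. In the field setting, classification models predict root-damage severity with a test accuracy of 71.43%. In the laboratory setting, a YOLO12 object detection model combined with root segmentation and a tiling strategy detects minute feeding holes with a mean average precision of 77.7%.
Details
Motivation: Traditional assessment of sweetpotato weevil damage relies on manual scoring, which is labour-intensive, subjective, and inconsistent, and this seriously hinders breeding programs for weevil-resistant sweetpotato varieties.
Result: The field classification model reaches a test accuracy of 71.43%, and the laboratory object detection model reaches a mean average precision of 77.7% on minute feeding holes, indicating that computer vision can provide efficient, objective, and scalable assessment tools.
Insight: The contributions are an assessment framework covering both field and laboratory settings, and a two-stage laboratory pipeline (root segmentation plus tiling) that improves small-object detection, offering a reusable technical path for automated phenotyping of crop pests and diseases.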
Abstract: Sweetpotato weevils (Cylas spp.) are considered among the most destructive pests impacting sweetpotato production, particularly in sub-Saharan Africa. Traditional methods for assessing weevil damage, predominantly relying on manual scoring, are labour-intensive, subjective, and often yield inconsistent results. These challenges significantly hinder breeding programs aimed at developing resilient sweetpotato varieties. This study introduces a computer vision-based approach for the automated evaluation of weevil damage in both field and laboratory contexts. In the field settings, we collected data to train classification models to predict root-damage severity levels, achieving a test accuracy of 71.43%. Additionally, we established a laboratory dataset and designed an object detection pipeline employing YOLO12, a leading real-time detection model. This methodology incorporated a two-stage laboratory pipeline that combined root segmentation with a tiling strategy to improve the detectability of small objects. The resulting model demonstrated a mean average precision of 77.7% in identifying minute weevil feeding holes. Our findings indicate that computer vision technologies can provide efficient, objective, and scalable assessment tools that align seamlessly with contemporary breeding workflows. These advancements represent a significant improvement in enhancing phenotyping efficiency within sweetpotato breeding programs and play a crucial role in mitigating the detrimental effects of weevils on food security.
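The tiling strategy mentioned in the abstract can be sketched as follows: a large image is cut into overlapping windows so that small objects occupy more pixels relative to the detector input, and detections are mapped back to full-image coordinates afterwards. The tile size, the overlap, and the helper names are assumptions chosen for illustration; the paper's actual settings are not given in the abstract, and no detector is called here.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of windows of size `tile` that cover [0, length).
    Neighbouring windows share at least `overlap` pixels."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last window sits flush with the edge
    return starts


def tile_boxes(width, height, tile=640, overlap=64):
    """All (x0, y0, x1, y1) tile windows for an image, in row-major order."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


def to_image_coords(box, tile_box):
    """Shift a detection box from tile coordinates to full-image coordinates."""
    x0, y0, x1, y1 = box
    offset_x, offset_y = tile_box[0], tile_box[1]
    return (x0 + offset_x, y0 + offset_y, x1 + offset_x, y1 + offset_y)


boxes = tile_boxes(1000, 400, tile=400, overlap=100)
hole = to_image_coords((10, 20, 18, 28), boxes[1])  # hypothetical detection
```

Because tiles overlap, the same feeding hole can be detected twice near a tile border, so a full pipeline would typically merge duplicates (for example with non-maximum suppression) after mapping boxes back.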
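[63] Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing cs.CV
Meng Lou, Stanley Yu, Yizhou Yu
TL;DR: This paper proposes AdaRoute, a new parameter-efficient fine-tuning method for adapting pre-trained vision models. It uses a mixture-of-experts architecture in which a dynamic parameter routing mechanism generates input-dependent weight matrices for each module in the network, yielding more customized and powerful feature representations for dense prediction tasks.
Details
Motivation: Existing parameter-efficient fine-tuning methods show limitations on complex dense prediction tasks, including input-agnostic modeling and redundant cross-layer representations. The paper aims to address these and reach performance comparable to full fine-tuning with a minimal number of trainable parameters.
Result: Extensive experiments on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation, demonstrate the superiority of AdaRoute.
Insight: The contributions are shared expert centers and a simple dynamic parameter routing mechanism that enables input-dependent low-rank adaptation. Sharing expert centers across layers promotes implicit cross-layer feature interaction and improves feature diversity.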
Abstract: Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: https://bit.ly/3NZcr0H.
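The routing idea in the abstract, aggregating expert parameter matrices into an input-dependent weight matrix, can be sketched in a few lines. The softmax router, the dense (non-sparse) mixing, and the 2 x 2 toy matrices are assumptions made for illustration; the abstract says the aggregation is selective but does not specify the router or the selection rule, so AdaRoute's actual mechanism may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def route_weights(x, router):
    """Input-dependent mixing weights, one per expert (assumed softmax router)."""
    return softmax(matvec(router, x))


def mix_experts(weights, experts):
    """Weighted sum of expert matrices: W(x) = sum_k w_k(x) * E_k."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * expert[i][j] for w, expert in zip(weights, experts)) for j in range(cols)]
        for i in range(rows)
    ]


# A toy expert center with two experts (hypothetical values).
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # identity
    [[0.0, 1.0], [1.0, 0.0]],  # swap
]
router = [[5.0, 0.0], [0.0, 5.0]]

x = [1.0, 0.0]
weights = route_weights(x, router)
dynamic_matrix = mix_experts(weights, experts)
y = matvec(dynamic_matrix, x)
```

Different inputs yield different `weights` and hence different `dynamic_matrix` values, which is the input-dependent behaviour the abstract contrasts with input-agnostic adapters. Reusing the same `experts` list in several layers would correspond to the shared expert center.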
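[64] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing cs.CV
Mohammadreza Salehi, Mehdi Noroozi, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick
TL;DR: This paper proposes RFDM (Residual Flow Diffusion Model), an efficient causal video editing method. It adapts a 2D image-to-image (I2I) diffusion model into a video-to-video (V2V) editing model and introduces a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction, focusing on changes between consecutive frames. The model edits variable-length videos frame by frame, with compute comparable to image models and independent of input video length.
Details
Motivation: Existing instruction-based video editing methods usually require fixed-length inputs and substantial compute, and autoregressive video generation remains under-explored for video editing. The goal is a causal editing model that handles variable-length videos efficiently.
Result: Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. The paper also proposes a new benchmark that better ranks state-of-the-art methods on editing tasks.
Insight: The contribution is adapting an I2I diffusion model to V2V editing through causal conditioning and introducing a residual flow diffusion mechanism that exploits the temporal redundancy of video by focusing on inter-frame changes, which preserves compute efficiency while delivering high-quality edits. Viewed objectively, residual prediction simplifies the denoising process and is an effective strategy for processing long video sequences efficiently.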
Abstract: Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model’s prediction at t-1. To leverage videos’ temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in: https://smsd75.github.io/RFDM_page/
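The residual idea can be illustrated without any diffusion machinery: instead of regressing the edited frame directly, the training target becomes the difference between the edited frame and the previous prediction, which is mostly zero when consecutive frames are similar. The sketch below shows only this bookkeeping on toy integer "frames"; it is an assumption-laden illustration and does not implement the paper's diffusion forward process.

```python
def residual_target(target_frame, previous_prediction):
    """Per-pixel difference the model would be asked to predict."""
    return [t - p for t, p in zip(target_frame, previous_prediction)]


def apply_residual(previous_prediction, residual):
    """Recover the current edited frame from the previous one plus a residual."""
    return [p + r for p, r in zip(previous_prediction, residual)]


def edit_causally(first_prediction, residuals):
    """Frame-by-frame rollout: each output depends only on the previous output."""
    predictions = [first_prediction]
    for residual in residuals:
        predictions.append(apply_residual(predictions[-1], residual))
    return predictions


# Toy 4-pixel frames (hypothetical values): only one pixel changes.
previous = [10, 10, 10, 10]
target = [10, 10, 12, 10]
residual = residual_target(target, previous)
frames = edit_causally(previous, [residual, [0, 0, 0, 0]])
```

The residual `[0, 0, 2, 0]` is sparse because the frames barely differ, which is the temporal redundancy the abstract refers to. Since each step needs only the previous prediction, the per-frame cost in this sketch does not grow with video length.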
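[65] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers cs.CV
Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo
TL;DR: This paper addresses the prompt forgetting phenomenon in Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation with a training-free prompt reinjection method: prompt representations from early layers are reinjected into deeper layers, which alleviates semantic forgetting and improves instruction following and generation quality on several benchmarks.
Details
Motivation: The paper identifies and verifies that in MMDiTs the prompt semantics in the text branch are progressively forgotten as network depth increases, which harms instruction following in text-to-image generation.
Result: On GenEval, DPG, and T2I-CompBench++, the method yields consistent gains on metrics covering instruction following, preference, aesthetics, and overall generation quality.
Insight: The paper reveals the prompt forgetting phenomenon in MMDiTs and proposes a simple, effective, training-free mitigation: reinjecting prompt representations across layers to preserve semantic consistency. This offers a new direction for improving text alignment in multimodal generative models.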
Abstract: Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text–image generation quality.
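[66] Seeing Beyond Redundancy: Task Complexity’s Role in Vision Token Specialization in VLLMs cs.CV
Darryl Hannan, John Cooper, Dylan White, Yijing Watkins
TL;DR: Using a synthetic benchmark dataset and a set of metrics, this paper studies how visual information processing in vision large language models (VLLMs) relates to task complexity. It finds a connection between task complexity and visual compression, and argues that high-complexity visual data is crucial for changing how VLLMs distribute their visual representation and for improving performance on complex visual tasks.
Details
Motivation: The vision capabilities of VLLMs, especially for fine-grained visual information and spatial reasoning, have consistently lagged behind their linguistic capabilities. Prior work often attributes this to visual redundancy, but the exact cause remains unclear. This paper examines how VLLMs process different types of visual information, which information they discard, and how task complexity affects this process.
Result: The study analyzes the nuances of visual redundancy with a synthetic benchmark dataset and custom metrics and, through fine-tuning experiments on several complex visual tasks, finds a connection between task complexity and visual compression.
Insight: The novelty lies in a synthetic dataset built specifically to probe various visual features, together with metrics for measuring redundancy, and in showing that task complexity is a key factor in how VLLMs distribute their visual representation. The practical takeaway for training the next generation of VLLMs is that a sufficient ratio of high-complexity visual data is needed to improve performance.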
Abstract: Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. Then, we explore fine-tuning VLLMs on a number of complex visual tasks to better understand how redundancy and compression change based upon the complexity of the data that a model is trained on. We find that there is a connection between task complexity and visual compression, implying that having a sufficient ratio of high complexity visual data is crucial for altering the way that VLLMs distribute their visual representation and consequently improving their performance on complex visual tasks. We hope that this work will provide valuable insights for training the next generation of VLLMs.
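[67] Reliable Mislabel Detection for Video Capsule Endoscopy Data cs.CV | cs.LG
Julia Werner, Julius Oexle, Oliver Bause, Maxime Le Floch, Franz Brinkmann
TL;DR: This paper proposes a framework for mislabel detection in medical datasets and validates it on the two largest publicly available Video Capsule Endoscopy datasets. The framework identifies mislabeled samples, and cleaning the data improves anomaly detection performance.
Details
Motivation: To address labeling errors in medical imaging datasets, which arise from the scarcity of expert annotators and ambiguous class boundaries, and thereby improve the classification performance of deep learning models.
Result: Validated on the largest publicly available Video Capsule Endoscopy datasets, the framework successfully detects incorrect labels, and anomaly detection performance after cleaning exceeds current baselines.
Insight: The contribution is a reliable mislabel detection pipeline for medical imaging whose effectiveness is verified through expert re-annotation, providing a practical tool for quality control of medical data.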
Abstract: The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging since annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries can often be ambiguous or difficult to define, which further complicates machine learning-based classification. In this paper, we address this problem by introducing a framework for mislabel detection in medical datasets. This is validated on the two largest publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of low-resolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and results in an improved anomaly detection performance after cleaning the datasets compared to current baselines.
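[68] CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation cs.CV
Kaiyi Huang, Yukun Huang, Yu Li, Jianhong Bai, Xintao Wang
TL;DR: CineScene is a framework for cinematic video generation that uses an implicit 3D-aware scene representation. With decoupled scene context, it synthesizes high-quality videos from images of a static environment, follows user-specified camera trajectories, and preserves scene consistency.
Details
Motivation: To address the high cost of building physical sets in cinematic video production by generating videos with dynamic subjects and controllable camera motion from images of a static environment.
Result: On the scene-decoupled dataset built with Unreal Engine 5, CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handles large camera movements, and generalizes to diverse environments.
Insight: Innovations include a context conditioning mechanism that injects implicit 3D-aware features, a random-shuffling strategy during training, and a paired dataset built with a game engine to address the lack of data.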
Abstract: Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subjects while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model’s robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.
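[69] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images cs.CV
Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan
TL;DR: MedMO is a multimodal large language model specialized for medical images. Through multi-stage training (cross-modal pretraining, instruction tuning, and reinforcement learning), it achieves comprehensive understanding and spatial grounding of medical images. It outperforms open-source baselines on multiple medical tasks, performs well on VQA, text QA, report generation, and disease localization, and shows strong cross-modality generalization.
Details
Motivation: The adoption of existing MLLMs in medicine is limited, mainly by insufficient domain coverage, inadequate modality alignment, and a lack of spatially grounded reasoning. MedMO aims to address these challenges with a purpose-built training pipeline and medical data, improving practicality in complex clinical scenarios.
Result: On VQA benchmarks, MedMO improves average accuracy by 13.7% over the baseline and is within 1.9% of the SOTA model Fleming-VL. On text QA, it gains 6.9% over the baseline and 14.5% over Fleming-VL. On medical report generation, both semantic and clinical accuracy improve significantly. On disease localization, IoU improves by 40.4% over the baseline and 37.0% over Fleming-VL. Generalization is validated across radiology, ophthalmology, and pathology-microscopy.
Insight: The paper's innovations are: 1) a multi-stage training pipeline designed for the medical domain, combining cross-modal alignment, multi-task instruction tuning, and reinforcement learning with verifiable rewards; 2) a reinforcement learning mechanism that combines factuality checks with a bounding-box GIoU reward, strengthening spatial grounding and step-by-step reasoning; 3) training exclusively on large-scale medical data, which improves domain adaptation. Viewed objectively, combining reinforcement learning with a spatial grounding reward offers a new approach to reliable reasoning in medical MLLMs.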
Abstract: Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO’s broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at https://genmilab.github.io/MedMO-Page
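The box-level GIoU reward mentioned in the abstract can be illustrated with a minimal sketch. The `giou` function follows the standard Generalized IoU definition for axis-aligned boxes; `grounding_reward`, its weights, and the way it mixes a factuality flag with the box term are hypothetical and are not taken from the paper.

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0.0 or enclose <= 0.0:
        return 0.0  # degenerate boxes: no meaningful overlap signal
    return inter / union - (enclose - union) / enclose


def grounding_reward(pred_box, gold_box, answer_is_factual, w_fact=0.5, w_box=0.5):
    """Hypothetical reward: weighted sum of a factuality flag and rescaled GIoU."""
    box_term = (giou(pred_box, gold_box) + 1.0) / 2.0  # map [-1, 1] to [0, 1]
    return w_fact * float(answer_is_factual) + w_box * box_term
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: it becomes more negative as the predicted box moves further from the target, which is one reason it is a common choice for localization rewards and losses.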
cs.SE
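[70] SVRepair: Structured Visual Reasoning for Automated Program Repair cs.SE | cs.AI | cs.CV
Xiaoxuan Tang, Jincheng Wang, Liwei Luo, Jingxuan Xu, Sheng Zhou
TL;DR: This paper proposes SVRepair, a multimodal framework for automated program repair (APR). A vision-language model called Structured Visual Representation (SVR) uniformly transforms heterogeneous visual artifacts such as screenshots and control-flow graphs into a semantic scene graph that captures GUI elements and their structural relations, providing normalized, code-relevant context for downstream repair. Building on this graph, SVRepair drives a coding agent to localize faults and generate patches, and introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-related regions to reduce irrelevant context and hallucinations.
Details
Motivation: Most existing LLM-based APR approaches are unimodal and fail to exploit visual information in bug reports, such as screenshots. Such visual information (for example, layout breakage or missing widgets) often carries critical diagnostic signals, but feeding dense visual inputs directly to a multimodal large language model (MLLM) causes context loss and noise, making it difficult to ground visual observations in precise fault localization and executable patches.
Result: State-of-the-art performance on multiple benchmarks: 36.47% accuracy on SWE-Bench M, 38.02% on MMCode, and 95.12% on CodeVision.
Insight: The core innovation is the uniform conversion of heterogeneous visual artifacts (such as screenshots and control-flow graphs) into a structured semantic scene graph, which gives multimodal program repair a normalized, code-relevant context representation. In addition, the iterative visual-artifact segmentation strategy progressively focuses on bug-related regions, suppressing irrelevant context, reducing hallucinations, and improving the precision of fault localization and patch generation.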
Abstract: Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR), yet most existing approaches remain unimodal and fail to leverage the rich diagnostic signals contained in visual artifacts such as screenshots and control-flow graphs. In practice, many bug reports convey critical information visually (e.g., layout breakage or missing widgets), but directly using such dense visual inputs often causes context loss and noise, making it difficult for MLLMs to ground visual observations into precise fault localization and executable patches. To bridge this semantic gap, we propose SVRepair, a multimodal APR framework with structured visual representation. SVRepair first fine-tunes a vision-language model, Structured Visual Representation (SVR), to uniformly transform heterogeneous visual artifacts into a semantic scene graph that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. Building on the graph, SVRepair drives a coding agent to localize faults and synthesize patches, and further introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-centered regions to suppress irrelevant context and reduce hallucinations. Extensive experiments across multiple benchmarks demonstrate state-of-the-art performance: SVRepair achieves 36.47% accuracy on SWE-Bench M, 38.02% on MMCode, and 95.12% on CodeVision, validating the effectiveness of SVRepair for multimodal program repair.
eess.IV
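[71] COSMOS: Coherent Supergaussian Modeling with Spatial Priors for Sparse-View 3D Splatting eess.IV | cs.CV | cs.GR
Chaeyoung Jeong, Kwangsu Kim
TL;DR: This paper proposes COSMOS to address the overfitting and structural degradation that 3D Gaussian Splatting (3DGS) suffers when trained on sparse input views. The method introduces 3D structure priors through supergaussian groupings based on local geometric cues and appearance features, fuses global and local spatial information through inter-group global self-attention and sparse local attention among individual Gaussians, and enforces an intra-group positional regularization to improve structural coherence and training stability. This yields more consistent 3D reconstruction in sparse-view settings without external depth supervision.
Details
Motivation: When trained on sparse views, 3DGS is optimized with photometric loss alone and lacks 3D structure priors, which leads to overfitting, structural degradation, and poor generalization to novel views.
Result: Experiments on the Blender and DTU datasets show that COSMOS surpasses existing state-of-the-art methods in sparse-view settings, without external depth supervision.
Insight: The novelty lies in generalizing the superpoint concept from 3D segmentation to supergaussian groupings as a way to introduce structure priors. The combination of global and local attention with intra-group positional regularization improves reconstruction consistency and stability under sparse views, and suggests a new form of structural constraint for point-based 3D reconstruction.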
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach for 3D reconstruction, providing explicit, point-based representations and enabling high-quality real-time rendering. However, when trained with sparse input views, 3DGS suffers from overfitting and structural degradation, leading to poor generalization on novel views. This limitation arises from its optimization relying solely on photometric loss without incorporating any 3D structure priors. To address this issue, we propose Coherent supergaussian Modeling with Spatial Priors (COSMOS). Inspired by the concept of superpoints from 3D segmentation, COSMOS introduces 3D structure priors by newly defining supergaussian groupings of Gaussians based on local geometric cues and appearance features. To this end, COSMOS applies inter-group global self-attention across supergaussian groups and sparse local attention among individual Gaussians, enabling the integration of global and local spatial information. These structure-aware features are then used for predicting Gaussian attributes, facilitating more consistent 3D reconstructions. Furthermore, by leveraging supergaussian-based grouping, COSMOS enforces an intra-group positional regularization to maintain structural coherence and suppress floaters, thereby enhancing training stability under sparse-view conditions. Our experiments on Blender and DTU show that COSMOS surpasses state-of-the-art methods in sparse-view settings without any external depth supervision.
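[72] Zero-shot Multi-Contrast Brain MRI Registration by Intensity Randomizing T1-weighted MRI (LUMIR25) eess.IV | cs.CV
Hengjie Liu, Yimeng Dou, Di Xu, Xinyi Fu, Dan Ruan
TL;DR: This paper summarizes the methods and results of the submission that placed first on the test set of the LUMIR25 challenge at Learn2Reg 2025. The task focuses on zero-shot registration under domain shift (high-field MRI, pathological brains, and various MRI contrasts), while the training data contain only in-domain T1-weighted brain MRI. After analyzing the LUMIR24 winning solutions, the authors adopt three simple, effective strategies so that a model trained on T1-weighted MRI generalizes to diverse contrasts: a multimodal loss based on the modality-independent neighborhood descriptor (MIND), intensity randomization for appearance augmentation, and lightweight instance-specific optimization (ISO) on the feature encoders at inference time.
Details
Motivation: To achieve zero-shot cross-domain registration (high-field MRI, pathological brains, and various MRI contrasts) when the training data contain only in-domain T1-weighted brain MRI, addressing the modality differences and domain shifts common in practice.
Result: First place on the LUMIR25 challenge test set. On the validation set, the method achieves reasonable T1-T2 registration accuracy while maintaining good deformation regularity.
Insight: Innovations include a MIND-based multimodal loss to handle differences between contrasts, intensity randomization for appearance augmentation to improve generalization, and lightweight instance-specific optimization (ISO) at inference time to adapt to individual samples. Viewed objectively, these strategies combine loss design, data augmentation, and inference-time optimization into a practical and efficient solution for cross-modal registration with limited training data.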
Abstract: In this paper, we summarize the methods and results of our submission to the LUMIR25 challenge in Learn2Reg 2025, which achieved 1st place overall on the test set. Extended from LUMIR24, this year’s task focuses on zero-shot registration under domain shifts (high-field MRI, pathological brains, and various MRI contrasts), while the training data comprise only in-domain T1-weighted brain MRI. We start with a meticulous analysis of LUMIR24 winners to identify the main contributors to good monomodal registration performance. To achieve good generalization with diverse contrasts from a model trained with T1-weighted MRI only, we employ three simple but effective strategies: (i) a multimodal loss based on the modality-independent neighborhood descriptor (MIND), (ii) intensity randomization for appearance augmentation, and (iii) lightweight instance-specific optimization (ISO) on feature encoders at inference time. On the validation set, our approach achieves reasonable T1-T2 registration accuracy while maintaining good deformation regularity.
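[73] Orientation-Robust Latent Motion Trajectory Learning for Annotation-free Cardiac Phase Detection in Fetal Echocardiography eess.IV | cs.CV
Yingyu Yang, Qianye Yang, Can Peng, Elena D’Alberti, Olga Patey
TL;DR: This paper proposes ORBIT, a self-supervised framework that automatically detects the end-diastolic (ED) and end-systolic (ES) cardiac phases in fetal echocardiography without manual annotation. The method learns a latent motion trajectory of cardiac deformation, uses the trajectory's turning points to identify cardiac phases, and is robust to different fetal heart orientations.
Details
Motivation: Fetal echocardiography is essential for detecting congenital heart disease, but without fetal electrocardiography, manually identifying ED and ES frames is time-consuming. Existing annotation-free methods are constrained by fixed heart-orientation assumptions, so an automated phase detection method that adapts to different fetal heart orientations is needed.
Result: Trained only on normal fetal echocardiography videos, ORBIT achieves consistent performance on normal cases (MAE = 1.9 frames for ED, 1.6 for ES) and congenital heart disease cases (MAE = 2.4 frames for ED, 2.1 for ES), outperforming existing annotation-free methods constrained by fixed orientation assumptions.
Insight: The novelty lies in using registration as the self-supervised task to learn a latent motion trajectory of cardiac deformation, and in using a geometric property of that trajectory (its turning points) to localize cardiac phases robustly, removing the dependence on fixed orientation assumptions and manual annotation. This offers a new self-supervised approach to temporal phase detection in medical image analysis.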
Abstract: Fetal echocardiography is essential for detecting congenital heart disease (CHD), facilitating pregnancy management, optimized delivery planning, and timely postnatal interventions. Among standard imaging planes, the four-chamber (4CH) view provides comprehensive information for CHD diagnosis, where clinicians carefully inspect the end-diastolic (ED) and end-systolic (ES) phases to evaluate cardiac structure and motion. Automated detection of these cardiac phases is thus a critical component toward fully automated CHD analysis. Yet, in the absence of fetal electrocardiography (ECG), manual identification of ED and ES frames remains a labor-intensive bottleneck. We present ORBIT (Orientation-Robust Beat Inference from Trajectories), a self-supervised framework that identifies cardiac phases without manual annotations under various fetal heart orientations. ORBIT employs registration as a self-supervision task and learns a latent motion trajectory of cardiac deformation, whose turning points capture transitions between cardiac relaxation and contraction, enabling accurate and orientation-robust localization of ED and ES frames across diverse fetal positions. Trained exclusively on normal fetal echocardiography videos, ORBIT achieves consistent performance on both normal (MAE = 1.9 frames for ED and 1.6 for ES) and CHD cases (MAE = 2.4 frames for ED and 2.1 for ES), outperforming existing annotation-free approaches constrained by fixed orientation assumptions. These results highlight the potential of ORBIT to facilitate robust cardiac phase detection directly from 4CH fetal echocardiography.
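The idea of reading cardiac phases off the turning points of a trajectory, and of scoring detections by MAE in frames, can be sketched on a toy one-dimensional signal. This is an illustration only: ORBIT learns a latent trajectory through registration, whereas the sketch assumes a scalar per-frame signal is already available. The function names, the nearest-frame matching in `mae_frames`, and the question of which extremum corresponds to ED or ES are all assumptions.

```python
def turning_points(signal):
    """Return (maxima, minima): frame indices where the discrete slope changes sign."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        prev_slope = signal[i] - signal[i - 1]
        next_slope = signal[i + 1] - signal[i]
        if prev_slope > 0 and next_slope <= 0:
            maxima.append(i)  # signal stops rising here
        elif prev_slope < 0 and next_slope >= 0:
            minima.append(i)  # signal stops falling here
    return maxima, minima


def mae_frames(predicted, reference):
    """Mean absolute error in frames, matching each prediction to its nearest reference frame."""
    return sum(min(abs(p - r) for r in reference) for p in predicted) / len(predicted)


# Toy periodic signal standing in for one coordinate of a motion trajectory.
toy_signal = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 2]
toy_maxima, toy_minima = turning_points(toy_signal)
```

In practice a real trajectory would be noisy, so some smoothing or a minimum spacing between detected turning points would likely be needed before this kind of sign-change test is reliable.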
cs.MM
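[74] Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space cs.MM | cs.AI | cs.CL | cs.CV
Zihang Wang, Siyue Zhang, Yilun Zhao, Jingyi Yang, Tingyu Song
TL;DR: This paper presents the first systematic study of converting multimodal diffusion language models (Multimodal dLLMs) into embedding models, and evaluates the embedding performance of state-of-the-art Multimodal dLLMs and autoregressive vision language models (Autoregressive VLMs) on three task categories: classification, visual question answering, and information retrieval.
Details
Motivation: Large diffusion language models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, with advantages such as bidirectional attention and parallel generation. This raises a critical and unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models?
Result: Multimodal dLLM embeddings generally underperform autoregressive VLMs. The stronger diffusion model, LaViDa, lags by 3.5, 2.5, and 4.4 points on classification, VQA, and retrieval respectively, while the other diffusion model, MMaDA, trails by more than 20 points on all tasks.
Insight: The paper is the first systematic study of the potential of Multimodal dLLMs as embedding models. Its experiments show that insufficient image-text alignment is the main cause of the limited embedding performance of diffusion models, which points to a direction for improving their representation learning.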
Abstract: Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. This progress naturally raises a critical yet unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models? To answer this, we present the first systematic study of converting Multimodal dLLMs into embedding models. We evaluate state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval. Our results show that Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. The stronger diffusion-based model, LaViDa, lags by only 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval tasks, whereas the other diffusion-based model, MMaDA, exhibits substantially larger performance gaps, exceeding 20 points across all tasks. Further analysis reveals insufficient image-text alignment in diffusion-based models, accounting for the observed limitations in their embedding performance.
cs.LG
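[75] Self-Improving World Modelling with Latent Actions cs.LG | cs.AI | cs.CL
Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen
TL;DR: This paper proposes SWIRL, a self-improving world modelling framework that treats actions as a latent variable and alternates between forward world modelling and inverse dynamics modelling, learning a world model from state-only sequences without costly action-labelled trajectories.
Details
Motivation: Internal world models support reasoning and planning in LLMs and VLMs, but learning them typically requires costly action-labelled trajectories. The aim is to learn from state-only sequences instead.
Result: Evaluated across multiple environments (single-turn and multi-turn open-world visual dynamics, and synthetic textual environments), SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
Insight: The novelty is a self-improving learning framework with actions as a latent variable, which alternates between variational information maximisation and ELBO maximisation and is trained with reinforcement learning (GRPO). It provides theoretical guarantees and an effective method for learning world models from state-only sequences.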
Abstract: Internal modelling of the world – predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ – is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model’s log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
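For readers unfamiliar with the two objectives named in the abstract, their textbook forms can be written in the abstract's notation as follows. These are the standard evidence lower bound and the standard variational lower bound on conditional mutual information; the paper's exact objectives may differ, and the prior $P(Z \mid X)$ over latent actions is a symbol assumed here for illustration.

```latex
% Evidence lower bound (ELBO) on the transition likelihood with latent action Z.
% P(Z | X) is an assumed prior over latent actions, not notation from the abstract.
\log P_\theta(Y \mid X)
  \;\ge\;
  \mathbb{E}_{Z \sim Q_\phi(Z \mid X, Y)}\big[\log P_\theta(Y \mid X, Z)\big]
  - \mathrm{KL}\big(Q_\phi(Z \mid X, Y)\,\|\,P(Z \mid X)\big)

% Variational lower bound on the conditional mutual information between
% next states Y and latent actions Z given prior states X, using Q_phi
% as the variational posterior over actions.
I(Y; Z \mid X)
  \;\ge\;
  H(Z \mid X) + \mathbb{E}_{X, Y, Z}\big[\log Q_\phi(Z \mid X, Y)\big]
```

Both bounds involve the log-probability of one model evaluated on samples from the other, which is consistent with the abstract's description of using the opposite frozen model's log-probability as a reward signal.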
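[76] Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning cs.LG | cs.AI | cs.CL
Nan Chen, Soledad Villar, Soufiane Hayou
TL;DR: This paper proposes Maximal-Update Adaptation (μA), a theoretical framework that determines how the learning rate in LoRA finetuning should scale with adapter rank. It identifies two regimes, depending on the initialization configuration: one where the learning rate stays invariant across ranks and one where it scales inversely with rank. It also enables learning rate transfer from LoRA finetuning to full finetuning, significantly reducing tuning cost.
Details
Motivation: The training dynamics of LoRA finetuning are complex, and how the optimal learning rate scales with adapter rank is unclear. As a result, the learning rate must be re-tuned whenever the rank changes, which adds to the tuning burden.
Result: Experiments across language, vision, vision-language, image generation, and reinforcement learning tasks validate the proposed scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
Insight: The novelty lies in extending the idea of Maximal-Update Parametrization (μP) from pretraining to LoRA finetuning, establishing a theoretical link between learning rate, model width, and adapter rank. This provides transferable scaling rules for efficient tuning and reduces the tuning overhead of full finetuning.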
Abstract: Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the “optimal” learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $\mu$A is inspired by the Maximal-Update Parametrization ($\mu$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision–language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
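The two regimes described in the abstract can be sketched as a small helper that transfers a learning rate tuned at one rank to another. The function name, the regime labels, and the exact `rank_ref / rank_new` factor are assumptions for illustration. The abstract does not state which initialization or LoRA scaling configuration falls into which regime, so the caller has to supply that.

```python
def transfer_lr(lr_ref, rank_ref, rank_new, regime):
    """Rescale a learning rate tuned at rank_ref for use at rank_new.

    regime is a hypothetical label:
      "invariant": the optimal learning rate is roughly constant across ranks
      "inverse":   the optimal learning rate scales inversely with rank
    """
    if rank_ref <= 0 or rank_new <= 0:
        raise ValueError("ranks must be positive")
    if regime == "invariant":
        return lr_ref
    if regime == "inverse":
        return lr_ref * rank_ref / rank_new
    raise ValueError(f"unknown regime: {regime!r}")


# Example: a learning rate tuned at rank 8, reused at rank 64.
lr_invariant = transfer_lr(1e-4, 8, 64, "invariant")
lr_inverse = transfer_lr(1e-4, 8, 64, "inverse")
```

The practical point is that, once the regime is known, a single sweep at a small rank could stand in for sweeps at every rank.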
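[77] CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers cs.LG | cs.CV
Boxiang Zhang, Baijian Yang
TL;DR: This paper proposes CORP, a closed-form one-shot structured pruning framework for Vision Transformers (ViTs). The method needs no labels, gradients, or fine-tuning, and uses only a small unlabeled calibration set. It formulates structured pruning as a representation recovery problem and derives closed-form ridge regression solutions that compensate for the representation error introduced by pruning, preserving accuracy while substantially reducing compute and memory cost.
Details
Motivation: Vision Transformers are accurate but expensive in compute and memory. Existing structured pruning methods usually rely on retraining or multi-stage optimization, which limits flexibility in post-training deployment. This paper aims for efficient pruning under strict post-training constraints, with no additional optimization.
Result: Experiments on ImageNet with DeiT models show that one-shot structured pruning without compensation causes severe accuracy degradation, whereas CORP preserves accuracy under aggressive sparsity. For example, on DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. Pruning completes in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
Insight: The novelty lies in formalizing structured pruning as a representation recovery problem: removed activations and attention logits are modeled as affine functions of retained components, and closed-form ridge regression solutions fold the compensation directly into the weights, minimizing expected representation error under the calibration distribution without fine-tuning. This gives an interpretable mathematical framework for efficient post-training pruning.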
Abstract: Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose CORP, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
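The compensation idea (model a removed activation as an affine function of retained ones, solve by ridge regression in closed form, then fold the result into the next layer's weights) can be shown in its smallest possible form: one removed unit, one retained unit, scalar weights. This toy is only a sketch of the general technique; CORP works on whole MLP dimensions and attention substructures, and every name below is hypothetical.

```python
def fit_affine_ridge(kept, removed, lam=1e-3):
    """Closed-form ridge fit removed ~ w * kept + b (intercept b is not penalized)."""
    n = len(kept)
    mean_k = sum(kept) / n
    mean_r = sum(removed) / n
    var_k = sum((k - mean_k) ** 2 for k in kept)
    cov_kr = sum((k - mean_k) * (r - mean_r) for k, r in zip(kept, removed))
    w = cov_kr / (var_k + lam)  # lam shrinks w towards zero
    b = mean_r - w * mean_k
    return w, b


def fold_compensation(v_kept, v_removed, bias, w, b):
    """Fold the affine predictor into the next layer.

    The original layer computes v_kept * kept + v_removed * removed + bias.
    Substituting removed ~ w * kept + b gives new weights on the kept unit only.
    """
    return v_kept + v_removed * w, bias + v_removed * b


# Calibration activations where the removed unit is exactly 2 * kept + 1.
calib_kept = [0.0, 1.0, 2.0, 3.0]
calib_removed = [1.0, 3.0, 5.0, 7.0]
w_fit, b_fit = fit_affine_ridge(calib_kept, calib_removed, lam=0.0)
v_new, bias_new = fold_compensation(0.5, 0.25, 0.1, w_fit, b_fit)
```

With perfectly redundant activations, as in this toy, the pruned layer reproduces the original output exactly; with real activations the fit is approximate and the ridge penalty `lam` guards against ill-conditioned calibration statistics.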
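[78] AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models cs.LG | cs.AI | cs.CV
Yuming Li, Qingyu Li, Chengyu Bai, Xiangyang Luo, Zeyue Xue
TL;DR: This paper proposes AEGPO, an adaptive entropy-guided policy optimization method for improving reinforcement learning alignment of diffusion models. Based on an analysis of attention entropy dynamics during training, it uses the relative change and the peaks of entropy as dual signals: at the global level it dynamically allocates compute to prompts with high learning value, and at the local level it guides exploration toward critical denoising steps, yielding more efficient and effective policy optimization.
Details
Motivation: Existing reinforcement learning from human feedback methods (such as GRPO) use uniform, static sampling strategies when aligning diffusion models. They ignore the large variation in learning value across samples and the dynamic nature of critical exploration moments, which makes optimization inefficient.
Result: Experiments on text-to-image generation show that, compared with standard GRPO variants, AEGPO significantly accelerates convergence and achieves better alignment performance.
Insight: The novelty lies in using attention entropy as a dual-signal proxy: the relative change in entropy (ΔEntropy) indicates sample learning value, and the peaks of entropy (Entropy(t)) identify critical exploration moments. On this basis the method builds global and local adaptive optimization strategies that focus compute precisely.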
Abstract: Reinforcement learning from human feedback (RLHF) shows promise for aligning diffusion and flow models, yet policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. These methods treat all prompts and denoising steps uniformly, ignoring substantial variations in sample learning value as well as the dynamic nature of critical exploration moments. To address this issue, we conduct a detailed analysis of the internal attention dynamics during GRPO training and uncover a key insight: attention entropy can serve as a powerful dual-signal proxy. First, across different samples, the relative change in attention entropy (ΔEntropy), which reflects the divergence between the current policy and the base policy, acts as a robust indicator of sample learning value. Second, during the denoising process, the peaks of absolute attention entropy (Entropy(t)), which quantify attention dispersion, effectively identify critical timesteps where high-value exploration occurs. Building on this observation, we propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy. At the global level, AEGPO uses ΔEntropy to dynamically allocate rollout budgets, prioritizing prompts with higher learning value. At the local level, it exploits the peaks of Entropy(t) to guide exploration selectively at critical high-dispersion timesteps rather than uniformly across all denoising steps. By focusing computation on the most informative samples and the most critical moments, AEGPO enables more efficient and effective policy optimization. Experiments on text-to-image generation tasks demonstrate that AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.
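The two signals in the abstract can be made concrete with a small sketch. `attention_entropy` is the standard Shannon entropy of a normalized attention row. Everything else is an assumption made for illustration: the relative-change formula, the proportional budget split, and the top-k rule for picking timesteps are hypothetical stand-ins, not AEGPO's actual procedures.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative attention weights after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def relative_entropy_change(current, base):
    """Hypothetical ΔEntropy: relative gap between current-policy and base-policy entropy."""
    return abs(current - base) / base


def allocate_rollouts(scores, total_budget, min_per_prompt=1):
    """Hypothetical global step: split a rollout budget in proportion to per-prompt scores."""
    n = len(scores)
    spare = total_budget - min_per_prompt * n
    if spare < 0:
        raise ValueError("budget too small for the per-prompt minimum")
    score_sum = sum(scores)
    shares = [spare * s / score_sum if score_sum > 0 else spare / n for s in scores]
    alloc = [min_per_prompt + int(share) for share in shares]
    # Hand out rollouts lost to rounding, largest fractional share first.
    leftover = total_budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


def peak_timesteps(entropy_by_step, k):
    """Hypothetical local step: indices of the k denoising steps with the highest entropy."""
    ranked = sorted(range(len(entropy_by_step)), key=lambda t: entropy_by_step[t], reverse=True)
    return sorted(ranked[:k])
```

A uniform attention row has maximal entropy (log of its length) and a one-hot row has zero entropy, which is why entropy works as a measure of attention dispersion.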
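[79] Vision Transformer Finetuning Benefits from Non-Smooth Components cs.LG | cs.CV | stat.ML
Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko
TL;DR: This paper studies how the plasticity of Vision Transformer (ViT) components (their ability to adapt to input changes, the opposite of smoothness) affects finetuning performance. Through theoretical analysis and experiments, it finds that high plasticity (low smoothness) in the attention modules and feedforward layers leads to better finetuning results, challenging the conventional assumption that smoothness is beneficial.
Details
Motivation: The role of smoothness in transformer architectures has been studied extensively for generalization, training stability, and adversarial robustness, but its role in transfer learning (finetuning in particular) remains unclear. This paper analyzes how the plasticity of ViT components, meaning the sensitivity of their outputs to input changes, affects finetuning performance.
Result: Comprehensive experiments on benchmarks such as ImageNet-1K show that prioritizing the adaptation of high-plasticity (low-smoothness) components, such as attention modules and feedforward layers, consistently improves finetuning performance. This provides principled guidance for finetuning strategies.
Insight: The main contribution is to guide ViT finetuning from the perspective of component plasticity rather than smoothness, challenging the common assumption that smoothness is always beneficial and offering a new view of the functional properties of transformers. Viewed objectively, using plasticity as a metric for choosing which components to finetune is a novel and practically useful insight.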
Abstract: The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.
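The abstract defines plasticity as an average rate of change that captures sensitivity to input perturbation. One plausible empirical estimate, sketched below, averages the ratio of output change to input change over small random perturbations. The exact definition in the paper may differ; the perturbation scheme, the Euclidean norm, and all names here are assumptions, and the components are toy functions on plain lists rather than ViT modules.

```python
import math
import random


def plasticity(component, inputs, eps=1e-2, n_dirs=8, seed=0):
    """Average of ||f(x + d) - f(x)|| / ||d|| over inputs x and random directions d."""
    rng = random.Random(seed)
    ratios = []
    for x in inputs:
        fx = component(x)
        for _ in range(n_dirs):
            direction = [rng.gauss(0.0, 1.0) for _ in x]
            norm = math.sqrt(sum(v * v for v in direction))
            delta = [eps * v / norm for v in direction]  # perturbation of length eps
            fxd = component([a + b for a, b in zip(x, delta)])
            change = math.sqrt(sum((p - q) ** 2 for p, q in zip(fxd, fx)))
            ratios.append(change / eps)
    return sum(ratios) / len(ratios)


# Two toy "components": a steep linear map and a flatter one.
steep = lambda x: [3.0 * v for v in x]
flat = lambda x: [0.5 * v for v in x]
sample_inputs = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
steep_score = plasticity(steep, sample_inputs)
flat_score = plasticity(flat, sample_inputs)
```

Under this estimate a higher score means the component's output moves more per unit of input change, that is, lower smoothness, which matches the abstract's statement that high plasticity implies low smoothness.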
cs.AI
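[80] Large Language Model Reasoning Failures cs.AI | cs.CL | cs.LG
Peiyang Song, Pengrui Han, Noah Goodman
TL;DR: This paper is the first comprehensive survey of reasoning failures in large language models. The authors propose a new categorization framework that divides reasoning into embodied and non-embodied types and classifies reasoning failures into three kinds: fundamental failures intrinsic to model architectures, limitations in specific application domains, and robustness issues caused by minor variations.
Details
Motivation: Although LLMs achieve strong results on reasoning tasks, significant reasoning failures persist even in seemingly simple scenarios. The authors conduct this comprehensive survey to understand and address these shortcomings systematically.
Result: As a survey, the paper proposes no new model and runs no quantitative experiments. It consolidates existing work in the area and releases a GitHub repository as an entry point to the field.
Insight: The main contribution is a systematic categorization framework for understanding and classifying LLM reasoning failures, which combines reasoning type (embodied or non-embodied) with failure type (fundamental, application-specific, or robustness) and gives future research a structured perspective and guidance.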
Abstract: Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.
[81] LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models cs.AI | cs.CL
Brian Rabern, Philipp Mondorf, Barbara Plank
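TL;DR: This paper introduces LogicSkills, a benchmark that evaluates three core logical skills of large language models in formal reasoning: formal symbolization, countermodel construction, and validity assessment. Items are drawn from the two-variable fragment of first-order logic and presented in both natural English and a Carroll-style language. Leading models perform well on validity assessment but poorly on symbolization and countermodel construction, suggesting that they may rely on surface patterns rather than genuine symbolic or rule-based reasoning.
Details
Motivation: Although LLMs perform strongly on a variety of logical reasoning benchmarks, it remains unclear which core logical skills they truly master, so a structured benchmark is needed to isolate and evaluate the basic abilities involved in formal reasoning.
Result: On LogicSkills, leading models score high on validity assessment but substantially lower on formal symbolization and countermodel construction, revealing limitations in deeper logical reasoning.
Insight: The contribution is a unified benchmark that evaluates LLMs' formal reasoning skills at fine granularity and shows that models may over-rely on surface patterns rather than genuine symbolic reasoning, pointing to directions for improving their logical abilities.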
Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization (translating premises into first-order logic); (ii) countermodel construction (formulating a finite structure in which all premises are true while the conclusion is false); and (iii) validity assessment (deciding whether a conclusion follows from a given set of premises). Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
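To make the countermodel construction skill concrete, here is a minimal sketch of what a countermodel is: a finite structure in which every premise holds and the conclusion fails. It is not the benchmark's code. The abstract says the authors verify items with Z3; this sketch instead brute-forces small domains, and it is restricted to unary predicates for brevity, whereas the benchmark's two-variable fragment also allows binary relations. The function name `find_countermodel` and the example syllogisms are illustrative assumptions.

```python
from itertools import product


def find_countermodel(premises, conclusion, predicates, max_size=3):
    """Search small finite structures for one where all premises hold
    and the conclusion fails. Returns (domain_size, model) or None.

    Assumption: only unary predicates, each stored as a tuple of booleans
    indexed by domain element. Returning None only means that no
    countermodel exists up to max_size elements.
    """
    for size in range(1, max_size + 1):
        domain = range(size)
        # All possible extensions of one unary predicate over this domain.
        extensions = list(product([False, True], repeat=size))
        for choice in product(extensions, repeat=len(predicates)):
            model = dict(zip(predicates, choice))
            premises_hold = all(p(domain, model) for p in premises)
            if premises_hold and not conclusion(domain, model):
                return size, model
    return None


def all_a_are_b(domain, m):
    return all((not m["A"][x]) or m["B"][x] for x in domain)


def some_b_are_c(domain, m):
    return any(m["B"][x] and m["C"][x] for x in domain)


def some_a_are_c(domain, m):
    return any(m["A"][x] and m["C"][x] for x in domain)


# Invalid argument: "All A are B; some B are C; therefore some A are C."
invalid = find_countermodel([all_a_are_b, some_b_are_c], some_a_are_c, ["A", "B", "C"])

# Valid argument: "All A are B; some A are C; therefore some B are C."
valid = find_countermodel([all_a_are_b, some_a_are_c], some_b_are_c, ["A", "B", "C"])

print(invalid)
print(valid)
```

For the invalid argument the search finds a one-element structure in which nothing is A and the single element is both B and C. For the valid argument no countermodel is found within the size bound.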
[82] AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research cs.AI | cs.CL
Yishan Li, Wentong Chen, Yukun Yan, Mingwei Li, Sen Mei
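TL;DR: This paper presents AgentCPM-Report, a lightweight, high-performing local system for generating deep research reports. It combines a framework that mirrors the human writing process with an 8B-parameter research agent. Through a Writing As Reasoning Policy (WARP), it dynamically revises the outline during report generation, alternating between Evidence-Based Drafting and Reasoning-Driven Deepening to support information acquisition, knowledge refinement, and iterative outline evolution.
Details
Motivation: Generating deep research reports requires large-scale information acquisition and insight-driven analysis, which challenges current language models. Most existing approaches follow a plan-then-write paradigm whose performance depends heavily on the quality of the initial outline, yet constructing a comprehensive outline itself demands strong reasoning ability. As a result, existing deep research systems rely heavily on closed-source or online large models, which creates deployment barriers and raises safety and privacy concerns for user data.
Result: Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym show that AgentCPM-Report outperforms leading closed-source systems, with substantial gains on the Insight metric.
Insight: The main contributions are: (1) the Writing As Reasoning Policy (WARP), which lets the model revise the outline dynamically during generation and breaks the rigid plan-then-write workflow; (2) a framework that alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, iteratively integrating information acquisition with knowledge refinement; and (3) a Multi-Stage Agentic Training strategy (cold-start, atomic skill RL, and holistic pipeline RL) that equips small models with deep research ability, offering a viable path toward local, lightweight, high-performing research systems.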
Abstract: Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the initial outline. However, constructing a comprehensive outline itself demands strong reasoning ability, causing current deep research systems to rely almost exclusively on closed-source or online large models. This reliance raises practical barriers to deployment and introduces safety and privacy concerns for user-authored data. In this work, we present AgentCPM-Report, a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process and an 8B-parameter deep research agent. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines during report generation. Under this policy, the agent alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, jointly supporting information acquisition, knowledge refinement, and iterative outline evolution. To effectively equip small models with this capability, we introduce a Multi-Stage Agentic Training strategy, consisting of cold-start, atomic skill RL, and holistic pipeline RL. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems, with substantial gains in Insight.
[83] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention cs.AI | cs.CV
Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen
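TL;DR: This paper proposes OmniVideo-R1, a reinforced framework that strengthens audio-visual reasoning through query intention and modality attention, with the goal of improving model performance on mixed-modality understanding tasks.
Details
Motivation: Existing omnivideo models still struggle with audio-visual understanding tasks, whereas humans perceive the world through modalities that work together, so models' mixed-modality reasoning needs to be strengthened.
Result: Extensive experiments on multiple benchmarks show that OmniVideo-R1 consistently outperforms strong baselines, demonstrating its effectiveness and robust generalization.
Insight: The contributions are query-intensive grounding based on self-supervised learning and modality-attentive fusion based on contrastive learning, which enable the model to reason with omnimodal cues.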
Abstract: While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to “think with omnimodal cues” by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
[84] Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion cs.AI | cs.CV
Longhui Ma, Di Zhao, Siwei Wang, Zhao Lv, Miao Wang
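TL;DR: Trifuse is an attention-based GUI grounding framework that integrates attention, OCR-derived textual cues, and icon-level caption semantics. Its Consensus-SinglePeak fusion strategy enforces cross-modal agreement while retaining sharp localization peaks, yielding strong GUI element grounding without task-specific fine-tuning.
Details
Motivation: Methods that fine-tune multimodal large language models (MLLMs) are data-intensive and generalize poorly to unseen interfaces, while attention-based alternatives are unreliable because they lack explicit, complementary spatial anchors. Trifuse addresses this by explicitly integrating complementary spatial anchors.
Result: Extensive evaluations on four grounding benchmarks show that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing reliance on expensive annotated data.
Insight: The contribution is the Consensus-SinglePeak (CS) fusion strategy, which integrates attention, OCR text, and icon caption semantics to improve cross-modal agreement and localization precision. Viewed objectively, introducing explicit spatial anchors (OCR and captions) as complementary information improves the reliability and generalization of attention-based GUI grounding, and the approach can serve as a general framework.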
Abstract: GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.
[85] Same Answer, Different Representations: Hidden instability in VLMs cs.AI | cs.CV
Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan
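TL;DR: This paper questions the conventional assumption that the robustness of vision language models can be assessed through output invariance alone, and proposes an evaluation framework combining representation-level and frequency-domain analysis to measure internal embedding drift, spectral sensitivity, and structural smoothness. Evaluating modern VLMs on SEEDBench, MMMU, and POPE reveals three failure modes: predicted answers stay the same while internal representations drift substantially; larger model scale does not improve robustness and may increase sensitivity; and perturbations affect different tasks differently.
Details
Motivation: Conventional evaluation assumes that stable predictions reflect stable multimodal processing. The authors argue that this assumption is insufficient and that the stability of internal representations needs deeper evaluation.
Result: Evaluation on SEEDBench, MMMU, and POPE shows that under perturbations such as text overlays, internal representation drift approaches the magnitude of inter-image variability; larger models achieve higher accuracy but their sensitivity does not decrease and may even increase; and perturbations harm reasoning tasks but, on hallucination benchmarks, can reduce false positives by pushing models toward more conservative answers.
Insight: The contribution is an evaluation framework that goes beyond the output layer to examine the stability of internal representations. It exposes the representation-level fragility of VLMs, shows that scaling up may not improve robustness, and highlights the complexity of perturbation effects (which can unexpectedly reduce hallucination on some tasks). This underscores the need to analyze internal processing when evaluating multimodal models.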
Abstract: The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
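The comparison between embedding drift and inter-image variability can be illustrated with a small sketch. The abstract does not specify the distance measure or the normalization, so everything here is an assumption: cosine distance as the metric, one pooled embedding vector per input, and a hypothetical helper `drift_ratio` that divides the mean clean-to-perturbed distance by the mean distance between different clean inputs. A ratio near 1 would correspond to the case described above, where a perturbed input lands about as far from its original as an unrelated image would.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def drift_ratio(clean, perturbed):
    """Mean drift under perturbation relative to inter-input variability.

    clean[i] and perturbed[i] are embeddings of the same input before and
    after perturbation. Requires at least two inputs.
    """
    n = len(clean)
    drift = sum(cosine_distance(c, p) for c, p in zip(clean, perturbed)) / n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    inter = sum(cosine_distance(clean[i], clean[j]) for i, j in pairs) / len(pairs)
    return drift / inter


# Toy embeddings: two orthogonal "images".
clean = [[1.0, 0.0], [0.0, 1.0]]
unchanged = [[1.0, 0.0], [0.0, 1.0]]   # perturbation leaves embeddings intact
swapped = [[0.0, 1.0], [1.0, 0.0]]     # each embedding moves onto the other image

print(drift_ratio(clean, unchanged))
print(drift_ratio(clean, swapped))
```

In practice the embeddings would come from a model's vision or fused hidden states, and the label-based metrics would be reported alongside the ratio, since the point of the paper is that the two can disagree.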
cs.RO
[86] MultiGraspNet: A Multitask 3D Vision Model for Multi-gripper Robotic Grasping cs.RO | cs.CV
Stephany Ortuno-Chanelo, Paolo Rabino, Enrico Civitelli, Tatiana Tommasi, Raffaello Camoriano
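TL;DR: This paper proposes MultiGraspNet, a novel multitask 3D deep learning model that predicts feasible grasp poses for parallel grippers and vacuum suction cups simultaneously within a unified framework, enabling a single robot to handle multiple end effectors. The model is trained on the aligned GraspNet-1Billion and SuctionNet-1Billion datasets and generates graspability masks that quantify how suitable each scene point is for grasping. By sharing early-stage features while keeping gripper-specific refiners, the model leverages complementary information across grasping modalities, improving robustness and adaptability in cluttered scenes.
Details
Motivation: Existing vision-based grasping models are typically limited to a single gripper or rely on custom hybrid grippers that require ad-hoc learning procedures, which restricts their generality. The goal is a unified framework that handles multiple grippers at once.
Result: The model is competitive with single-task models on relevant benchmarks. In real-world experiments on a single-arm multi-gripper robot, the vacuum grasping task grasps 16% more seen objects and 32% more novel objects than the baseline, and the parallel gripper task also achieves competitive results.
Insight: The contribution is a unified multitask 3D vision model whose architecture, with shared early features and gripper-specific refiners, exploits complementary information across grasping modalities. Viewed objectively, aligning large-scale annotated datasets for multitask training and quantifying suitability through graspability masks are effective ways to improve the generality and adaptability of robotic grasping.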
Abstract: Vision-based models for robotic grasping automate critical, repetitive, and draining industrial tasks. Existing approaches are typically limited in two ways: they either target a single gripper and are potentially applied on costly dual-arm setups, or rely on custom hybrid grippers that require ad-hoc learning procedures with logic that cannot be transferred across tasks, restricting their general applicability. In this work, we present MultiGraspNet, a novel multitask 3D deep learning method that predicts feasible poses simultaneously for parallel and vacuum grippers within a unified framework, enabling a single robot to handle multiple end effectors. The model is trained on the richly annotated GraspNet-1Billion and SuctionNet-1Billion datasets, which have been aligned for the purpose, and generates graspability masks quantifying the suitability of each scene point for successful grasps. By sharing early-stage features while maintaining gripper-specific refiners, MultiGraspNet effectively leverages complementary information across grasping modalities, enhancing robustness and adaptability in cluttered scenes. We characterize MultiGraspNet’s performance with an extensive experimental analysis, demonstrating its competitiveness with single-task models on relevant benchmarks. We run real-world experiments on a single-arm multi-gripper robotic setup showing that our approach outperforms the vacuum baseline, grasping 16% more seen objects and 32% more novel ones, while obtaining competitive results for the parallel task.
[87] Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation cs.RO | cs.CV
Fangyuan Wang, Peng Zhou, Jiaming Qi, Shipeng Lyu, David Navarro-Alarcon
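TL;DR: This paper proposes ThinkProprio, a method for improving vision-language-action (VLA) models. It converts the robot's proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets robot state participate in subsequent visual reasoning and token selection, focusing attention on key visual evidence while suppressing redundant tokens.
Details
Motivation: Existing VLA models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended to during policy execution. The paper addresses this by letting embodied state take part in reasoning earlier and more deeply.
Result: Across the CALVIN and LIBERO benchmarks and real-world manipulation tasks, ThinkProprio matches or exceeds strong baselines while reducing end-to-end inference latency by more than 50%. Experiments show that encoding proprioception as text tokens is more effective than learned projectors, and that retaining only about 15% of visual tokens can match the performance of the full token set.
Insight: The core contribution is the early fusion of proprioception as text tokens, which lets it guide visual attention dynamically and makes visual reasoning more efficient. This gives VLA models a lightweight, effective way to integrate embodied state into high-level understanding and decision-making, improving task performance while reducing computational cost.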
Abstract: Vision-language-action (VLA) models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended to throughout the policy. We introduce ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets embodied state participate in subsequent visual reasoning and token selection, biasing computation toward action-critical evidence while suppressing redundant visual tokens. In a systematic ablation over proprioception encoding, state entry point, and action-head conditioning, we find that text tokenization is more effective than learned projectors, and that retaining roughly 15% of visual tokens can match the performance of using the full token set. Across CALVIN, LIBERO, and real-world manipulation, ThinkProprio matches or improves over strong baselines while reducing end-to-end inference latency by over 50%.
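The two ideas in this abstract, rendering robot state as text that enters the model alongside the instruction and keeping only a small fraction of visual tokens, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the abstract does not describe the tokenization scheme or the selection rule, so the uniform binning in `proprio_to_text`, the prompt layout in `build_prompt`, and the score-based top-k rule in `keep_top_tokens` are all hypothetical.

```python
def proprio_to_text(state, low=-1.0, high=1.0, bins=256):
    """Render each state value as an integer bin index inside a short string.

    Assumption: values are normalized to [low, high]; out-of-range values
    are clipped before binning.
    """
    indices = []
    for value in state:
        clipped = min(max(value, low), high)
        index = min(int((clipped - low) / (high - low) * bins), bins - 1)
        indices.append(str(index))
    return "state: " + " ".join(indices)


def build_prompt(instruction, state):
    """Early fusion: the state text is placed next to the task instruction."""
    return proprio_to_text(state) + " | instruction: " + instruction


def keep_top_tokens(scores, keep_ratio=0.15):
    """Return indices of the highest-scoring visual tokens, in original order.

    Assumption: scores is one relevance value per visual token; how such
    scores are produced is not covered by this sketch.
    """
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


prompt = build_prompt("pick up the red block", [-1.0, 0.0, 1.0])
kept = keep_top_tokens([float(i) for i in range(20)])

print(prompt)
print(kept)
```

With 20 visual tokens and a 15% keep ratio, three tokens survive. The prompt string would then be tokenized by the VLM's own text tokenizer, which is what places the state in the same embedding space as the instruction.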
[88] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO | cs.AI | cs.CV | cs.LG
Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye
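TL;DR: DreamDojo is a generalist robot world model learned from large-scale human videos. It uses 44k hours of egocentric video, addresses the scarcity of action labels by introducing continuous latent actions as unified proxy actions, and, after post-training on small-scale target robot data, demonstrates strong physical understanding and precise action controllability.
Details
Motivation: Simulating the outcomes of actions in varied environments is a challenge for developing generalist agents, especially in dexterous robotics tasks where data coverage is limited and action labels are scarce.
Result: Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of the method for simulating open-world, contact-rich tasks. A distillation pipeline accelerates the model to a real-time speed of 10.81 FPS and further improves context consistency.
Insight: The contributions include pretraining a world model on large-scale unlabeled human video, introducing continuous latent actions as unified proxy actions to enhance the transfer of interaction knowledge, and designing a distillation pipeline that enables real-time inference and improves consistency, paving the way for general-purpose robot world models.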
Abstract: Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.