Table of Contents
- cs.CL [Total: 28]
- cs.CV [Total: 63]
- cs.RO [Total: 3]
- cs.SD [Total: 1]
- eess.SP [Total: 1]
- cs.LG [Total: 6]
- cs.IR [Total: 1]
- eess.IV [Total: 1]
- cs.AI [Total: 2]
cs.CL
[1] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents cs.CL | cs.AI
Jahidul Islam, Md Ataullha, Saiful Azad
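TL;DR: This paper addresses Bangla-to-Python code generation with BanglaCodeAct, a framework based on multi-agent prompting and iterative self-correction. It uses an open-source multilingual LLM that generates, tests, and refines code dynamically in a Thought-Code-Observation loop. Several small-parameter open-source models are evaluated on the mHumanEval dataset; Qwen3-8B combined with BanglaCodeAct performs best.
Details
Motivation: LLMs are strong at code generation from English prompts, but progress on low-resource languages such as Bangla has lagged. The goal is to make Bangla-to-Python code generation more reliable.
Result: On the Bangla NL2Code task of mHumanEval, Qwen3-8B with BanglaCodeAct reaches 94.0% pass@1 on the development set and 71.6% pass@1 on the blind test set, setting a new benchmark for Bangla-to-Python translation.
Insight: The novelty is a framework (BanglaCodeAct) that needs no task-specific fine-tuning and relies on multi-agent prompting and iterative self-correction. Its Thought-Code-Observation loop generates and repairs code dynamically, showing the potential of agent-based reasoning for code generation in low-resource languages.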
Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.
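The generate, test, and refine cycle described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_candidate`, `self_correct`, and `fake_model` are hypothetical names, and the "model" is a hard-coded stand-in that returns a buggy function first and a fixed one once it receives an error observation.

```python
from typing import Callable, List, Tuple


def run_candidate(code: str, tests: List[str]) -> Tuple[bool, str]:
    """Execute candidate code plus assert-style tests; return (passed, observation)."""
    namespace: dict = {}
    try:
        exec(code, namespace)        # define the candidate function
        for test in tests:
            exec(test, namespace)    # each test is a single assert statement
    except Exception as err:         # any failure becomes the observation
        return False, f"{type(err).__name__}: {err}"
    return True, "all tests passed"


def self_correct(generate: Callable[[str, str], str], task: str,
                 tests: List[str], max_rounds: int = 3) -> Tuple[str, int]:
    """Generate, test, and refine until the tests pass or the rounds run out."""
    observation = ""
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(task, observation)                # thought + code
        passed, observation = run_candidate(code, tests)  # observation
        if passed:
            return code, round_no
    return code, max_rounds


# Hypothetical stand-in for the LLM: buggy first attempt, fixed after feedback.
def fake_model(task: str, observation: str) -> str:
    if not observation:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


final_code, rounds = self_correct(fake_model, "add two numbers",
                                  ["assert add(2, 3) == 5"])
```

In a real setting the `generate` callable would wrap an LLM call with the Bangla instruction and the previous observation in the prompt, and the candidate code would run in a sandbox rather than through a bare `exec`.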
[2] Emergent World Beliefs: Exploring Transformers in Stochastic Games cs.CL
Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu
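TL;DR: This paper studies emergent world beliefs of Transformer-based large language models in stochastic games, focusing on incomplete-information domains with Texas Hold'em poker as a canonical partially observable Markov decision process. A GPT-style model is pretrained on poker hand history data and its internal activations are probed. The model learns both deterministic and stochastic features, such as hand rank and equity, and these representations correlate with theoretical belief states.
Details
Motivation: Extend research on emergent world models in LLMs from perfect-information games to incomplete-information domains, and explore how a model learns internal representations in a stochastic environment such as poker, in order to understand its reasoning ability.
Result: After pretraining on poker hand history data, nonlinear probes decode features such as hand rank and equity from the model's internal activations. These representations correlate with theoretical belief states, suggesting that LLMs form their own representation of the stochastic environment of Texas Hold'em.
Insight: The novelty is extending world-model research on LLMs to POMDPs, showing that a model can learn representations of a stochastic environment in an incomplete-information game without supervision. This offers a new perspective on LLM reasoning mechanisms and on applying LLMs to complex decision-making tasks.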
Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold’em Poker.
[3] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection cs.CL | cs.AI
Anwar Alajmi, Gabriele Pergola
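TL;DR: This paper proposes a two-stage framework for detecting sexist content online, targeting class imbalance, annotation noise, and conceptual ambiguity in the data. It combines targeted training strategies (class-balanced focal loss, class-aware batching, and post-hoc threshold calibration) for scarce and noisy data with a dynamic routing mechanism at inference time: high-confidence samples are classified directly, while low-confidence samples are routed to a novel Collaborative Expert Judgment module that simulates multiple expert personas and consolidates their reasoning to handle ambiguous or borderline cases.
Details
Motivation: Sexist content online increasingly takes subtle, context-dependent forms that traditional detection methods struggle with. Its interpretation spans overlapping linguistic, psychological, legal, and cultural dimensions, which leads to noisy and contradictory annotations. Together with label scarcity and class imbalance, this makes the decision boundaries of fine-tuned models unstable and causes them to overlook subtler, underrepresented forms of harm. A design is therefore needed that addresses underrepresentation, noise, and conceptual ambiguity in both data and model predictions.
Result: The method achieves state-of-the-art (SOTA) results on several benchmarks: F1 improves by 2.72% on EXIST 2025 Task 1.1, and by 4.48% and 1.30% on EDOS Task A and Task B, respectively.
Insight: The core novelty is combining robust training strategies for imbalanced, noisy data with a selective, reasoning-based inference stage. In particular, the Collaborative Expert Judgment module handles uncertain cases by simulating a debate among multiple expert personas and consolidating their reasoning with a judge model, which gives a novel and interpretable decision mechanism for ambiguous and borderline cases in classification. This "deliberate when in doubt" dynamic routing paradigm may generalize to other complex, multi-dimensional social computing tasks.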
Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on EDOS Tasks A and B, respectively.
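A rough Python sketch of the inference-time routing idea follows. It is an illustration under stated assumptions, not the paper's CEJ module: the function name `route`, the `margin` parameter, and the use of a simple majority vote in place of a judge model are all hypothetical choices made to keep the example self-contained.

```python
from typing import Callable, List


def route(prob_sexist: float, threshold: float, margin: float,
          personas: List[Callable[[str], int]], text: str) -> str:
    """Classify directly when confident; otherwise escalate to persona votes."""
    # Confident case: the probability is far enough from the calibrated threshold.
    if abs(prob_sexist - threshold) >= margin:
        label = int(prob_sexist >= threshold)
        return f"direct:{label}"
    # Uncertain case: ask each persona for a 0/1 verdict.
    votes = [persona(text) for persona in personas]
    # Assumption: a strict majority vote stands in for the judge model.
    label = int(sum(votes) * 2 > len(votes))
    return f"debate:{label}"


# Hypothetical personas; in practice each would be a prompted LLM.
linguist = lambda text: 1
lawyer = lambda text: 1
psychologist = lambda text: 0

confident = route(0.95, 0.5, 0.2, [linguist, lawyer, psychologist], "example post")
uncertain = route(0.55, 0.5, 0.2, [linguist, lawyer, psychologist], "example post")
```

The `threshold` argument is where a post-hoc calibrated decision threshold would plug in, and `margin` controls how many cases are escalated to the more expensive debate path.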
[4] Break Out the Silverware – Semantic Understanding of Stored Household Items cs.CL | cs.AI | cs.CV | cs.RO
Michaela Levi-Richter, Reuth Mirsky, Oren Glickman
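TL;DR: This paper proposes the Stored Household Item Challenge, a benchmark task that evaluates the cognitive ability of service robots to infer the most likely storage location (such as a drawer or cabinet) of everyday items (such as plates) in household scenes. The authors build two datasets and introduce NOAM, a hybrid agent model that combines structured scene understanding with large language model reasoning.
Details
Motivation: Address a core challenge for domestic service robots: using commonsense reasoning to infer where an everyday item is most likely stored when it is out of sight (for example in a drawer or cabinet), so that simple commands such as "bring me a plate" can be carried out.
Result: On the proposed benchmark, NOAM significantly outperforms baselines including random selection, vision-language pipelines, and leading multimodal models, and its prediction accuracy approaches human level.
Insight: The novelty is converting visual input into natural language descriptions (spatial context and visible containers) and then using a large language model (e.g., GPT-4) for commonsense reasoning to predict the hidden storage location. This hybrid agent architecture, combining structured scene understanding with LLM reasoning, is a useful example for deploying cognitively capable robotic systems in home environments.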
Abstract: “Bring me a plate.” For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots’ cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants’ kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.
[5] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning cs.CL | cs.AI
Tiancheng Su, Meicong Zhang, Guoxiu He
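TL;DR: This paper proposes Entropy-Aware Speculative Decoding (EASD), a training-free enhancement of speculative decoding for LLM inference acceleration. Building on standard speculative decoding (SD), it adds a dynamic entropy-based penalty that quantifies model uncertainty and rejects low-confidence candidate tokens to prevent error propagation, which makes it possible to surpass the inherent performance of the target LLM.
Details
Motivation: Standard SD accelerates LLM inference by having a small draft model generate candidate tokens, but excessive alignment between the draft and target models caps performance at that of the target LLM.
Result: Experiments on multiple reasoning benchmarks show that EASD consistently outperforms existing SD methods and in most cases surpasses the target LLM itself, with efficiency comparable to SD.
Insight: The novelty is the dynamic entropy-based penalty: the entropy of the sampling distribution quantifies model uncertainty, and a token is rejected when both models show high entropy and their top-N predictions overlap substantially. This lets the draft model take part in verification and may lift the performance ceiling set by the target model, all without training.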
Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model’s inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
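The rejection condition described here (both models uncertain, with overlapping top-N predictions) can be sketched in a few lines of Python. The abstract does not give the exact penalty or thresholds, so `entropy_threshold`, `overlap_threshold`, and the function names below are assumptions for illustration only.

```python
import math
from typing import Dict, Set


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def top_n(dist: Dict[str, float], n: int) -> Set[str]:
    """The n most probable tokens of a distribution."""
    return set(sorted(dist, key=dist.get, reverse=True)[:n])


def should_reject(draft: Dict[str, float], target: Dict[str, float], n: int = 2,
                  entropy_threshold: float = 1.0,
                  overlap_threshold: float = 0.5) -> bool:
    """Reject the drafted token when both models are uncertain and agree on candidates.

    The thresholds are hypothetical; the paper's dynamic penalty may differ.
    """
    both_uncertain = (entropy(draft) > entropy_threshold
                      and entropy(target) > entropy_threshold)
    overlap = len(top_n(draft, n) & top_n(target, n)) / n
    return both_uncertain and overlap >= overlap_threshold


flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}      # high entropy
peaked = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}    # low entropy

reject_uncertain = should_reject(flat, flat)
reject_confident = should_reject(peaked, peaked)
```

When `should_reject` returns true, the token would be re-sampled by the target model, as the abstract describes; otherwise the usual speculative-decoding acceptance test applies.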
[6] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs cs.CL | cs.CE | cs.LG
Yukun Zhang, Stefan Elbl Droguett, Samyak Jain
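TL;DR: This study targets errors in financial numerical reasoning QA caused by a lack of domain knowledge. It proposes a multi-retriever retrieval-augmented generation system that combines external domain knowledge with internal question context and uses the latest LLM to answer financial numerical questions. Experiments show that domain-specific training with the SecBERT encoder contributes significantly to performance, and the best prompt-based LLM generator achieves SOTA performance while remaining below human expert level.
Details
Motivation: Reduce errors in financial numerical reasoning QA that stem from missing financial domain knowledge and from complex multi-step numerical reasoning, and improve model performance on financial QA.
Result: The best neural symbolic model surpasses the top baseline from the FinQA paper. The best prompt-based LLM generator achieves SOTA performance on the benchmark with a significant improvement of more than 7%, but is still below human expert level.
Insight: A multi-retriever RAG system integrates external domain knowledge with internal context. Domain-specific training (e.g., the SecBERT encoder) contributes substantially to performance. For large models the gains from external facts typically outweigh the hallucination loss, whereas small models must trade hallucination loss against external knowledge gains. The latest LLM shows enhanced numerical reasoning under few-shot learning.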
Abstract: This research project addresses errors in financial numerical reasoning Question Answering (QA) tasks that stem from a lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper’s top model, which serves as our baseline. This suggests that domain-specific training can yield superior performance. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with a significant improvement (>7%), yet it is still below human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.
[7] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics cs.CL | cs.CY
Conrad Borchers, Manit Patel, Seiyon M. Lee, Anthony F. Botelho
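TL;DR: This paper presents an analytics-first framework for open response analytics that separates content signals from rater tendencies, making grading judgments visible and auditable. It models teacher histories as dynamic priors, derives text representations from sentence embeddings, and applies centering and residualization to reduce prompt and teacher confounds. The model combining teacher priors with content embeddings performs best (AUC 0.815), while the content-only model is weaker (AUC 0.626).
Details
Motivation: Address the conflation of content signals with teacher grading tendencies in automated scoring of open responses, with the aim of making grading more transparent and auditable in support of teaching reflection and research.
Result: On the ASSISTments mathematics response dataset, the model combining teacher priors with content embeddings reaches AUC 0.815, while the content-only model reaches AUC 0.626, showing that teacher priors strongly influence grade prediction.
Insight: The novelty is modeling teacher histories as dynamic priors and using centering and residualization to separate content from rater effects. This makes the embedding representation more informative, so that learning analytics reflect evidence of student reasoning and learning more clearly.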
Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC 0.815), while content-only models remain above chance but substantially weaker (AUC 0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.
[8] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling cs.CL | cs.AI | cs.LG
Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou
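TL;DR: This paper proposes HGMem, a hypergraph-based memory mechanism that improves multi-step retrieval-augmented generation (RAG) for long-context complex relational modeling. It extends memory from static storage to a dynamic hypergraph structure that captures high-order correlations among facts, improving global understanding and reasoning.
Details
Motivation: The working memory module in existing RAG systems acts mainly as passive storage that accumulates isolated facts and overlooks the crucial high-order correlations among primitive facts, which leads to fragmented reasoning and weak global sense-making in extended contexts.
Result: Extensive experiments on several challenging datasets designed for global sense-making show that HGMem consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
Insight: The core novelty is extending memory from simple storage to a dynamic, expressive hypergraph whose hyperedges correspond to distinct memory units. Higher-order interactions form progressively within memory, building an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps.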
Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
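To make the data structure concrete, here is a small Python sketch of a memory stored as a hypergraph, where each hyperedge is a memory unit grouping any number of facts. The class name `HypergraphMemory` and its methods are hypothetical and only illustrate the general idea of hyperedges as memory units; HGMem's actual update and retrieval logic is not specified in the abstract.

```python
from typing import Dict, Iterable, List, Set


class HypergraphMemory:
    """Memory as a hypergraph: nodes are facts, hyperedges are memory units."""

    def __init__(self) -> None:
        self.hyperedges: Dict[str, Set[str]] = {}     # unit id -> facts it connects
        self.fact_to_units: Dict[str, Set[str]] = {}  # fact -> units containing it

    def add_unit(self, unit_id: str, facts: Iterable[str]) -> None:
        """Store a memory unit that links an arbitrary number of facts."""
        fact_set = set(facts)
        self.hyperedges[unit_id] = fact_set
        for fact in fact_set:
            self.fact_to_units.setdefault(fact, set()).add(unit_id)

    def merge_units(self, new_id: str, unit_ids: Iterable[str]) -> None:
        """Form a higher-order unit from the union of existing units' facts."""
        merged: Set[str] = set()
        for unit_id in unit_ids:
            merged |= self.hyperedges[unit_id]
        self.add_unit(new_id, merged)

    def related_units(self, fact: str) -> List[str]:
        """All memory units that mention a given fact, in sorted order."""
        return sorted(self.fact_to_units.get(fact, set()))


memory = HypergraphMemory()
memory.add_unit("u1", ["A founded X", "X is based in Paris"])
memory.add_unit("u2", ["X is based in Paris", "Paris hosted the summit"])
memory.merge_units("u3", ["u1", "u2"])
```

Unlike an ordinary graph edge, a hyperedge can connect three or more facts at once, which is what lets a single unit express a composition of facts rather than a pairwise link.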
[9] Efficient Context Scaling with LongCat ZigZag Attention cs.CL | cs.AI
Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang
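TL;DR: This paper proposes LongCat ZigZag Attention (LoZA), a sparse attention scheme for converting existing full-attention models into sparse versions with a limited compute budget. In long-context scenarios LoZA achieves significant speed-ups on both prefill-intensive and decode-intensive tasks. By applying LoZA to LongCat-Flash during mid-training, the authors present LongCat-Flash-Exp as a long-context foundation model that efficiently processes sequences of up to 1 million tokens, supporting efficient long-term reasoning and long-horizon agentic capabilities.
Details
Motivation: Full-attention models are too expensive to compute in long-context scenarios. The goal is an efficient sparse attention scheme that extends a model's context capacity within a limited budget.
Result: LongCat-Flash-Exp, obtained by applying LoZA to LongCat-Flash during mid-training, efficiently processes contexts of up to 1 million tokens and achieves significant speed-ups on both prefill-intensive tasks (e.g., retrieval-augmented generation) and decode-intensive tasks (e.g., tool-integrated reasoning).
Insight: The novelty is the LoZA sparse attention scheme, a general method for efficiently sparsifying existing full-attention models so that very long contexts can be handled at lower compute cost. This offers a new approach to building efficient long-context foundation models.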
Abstract: We introduce LongCat ZigZag Attention (LoZA), a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups for both prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.
[10] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards cs.CL
Zhiming Lin, Kai Zhao, Sophie Zhang, Peilai Yu, Canran Xiao
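TL;DR: CEC-Zero is a zero-supervision reinforcement learning framework for Chinese spelling correction (CSC). It synthesizes errorful inputs from clean text, computes cluster-consensus rewards from semantic similarity and candidate agreement, and optimizes the policy with PPO, enabling an LLM to correct its own mistakes. It outperforms supervised baselines and strong LLM fine-tunes across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence.
Details
Motivation: In large-scale Chinese spelling correction, existing LLMs and supervised methods lack robustness to novel errors and depend on costly annotated data.
Result: Across 9 benchmarks, F1 is 10 to 13 points higher than supervised baselines and 5 to 8 points higher than strong LLM fine-tunes, reaching a new SOTA.
Insight: The novelty is a fully label-free reinforcement learning paradigm in which self-generated rewards (based on semantic similarity and candidate agreement) let an LLM correct its own errors. It offers a new approach to robust, scalable CSC and comes with theoretical guarantees.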
Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10–13 F$_1$ points and strong LLM fine-tunes by 5–8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.
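One way to picture a consensus-style reward is to score each sampled correction by how much it agrees with the other samples for the same input. The Python sketch below is an assumption-laden illustration, not the CEC-Zero reward: `consensus_rewards` is a hypothetical name, and a character-overlap ratio from the standard library `difflib` module stands in for the semantic similarity model.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def char_similarity(a: str, b: str) -> float:
    """Character-overlap ratio; a cheap stand-in for semantic similarity."""
    return SequenceMatcher(None, a, b).ratio()


def consensus_rewards(candidates: List[str],
                      similarity: Callable[[str, str], float] = char_similarity
                      ) -> List[float]:
    """Reward each candidate by its mean similarity to the other candidates."""
    if len(candidates) < 2:
        return [0.0 for _ in candidates]
    rewards = []
    for i, cand in enumerate(candidates):
        others = [similarity(cand, other)
                  for j, other in enumerate(candidates) if j != i]
        rewards.append(sum(others) / len(others))
    return rewards


# Three sampled corrections of one noisy sentence; the third keeps a typo.
samples = ["我今天很高兴", "我今天很高兴", "我今天很高心"]
exact = consensus_rewards(samples, similarity=lambda a, b: float(a == b))
soft = consensus_rewards(samples)
```

Candidates that most other samples agree with receive the highest reward, which is the signal a policy-gradient method such as PPO would then optimize; no gold label is consulted at any point.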
[11] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process cs.CL | cs.AI | cs.LG
Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang
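TL;DR: This paper proposes RISE, an unsupervised framework that uses sparse auto-encoders (SAEs) to automatically discover "reasoning vectors" in the LLM activation space that encode distinct reasoning behaviors. This reveals the reasoning process and allows controllable intervention, including identifying known behaviors (such as reflection and backtracking) and discovering new ones (such as confidence-related vectors).
Details
Motivation: Existing methods rely on human-defined concepts (such as overthinking and reflection) at the word level to analyze LLM reasoning in a supervised manner. They cannot capture the full range of potential reasoning behaviors, many of which are hard to define in token space.
Result: By segmenting chain-of-thought traces into sentence-level "steps" and training SAEs on step-level activations, the authors find disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Targeted interventions on SAE-derived vectors controllably amplify or suppress specific reasoning behaviors, altering the reasoning trajectory without retraining.
Insight: The novelty is the unsupervised RISE framework, which uses SAEs to automatically discover latent directions in the LLM activation space that encode reasoning behaviors. It disentangles known behaviors and also discovers new ones beyond human supervision (such as confidence-related vectors), offering a new way to interpret and controllably steer LLM reasoning.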
Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level ‘steps’ and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
[12] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning cs.CL | cs.AI
Sijia Chen, Di Niu
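TL;DR: This paper proposes iCLP, a framework that mimics human implicit cognition so that LLMs adaptively generate compact latent plans and plan implicitly during reasoning. It significantly improves accuracy and efficiency on mathematical reasoning and code generation tasks and shows strong cross-domain generalization.
Details
Motivation: Generating explicit textual plans with LLMs is hampered in accuracy and effectiveness by hallucinations and the diversity of task-specific questions. Inspired by human implicit cognition, the goal is an implicit planning method that produces compact plans and generalizes well.
Result: Experiments on mathematical reasoning and code generation show that iCLP significantly improves accuracy and efficiency and exhibits strong cross-domain generalization, while preserving the interpretability of chain-of-thought reasoning.
Insight: The novelty is bringing the mechanism of human implicit cognition into LLM planning: a vector-quantized autoencoder with a codebook learns discrete latent plan representations, so that the model plans in latent space while reasoning in language space, combining implicit planning with explicit reasoning.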
Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.
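The discretization step of a vector-quantized autoencoder can be shown with a short Python sketch: a continuous plan encoding is replaced by its nearest codebook entry, and the index of that entry serves as the discrete latent token. The function name `quantize` and the toy two-dimensional codebook are illustrative assumptions, not details from iCLP.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Map a continuous encoding to the index and value of its nearest code."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Toy codebook with three entries; a real one would be learned during training.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
plan_encoding = [0.9, 1.2]  # hypothetical encoder output for one plan
code_index, code_vector = quantize(plan_encoding, codebook)
```

During training, a vector-quantized autoencoder also updates the codebook entries and passes gradients around the non-differentiable lookup; those parts are omitted here to keep the nearest-neighbour step in focus.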
[13] HY-MT1.5 Technical Report cs.CL
Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, Di Wang
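TL;DR: This paper introduces the HY-MT1.5 family of machine translation models, with 1.8B- and 7B-parameter versions. They are developed through a multi-stage training framework that integrates general and translation-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. On standard Chinese-foreign and English-foreign translation tasks the models outperform many large open-source models and mainstream commercial APIs, match or approach top proprietary models at certain scales, and support advanced constraints such as terminology intervention and context-aware translation.
Details
Motivation: Develop a high-performance, parameter-efficient family of machine translation models whose tailored, holistic training framework lets them match or exceed the translation quality of much larger models and commercial translation services while keeping model size small.
Result: HY-MT1.5-1.8B comprehensively outperforms large open-source models such as Tower-Plus-72B and Qwen3-32B and commercial APIs such as Microsoft Translator on standard Chinese-foreign and English-foreign tasks, reaching approximately 90% of the performance of Gemini-3.0-Pro. HY-MT1.5-7B sets a new SOTA for its size class, reaching 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing Gemini-3.0-Pro on WMT25 and Mandarin-minority language test sets.
Insight: The novelty is a holistic framework tailored to high-performance translation that integrates multi-stage training (pre-training, fine-tuning, distillation, and reinforcement learning). It improves parameter efficiency so that smaller models match or exceed much larger ones. The models also support advanced constraints such as terminology intervention, which adds practical value.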
Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro. While it marginally trails Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro’s performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.
[14] Large Emotional World Model cs.CL
Changhao Song, Yazhou Zhang, Hui Gao, Chang Yang, Peng Zhang
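TL;DR: This paper proposes a Large Emotional World Model (LEWM) that explicitly models emotional states as a key component of world knowledge, in order to better predict emotion-driven social behavior. The Emotion-Why-How (EWH) dataset integrates emotion into causal reasoning so that the model can predict future states and emotional transitions. Experiments show that LEWM predicts emotion-driven social behaviors more accurately while maintaining performance on basic tasks.
Details
Motivation: Existing large language models (LLMs) focus mainly on modeling physical-world regularities and lack systematic exploration of emotional factors, even though emotion is a key part of world knowledge and significantly influences human decision-making. The paper shows that removing emotional information degrades reasoning performance, which underlines the importance of emotion for understanding the world, and builds a world model that models emotion explicitly.
Result: Experiments show that LEWM predicts emotion-driven social behaviors more accurately while remaining comparable to general world models on basic tasks.
Insight: The novelty is systematically integrating emotional states into the causal reasoning framework of a world model, with a dedicated EWH dataset that links emotions, causes, and actions. This suggests a direction for AI systems that are closer to human cognition and can understand social interaction, and is relevant to work at the intersection of affective computing and social intelligence.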
Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.
[15] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring cs.CL
Qipeng Wang, Rui Sheng, Yafei Li, Huamin Qu, Yushi Sun
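TL;DR: This paper proposes MedKGI, a framework that integrates a medical knowledge graph with an information-gain-guided inquiry strategy to emulate the iterative reasoning of clinical diagnosis, addressing hallucination, inefficient questioning, and dialogue inconsistency in LLM-based medical diagnosis.
Details
Motivation: Current LLMs struggle to emulate real iterative, hypothesis-driven reasoning in clinical diagnosis. They have three key limitations: generating hallucinated medical content, asking redundant and inefficient questions, and losing coherence over multi-turn dialogues.
Result: On clinical benchmarks, MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
Insight: The novelty is using a medical knowledge graph to constrain reasoning to validated medical ontologies, selecting questions by information gain to maximize diagnostic efficiency, and adopting an OSCE-format structured state to keep evidence tracking consistent across turns, which improves the reliability and efficiency of diagnosis.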
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions that hinder diagnostic progress, rather than discriminative ones, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
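Selecting a question by information gain can be worked through on a toy example. The Python sketch below assumes yes/no questions and a known probability of a "yes" answer under each candidate diagnosis; the function names, the two diseases, and all probabilities are made up for illustration and are not taken from MedKGI.

```python
import math
from typing import Dict


def entropy_bits(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a distribution over diagnoses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(prior: Dict[str, float],
                              p_yes: Dict[str, float]) -> float:
    """Expected entropy reduction from asking one yes/no question.

    prior: current belief over diagnoses.
    p_yes: probability of a "yes" answer given each diagnosis.
    """
    gain = entropy_bits(prior)
    for answer_is_yes in (True, False):
        likelihood = {d: p_yes[d] if answer_is_yes else 1.0 - p_yes[d] for d in prior}
        answer_prob = sum(prior[d] * likelihood[d] for d in prior)
        if answer_prob > 0:
            posterior = {d: prior[d] * likelihood[d] / answer_prob for d in prior}
            gain -= answer_prob * entropy_bits(posterior)
    return gain


prior = {"flu": 0.5, "cold": 0.5}
questions = {
    "high fever?": {"flu": 1.0, "cold": 0.0},  # perfectly discriminative (toy values)
    "cough?": {"flu": 0.5, "cold": 0.5},       # uninformative (toy values)
}
gains = {q: expected_information_gain(prior, p) for q, p in questions.items()}
best_question = max(gains, key=gains.get)
```

The discriminative question removes the full bit of uncertainty in this toy prior, while the redundant question removes none, which is the behaviour the abstract contrasts.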
[16] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking cs.CL
Meiqi Chen, Fandong Meng, Jie Zhou
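TL;DR: This paper proposes FIGR, a model that integrates active visual thinking into multi-turn reasoning via reinforcement learning, addressing the difficulty of representing in text alone the implicit spatial, geometric, and structural relationships found in complex reasoning problems.
Details
Motivation: Purely text-based reasoning struggles to capture global structural constraints in complex settings, so visual representations are introduced to externalize intermediate structural hypotheses and make reasoning more stable and coherent.
Result: On challenging mathematical reasoning benchmarks, FIGR outperforms strong text-only chain-of-thought baselines, improving the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME.
Insight: The novelty is using end-to-end reinforcement learning to adaptively regulate when and how visual reasoning is invoked, enabling figure-guided multimodal reasoning that makes complex reasoning more stable and reliable.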
Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.
[17] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs cs.CL
Shupeng Li, Weipeng Lu, Linyun Liu, Chen Lin, Shaofei Li
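TL;DR: This paper introduces QianfanHuijin, a financial large language model, and proposes a generalizable multi-stage training paradigm that strengthens the model's financial knowledge, reasoning, and agentic capabilities through continual pre-training followed by progressively refined post-training (Financial SFT, Finance Reasoning RL, Finance Agentic RL, and General RL).
Details
Motivation: Existing financial LLMs such as BloombergGPT and Baichuan-Finance focus mainly on knowledge enhancement, while increasingly complex financial services demand stronger financial reasoning and agentic capabilities.
Result: QianfanHuijin achieves superior performance across several authoritative financial benchmarks, and ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities.
Insight: The novelty is a fine-grained, progressive post-training methodology that breaks financial capability enhancement into successive stages for knowledge, reasoning, agentic ability, and business alignment, offering a generalizable paradigm for training industrial domain-specific LLMs.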
Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.
[18] World model inspired sarcasm reasoning with large language model agents cs.CLPDF
Keito Inoshita, Shinnosuke Mizuno
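TL;DR: This paper proposes WM-SAR, a world-model-inspired approach to sarcasm understanding. It decomposes the task into literal meaning, context, normative expectation, and intention, models each with a specialized LLM-based agent, and combines an inconsistency score and an intention score through logistic regression to infer the sarcasm probability, improving performance while remaining interpretable.
Details
Motivation: The cognitive factors behind sarcasm are hard to explain structurally. Most existing methods are black-box predictors, and few frameworks explicitly model the mismatch between semantic evaluation and normative expectations or intentions.
Result: On representative sarcasm detection benchmarks, WM-SAR consistently outperforms existing deep learning and LLM-based methods; ablation studies and case analyses support its effectiveness and interpretability.
Insight: The novelty is reformulating sarcasm understanding as a world-model-inspired reasoning process: the components are decomposed and modeled by specialized LLM agents, the inconsistency score is quantified explicitly, and it is combined with an intention score, balancing performance and interpretability. Viewed objectively, the method offers a structured, interpretable sarcasm reasoning framework that may extend to other NLP tasks requiring complex contextual reasoning.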
Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker’s intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
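The final scoring step described in the abstract above (a deterministic inconsistency score combined with an intention score by logistic regression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the score ranges, the absolute-difference form of the inconsistency score, and the weights and bias are all hypothetical; in practice the agent outputs would come from LLM calls and the coefficients would be fitted on labeled data.

```python
import math

def inconsistency_score(literal_eval: float, normative_expectation: float) -> float:
    """Gap between literal evaluation and normative expectation.

    Both inputs are assumed (hypothetically) to lie in [-1, 1];
    the result is scaled to [0, 1].
    """
    return abs(literal_eval - normative_expectation) / 2.0

def sarcasm_probability(inconsistency: float, intention: float,
                        weights=(3.0, 2.0), bias=-2.5) -> float:
    """Logistic-regression head over the two interpretable signals.

    The weights and bias are illustrative placeholders, not fitted values.
    """
    z = weights[0] * inconsistency + weights[1] * intention + bias
    return 1.0 / (1.0 + math.exp(-z))

# "Great, another Monday": literally positive (+1) where the norm expects negative (-1).
p_sarcastic = sarcasm_probability(inconsistency_score(1.0, -1.0), intention=1.0)
p_literal = sarcasm_probability(inconsistency_score(0.5, 0.5), intention=0.0)
```

Because the decision reduces to two named numbers and a linear layer, each prediction can be traced back to which signal drove it.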
[19] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback cs.CL | cs.HCPDF
Titas Ramancauskas, Kotryna Ramancauske
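TL;DR: This paper presents an intelligent revision platform for the IELTS writing exam that integrates an Automated Essay Scoring system with personalised feedback. Following a Design-Based Research approach, the study iterates from a rule-based scoring system to a DistilBERT-based regression model and evaluates how adaptive feedback affects candidates' scores.
Details
Motivation: Traditional IELTS writing preparation lacks personalised feedback tailored to the IELTS rubric. The platform addresses this with automated scoring and targeted feedback to support candidates' preparation.
Result: The DistilBERT-based regression model substantially improves scoring, with an MAE of 0.66 and a positive R². Adaptive feedback raises scores by a mean of 0.060 bands (p = 0.011), a statistically significant effect, though effectiveness varies by revision strategy.
Insight: The novelty lies in separating conversational guidance from a dedicated writing interface to reduce cognitive load, and in using a transformer-based AES system to deliver personalised feedback. The study finds that automated feedback is best suited as a supplement to human instruction, and that conservative surface-level corrections are more reliable than aggressive structural interventions in IELTS preparation.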
Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback tailored to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, combined with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5’s adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen’s d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback functions are most suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
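The two scoring metrics quoted in the abstract above, MAE and $R^2$, can be computed as in the sketch below, which also shows why a scorer that compresses every essay toward the middle band can end up with a negative $R^2$. The band values are made up for illustration and are not data from the paper.

```python
from typing import Sequence

def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between true and predicted band scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Negative values mean the predictor does worse than always
    predicting the mean of the true scores.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true scores are identical")
    return 1.0 - ss_res / ss_tot

# Illustrative band scores only (not from the paper).
true_bands = [5.0, 6.0, 7.0]
compressed = [6.5, 6.5, 6.5]   # a scorer that predicts the same mid band for everyone
regression = [5.5, 6.0, 6.5]   # a scorer that tracks the true ordering
```

Here `r2(true_bands, compressed)` is negative while `r2(true_bands, regression)` is positive, which mirrors the qualitative shift the abstract reports between the rule-based and transformer-based cycles.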
[20] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech cs.CLPDF
Fabian Retkowski, Alexander Waibel
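TL;DR: This paper recasts paragraph segmentation as a structuring step in speech processing. It introduces TEDPara and YTSegPara, the first paragraph segmentation benchmarks for the speech domain, develops a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, and presents a compact model, MiniSeg, that attains state-of-the-art accuracy and can jointly predict chapters and paragraphs hierarchically.
Details
Motivation: Automatic speech transcripts lack paragraph structure, which hampers readability and repurposing. The work fills gaps at the intersection of speech processing and text segmentation.
Result: On the proposed TEDPara and YTSegPara benchmarks, MiniSeg attains state-of-the-art accuracy, and its hierarchical extension jointly predicts chapters and paragraphs at minimal computational cost.
Insight: Contributions include the first paragraph segmentation benchmarks for speech, a constrained-decoding method that keeps the transcript verbatim, and an efficient, compact hierarchical segmentation model, together offering a standardized approach to structuring tasks in speech processing.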
Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
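The key property of the constrained-decoding formulation in the abstract above is that the output, with breaks removed, is exactly the input transcript. The sketch below illustrates that constraint only: at each sentence boundary the decoder may either copy the next sentence or first open a new paragraph. The `toy_break_score` function is a hypothetical stand-in for a language model's break probability, not anything from the paper.

```python
from typing import Callable, List

def segment(sentences: List[str],
            break_score: Callable[[List[str], str], float],
            threshold: float = 0.5) -> List[List[str]]:
    """Group sentences into paragraphs without altering or reordering them."""
    paragraphs: List[List[str]] = []
    for sentence in sentences:
        # The only two allowed actions: open a new paragraph, or continue the current one.
        if not paragraphs or break_score(paragraphs[-1], sentence) > threshold:
            paragraphs.append([])
        paragraphs[-1].append(sentence)
    return paragraphs

def toy_break_score(current_paragraph: List[str], next_sentence: str) -> float:
    """Hypothetical scorer: break before sentences that open with a discourse marker."""
    return 1.0 if next_sentence.startswith(("Next,", "Finally,")) else 0.0

transcript = ["Hello.", "Today we talk.", "Next, the data.", "It is big.", "Finally, thanks."]
paragraphs = segment(transcript, toy_break_score)
flattened = [s for paragraph in paragraphs for s in paragraph]
```

Since `flattened` always equals `transcript`, predicted breaks can be compared with reference breaks sentence by sentence, which is what makes the evaluation faithful and sentence-aligned.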
[21] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs cs.CLPDF
Muhammad Abdullahi Said, Muhammad Sammani Sani
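TL;DR: This study systematically audits the safety alignment of three advanced LLMs (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) in Hausa, a West African language. It shows that safety does not simply degrade linearly as language resources shrink, but is determined by a complex interaction of variables such as language and temporal framing. The models exhibit a marked temporal asymmetry: past-tense framing readily bypasses safety defenses, while future-tense framing triggers overly conservative refusals. The findings also challenge the common assumption that multilingual safety is necessarily worse.
Details
Motivation: To test the dangerous assumption that safety alignment transfers zero-shot from English to other languages, and to assess model vulnerabilities in threat scenarios local to West Africa (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing), exposing blind spots in current safety mechanisms.
Result: Across 1,440 evaluations on the HausaSafety dataset, the study finds a Complex Interference mechanism: Claude 4.5 Opus is safer in Hausa (45.0% safe) than in English (36.7%), yet the models fail catastrophically on temporal reasoning, with only 15.6% safe under past-tense framing versus 57.2% under future-tense framing, and a 9.2x disparity between the safest and most vulnerable configurations.
Insight: The novelty lies in the Complex Interference mechanism and the Temporal Asymmetry concept, showing that model safety is a context-dependent state rather than a fixed property and relies on superficial heuristics rather than robust semantic understanding, which creates Safety Pockets that expose Global South users to localized harms. The authors propose Invariant Alignment as a necessary paradigm shift to keep safety stable across linguistic and temporal shifts.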
Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
[22] Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs’ Legal Reasoning Capabilities cs.CLPDF
Hongseok Oh, Wonseok Hwang, Kyoung-Woon On
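TL;DR: This paper introduces the Korean Canonical Legal Benchmark (KCL), designed to assess language models' legal reasoning independently of domain knowledge. KCL has two components, multiple-choice questions (KCL-MCQA) and open-ended generation (KCL-Essay), and supplies supporting precedents for each question to separate reasoning ability from parameterized knowledge. A systematic evaluation of 30+ models shows large remaining gaps, especially on KCL-Essay, and reasoning-specialized models generally outperform general-purpose ones.
Details
Motivation: To build a benchmark that evaluates legal reasoning on its own terms, since existing evaluations struggle to separate reasoning ability from domain-specific knowledge.
Result: Evaluating 30+ models on KCL reveals large performance gaps, particularly on the open-ended generation task (KCL-Essay); reasoning-specialized models consistently outperform general-purpose models.
Insight: The novelty is providing question-level supporting precedents, which gives a cleaner measurement of legal reasoning by separating it from parameterized knowledge. This offers a new approach and tool for assessing the core capabilities of LLMs in specialized domains.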
Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models’ legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.
[23] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time cs.CL | cs.AI | cs.LGPDF
Zhenyu Zhang, Xiaoxia Wu, Zhongzhu Zhou, Qingyang Wu, Yineng Zhang
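TL;DR: This paper proposes CREST, a training-free method for steering the cognitive reasoning behaviors of large language models at inference time. An offline calibration step identifies attention heads associated with specific cognitive behaviors, and at inference time hidden representations are rotated to suppress inefficient reasoning patterns, raising accuracy while reducing computational cost.
Details
Motivation: LLMs rely on long chain-of-thought reasoning to solve complex tasks, but these trajectories are often inefficient, showing high latency or unstable reasoning such as underthinking or overthinking.
Result: Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, yielding faster and more reliable reasoning.
Insight: The novelty lies in uncovering specialized attention heads in reasoning trajectories that correlate with cognitive behaviors, and in steering the model away from inefficient modes with a lightweight inference-time intervention. Viewed objectively, the method is a practical, training-free route to better reasoning efficiency and stability.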
Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
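The inference-time step in the abstract above (rotating a hidden representation to suppress its component along a steering vector) can be pictured with the following sketch. It is one plausible reading, not CREST's actual procedure: the component along the steering direction is removed and the result is rescaled to the original norm, so the vector turns away from that direction without changing length. The `strength` parameter and the plain-list vectors are assumptions made for illustration.

```python
import math
from typing import List

def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def _norm(a: List[float]) -> float:
    return math.sqrt(_dot(a, a))

def suppress_along(hidden: List[float], steering: List[float],
                   strength: float = 1.0) -> List[float]:
    """Remove `strength` of the component of `hidden` along `steering`,
    then rescale so the output keeps the norm of the input."""
    steering_norm = _norm(steering)
    if steering_norm == 0:
        return list(hidden)
    unit = [x / steering_norm for x in steering]
    coeff = _dot(hidden, unit)
    projected = [h - strength * coeff * u for h, u in zip(hidden, unit)]
    projected_norm = _norm(projected)
    if projected_norm == 0:
        return projected  # hidden was parallel to steering; nothing is left to rescale
    scale = _norm(hidden) / projected_norm
    return [scale * p for p in projected]

steered = suppress_along([3.0, 4.0], [1.0, 0.0])
```

For the input `[3.0, 4.0]` and steering direction `[1.0, 0.0]`, the first coordinate is zeroed and the second grows from 4 to 5, preserving the original norm of 5.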
[24] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models cs.CLPDF
Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai
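TL;DR: This paper introduces Youtu-LLM, a lightweight yet capable language model pre-trained from scratch to systematically cultivate reasoning and planning abilities. It uses a compact MLA architecture and a STEM-oriented vocabulary, supports a 128k context window, and is trained on a corpus of roughly 11T tokens with a carefully designed "Commonsense-STEM-Agent" curriculum and a multi-stage training strategy. Evaluations show it competes with larger models on general benchmarks and significantly surpasses existing SOTA baselines on agent-specific tasks, setting a new bar for sub-2B models.
Details
Motivation: Typical small models rely on distillation and lack native agentic capabilities. The goal is a compute-efficient lightweight LLM with intrinsic reasoning and planning abilities.
Result: Youtu-LLM achieves performance competitive with larger models on general benchmarks and significantly surpasses existing SOTA baselines on agent-specific tasks, setting a new state of the art for sub-2B models.
Insight: Contributions include: 1) pre-training from scratch to cultivate capabilities systematically rather than relying on distillation; 2) a compact MLA architecture combined with a STEM-oriented vocabulary, giving long-context support with a small memory footprint; 3) a "Commonsense-STEM-Agent" curriculum aimed at deep cognitive abilities; 4) diverse data construction schemes in agentic mid-training that help the model internalize planning and reflection behaviors. Viewed objectively, the systematic combination of architecture design, training curriculum, and data synthesis offers a reusable path for giving lightweight models strong agentic potential.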
Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled “Commonsense-STEM-Agent” Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.
[25] Do Large Language Models Know What They Are Capable Of? cs.CL | cs.AIPDF
Casey O. Barkan, Sid Black, Oliver Sourbut
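TL;DR: This paper studies whether large language models (LLMs) can predict their own probability of success on a given task, and how that ability changes over multi-step tasks. All tested LLMs are overconfident, but most discriminate success from failure better than chance. Newer and larger models generally do not show greater discriminatory power, although Claude models do show such a trend. On multi-step tasks, the overconfidence of several frontier LLMs worsens as the task progresses, and reasoning models perform comparably to or worse than non-reasoning models. With in-context experience of failure, some LLMs reduce their overconfidence and make better decisions, but not all do. Notably, all LLMs' decisions are approximately rational given their estimated success probabilities, yet overly optimistic estimates lead to poor decisions. The results suggest that current LLM agents are held back by a lack of awareness of their own capabilities.
Details
Motivation: To examine whether LLMs can predict their own probability of success and how this affects their decisions in multi-step tasks and in scenarios where failure is costly, in order to assess LLM self-knowledge and its implications for AI misuse and misalignment risks.
Result: All tested LLMs are overconfident, but most predict success with better-than-random discriminatory power. Newer and larger models generally do not show greater discriminatory power (Claude models excepted). On multi-step tasks, the overconfidence of several frontier LLMs worsens, and reasoning models perform comparably to or worse than non-reasoning models. With in-context experience of failure, some LLMs reduce their overconfidence and significantly improve their decisions.
Insight: The novelty is a systematic evaluation of LLM self-knowledge, that is, the ability to predict one's own success probability, which reveals pervasive overconfidence that can worsen over multi-step tasks. Viewed objectively, the study highlights that limited awareness of their own capabilities hinders LLM agents and shows that in-context experience can improve decisions for some models, with clear relevance to AI safety and reliability.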
Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some LLMs reduce their overconfidence, leading to significantly improved decision making, while others do not. Interestingly, all LLMs’ decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs’ awareness of their capabilities for AI misuse and misalignment risks.
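The observation in the abstract above, that decisions can be rational given the model's own estimate and still turn out badly, can be made concrete with a small expected-value sketch. The payoff structure (a fixed reward for success, a fixed cost for failure, zero for declining) and all numbers are hypothetical and are not taken from the paper.

```python
def expected_value(p_success: float, reward: float, failure_cost: float) -> float:
    """Expected payoff of attempting a task with the given success probability."""
    return p_success * reward - (1.0 - p_success) * failure_cost

def should_attempt(p_estimated: float, reward: float, failure_cost: float) -> bool:
    """Rational policy given the agent's own estimate: attempt iff expected value is positive."""
    return expected_value(p_estimated, reward, failure_cost) > 0

# Hypothetical numbers: the agent believes it succeeds 80% of the time,
# but its true success rate on this task is only 40%.
reward, failure_cost = 1.0, 2.0
p_estimated, p_true = 0.8, 0.4

attempts = should_attempt(p_estimated, reward, failure_cost)
realized_value = expected_value(p_true, reward, failure_cost)
```

The policy itself is sound: attempting is correct if the 0.8 estimate were accurate. The loss comes entirely from the gap between `p_estimated` and `p_true`, which is the sense in which overconfidence, not irrationality, drives the poor decisions.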
[26] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models cs.CLPDF
Ákos Prucs, Márton Csutora, Mátyás Antal, Márk Marosi
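TL;DR: This paper analyzes compute-accuracy Pareto frontiers for open-source reasoning LLMs, evaluating the trade-off between performance and computational cost on math- and reasoning-intensive benchmarks. It finds that the Mixture of Experts architecture balances performance and efficiency well, and that inference-time compute has a saturation point.
Details
Motivation: Current work often overlooks the heavy computational burden of generating long reasoning sequences, while model selection in industry depends not only on raw accuracy but also on resource constraints and inference cost. A systematic evaluation of the efficiency-accuracy trade-off is therefore needed.
Result: On math- and reasoning-intensive benchmarks, the Mixture of Experts architecture emerges as a strong candidate for balancing performance and efficiency. The study also finds a saturation point for inference-time compute, beyond which accuracy gains diminish.
Insight: The contribution is a systematic mapping of compute-accuracy Pareto frontiers for open-source LLMs, which highlights the efficiency advantage of the Mixture of Experts architecture and the diminishing returns of inference-time compute, giving a quantitative basis for model selection in industrial settings.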
Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
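A compute-accuracy Pareto frontier of the kind the abstract above maps can be extracted with a single sorted pass, as sketched below. The model names, cost units, and accuracy values are placeholders invented for illustration; the paper's actual cost measure and data are not reproduced here.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (model name, compute cost, accuracy)

def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both cost and accuracy.

    Sort by ascending cost (ties: higher accuracy first) and keep a point
    only if it is strictly more accurate than every cheaper point seen so far.
    """
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier

# Placeholder measurements (hypothetical models and units).
models = [
    ("A", 1.0, 0.50),
    ("B", 2.0, 0.45),  # costs more than A and is less accurate: dominated
    ("C", 3.0, 0.70),
    ("D", 3.0, 0.65),  # same cost as C, less accurate: dominated
    ("E", 5.0, 0.70),  # same accuracy as C at higher cost: dominated
]
frontier_names = [name for name, _, _ in pareto_frontier(models)]
```

Model `E` in this toy data also illustrates saturation: spending more compute than `C` buys no additional accuracy, so it falls off the frontier.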
[27] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements cs.CL | cs.AIPDF
Yiming Liang, Yizhi Li, Yantao Du, Ge Zhang, Jiayi Zhou
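TL;DR: This paper proposes Encyclo-K, a benchmark built from dynamically composed knowledge statements, for evaluating the comprehensive understanding of large language models (LLMs). It extracts standalone knowledge statements from authoritative textbooks and composes them into evaluation questions by random sampling at test time, addressing the limitations of existing benchmarks in data contamination, single-knowledge-point assessment, and expert annotation cost.
Details
Motivation: Existing benchmarks are built mainly at the question level and share three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain-expert annotation. The paper addresses these by rethinking the basic unit of benchmark construction.
Result: Experiments on over 50 LLMs show that Encyclo-K is challenging and highly discriminative. The top-performing OpenAI-GPT-5.1 reaches only 62.07% accuracy, and performance shows a clear gradient: reasoning models range from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%.
Insight: The core novelty is changing the basic unit of benchmark construction from questions to knowledge statements and generating questions through dynamic composition. This directly addresses the three pain points of existing benchmarks and provides a scalable dynamic evaluation framework for testing comprehensive understanding across multiple fine-grained disciplinary knowledge statements. Viewed objectively, the dynamic generation mechanism guards against memorization and lowers annotation cost, making the approach practical and scalable.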
Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks are predominantly curated at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs’ comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
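Composing questions from a statement pool at test time, as the abstract above describes, can be sketched as follows. Only the 8-10 statements per question comes from the abstract; the pool layout (each statement paired with a truth label), the "select the correct statements" question format, and the function name are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Statement = Tuple[str, bool]  # (statement text, whether it is correct); hypothetical layout

def compose_question(pool: List[Statement], rng: random.Random,
                     n_min: int = 8, n_max: int = 10) -> Tuple[List[str], List[int]]:
    """Sample 8-10 statements without replacement and return them together with
    the indices of the correct ones (the answer key)."""
    k = rng.randint(n_min, n_max)
    sampled = rng.sample(pool, k)
    statements = [text for text, _ in sampled]
    answer_key = [i for i, (_, is_correct) in enumerate(sampled) if is_correct]
    return statements, answer_key

# Toy pool of 20 placeholder statements; even-numbered ones are marked correct.
toy_pool = [(f"statement {i}", i % 2 == 0) for i in range(20)]
statements, answer_key = compose_question(toy_pool, random.Random(0))
```

Even this toy pool of 20 statements admits well over a hundred thousand distinct 8-statement subsets, which gives a sense of why memorizing a textbook-scale pool's question space is impractical.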
[28] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline cs.CL | cs.LGPDF
Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi
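TL;DR: This paper proposes ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. It explicitly models the dependency between each LLM step and the final task outcome to estimate textual gradients precisely, and decouples gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps. Experiments show ADOPT is effective and robust on real-world datasets and diverse pipeline structures, outperforming state-of-the-art prompt optimization baselines.
Details
Motivation: Multi-step LLM pipelines invoke large language models several times in a structured sequence and can solve complex tasks effectively, but their performance depends heavily on the prompt used at each step. Missing step-level supervision and inter-step dependencies make joint optimization difficult, and existing end-to-end prompt optimization methods often yield suboptimal or unstable updates under these conditions.
Result: Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
Insight: The novelty lies in explicitly modeling inter-step dependencies for precise text-gradient estimation (analogous to computing analytical derivatives), decoupling gradient estimation from updates to simplify optimization, and using a Shapley-based mechanism to allocate optimization resources adaptively. Viewed objectively, the method breaks joint multi-prompt optimization into more tractable subproblems and brings a game-theoretic concept to resource allocation, a novel and systematic way to approach prompt optimization in multi-step pipelines.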
Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
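The Shapley-based resource allocation mentioned in the abstract above can be illustrated with exact Shapley values over a tiny pipeline. This is a generic sketch of the Shapley idea, not ADOPT's mechanism: the value function (pipeline score gain when a given subset of steps uses improved prompts), the step names, and the proportional budget rule are all hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence

def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings.

    Factorial cost, so only suitable for a handful of pipeline steps.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for player in order:
            extended = coalition | {player}
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {p: total / len(orderings) for p, total in totals.items()}

def allocate_budget(shapley: Dict[str, float], total_budget: float) -> Dict[str, float]:
    """Split an optimization budget in proportion to non-negative Shapley values."""
    clipped = {p: max(v, 0.0) for p, v in shapley.items()}
    norm = sum(clipped.values())
    if norm == 0:
        return {p: total_budget / len(clipped) for p in clipped}
    return {p: total_budget * v / norm for p, v in clipped.items()}

# Hypothetical gains: improving "retrieve" adds 0.1, "reason" adds 0.3,
# and improving both adds an extra 0.2 through their interaction.
GAINS = {frozenset(): 0.0, frozenset({"retrieve"}): 0.1,
         frozenset({"reason"}): 0.3, frozenset({"retrieve", "reason"}): 0.6}

phi = shapley_values(["retrieve", "reason"], lambda s: GAINS[frozenset(s)])
budget = allocate_budget(phi, total_budget=30.0)
```

In this toy case the interaction gain is shared equally, giving Shapley values of 0.2 and 0.4, so the "reason" step receives twice the optimization budget of "retrieve".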
cs.CV [Back]
[29] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments cs.CV | cs.AIPDF
Surya Rayala, Marcos Quinones-Grueiro, Naveeduddin Mohammed, Ashwin T S, Benjamin Goldberg
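TL;DR: This paper proposes a video-based automatic assessment pipeline for evaluating Enter and Clear the Room (ECR) drills for urban warfare in synthetic training environments. Computer vision models extract 2D skeletons, gaze vectors, and movement trajectories from training videos; task-specific metrics then measure psychomotor fluency, situational awareness, and team coordination, and an extended Cognitive Task Analysis (CTA) hierarchy produces overall scores for teamwork and cognition.
Details
Motivation: Objective, scalable automatic performance assessment of complex military drills such as ECR in synthetic training environments remains challenging, and traditional methods rely on costly sensors or subjective human observation.
Result: The approach is demonstrated through a case study of real-world ECR drills, providing actionable, domain-specific metrics that capture individual and team performance. The paper also discusses integrating the analytics into interactive dashboards within systems such as Gamemaster and GIFT to support After Action Reviews.
Insight: The novelty is a purely video-based pipeline that needs no additional hardware. It turns low-level features extracted by computer vision (such as skeletons and gaze) into high-level, task-relevant metrics of cognition and team coordination, and integrates them with existing training analysis frameworks (such as CTA and GIFT), offering a route to scalable assessment in synthetic training environments.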
Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain specific metrics that capture individual and team performance. We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.
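The weighted combination that rolls task-specific metrics up into teamwork and cognition scores, as described in the abstract above, can be sketched as below. This is a minimal illustration, not the authors' implementation: the metric names, the weights, the two-level hierarchy, and the assumption that every metric is already normalized to [0, 1] are all hypothetical.

```python
from typing import Dict

def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the metrics named in `weights` (assumed normalized to [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

# Hypothetical leaf metrics for one drill, each already scaled to [0, 1].
leaf_metrics = {"entry_fluency": 0.8, "gaze_coverage": 0.6, "spacing_consistency": 0.4}

# Hypothetical hierarchy: each higher-level score is a weighted mix of leaf metrics.
hierarchy = {
    "cognition": {"gaze_coverage": 2.0, "entry_fluency": 1.0},
    "teamwork": {"spacing_consistency": 3.0, "entry_fluency": 1.0},
}

scores = {level: weighted_score(leaf_metrics, w) for level, w in hierarchy.items()}
```

Keeping the weights in a plain dictionary makes the scoring rule easy to inspect and adjust, which matters when the output is meant to support a human-led After Action Review.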
[30] Pretraining Frame Preservation in Autoregressive Video Memory Compression cs.CVPDF
Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu
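TL;DR: This paper presents PFP, a neural network structure that compresses long videos into short contexts, with an explicit pretraining objective of preserving the high-frequency details of single frames at arbitrary temporal positions. The baseline model compresses a 20-second video into a context of about 5k length in which random frames keep their perceptual appearance. Such pretrained models can be fine-tuned directly as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss.
Details
Motivation: When compressing long videos into short contexts, the high-frequency details of single frames at arbitrary time points must be preserved effectively in order to support long history memory in autoregressive video models.
Result: The framework is evaluated with ablative settings, and the trade-offs of different neural architecture designs are discussed. The baseline model preserves the perceptual appearance of random frames when compressing a 20-second video into a context of about 5k length.
Insight: The novelty is an explicit pretraining objective for preserving single-frame high-frequency detail, which lets the model be fine-tuned directly as a memory encoder for autoregressive video models and balances context cost against fidelity loss.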
Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
[31] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale cs.CVPDF
Charith Wickrema, Eliza Mace, Hunter Brown, Heidys Cabrera, Nick Krall
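TL;DR: This paper studies the scaling behavior of foundation models trained on high-resolution electro-optical (EO) remote sensing data, using over a quadrillion pixels of commercial satellite imagery to explore the trade-offs among model capacity, training compute, and dataset size. It finds that even at this scale performance remains limited by data rather than by model parameters, which offers practical guidance for future remote sensing foundation models.
Details
Motivation: To establish scaling relationships for the high-value remote sensing domain comparable to those known for natural images, in order to guide the training of large, domain-specialized foundation models, given that the joint scaling of model, compute, and data is poorly understood in this domain.
Result: Using the MITRE Federal AI Sandbox and over a quadrillion pixels of EO data, the authors train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across remote sensing modalities. Performance is consistent with a data-limited regime rather than a model-parameter-limited one.
Insight: The claimed novelty is scaling experiments at petascale in the remote sensing domain, which reveal the key role of data limits. Viewed objectively, the practical insights (on data-collection strategies, compute budgets, and optimization schedules) are useful for advancing frontier-scale remote sensing foundation models, particularly for adaptation across modalities.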
Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.
[32] Learning to learn skill assessment for fetal ultrasound scanning cs.CVPDF
Yipei Wang, Qianye Yang, Lior Drukker, Aris T. Papageorghiou, Yipeng Hu
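TL;DR: This paper proposes a novel bi-level optimisation framework for assessing fetal ultrasound scanning skill. It assesses skill by how well a task is performed on the acquired fetal ultrasound images, without relying on manually predefined skill ratings.
Details
Motivation: Traditional ultrasound skill assessment relies on expert supervision and feedback, which is subjective and time-consuming. Existing automated methods mostly use supervised learning and are limited to predefined influencing factors; this work aims to overcome those limitations.
Result: The method is validated on real-world clinical ultrasound videos of fetal head scanning. The results demonstrate the feasibility of predicting ultrasound skill with the framework, which quantifies optimised task performance as a skill indicator.
Insight: The novelty is a bi-level optimisation framework that jointly optimises a clinical task predictor and a skill predictor, avoiding manually defined skill ratings and enabling objective, quantitative skill assessment based on task performance.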
Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.
[33] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation cs.CVPDF
Yulong Zou, Bo Liu, Cun-Jing Zheng, Yuan-ming Geng, Siyue Li
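TL;DR: This paper proposes MGML, a plug-and-play meta-guided multi-modal learning framework for brain tumor segmentation from incomplete multimodal MRI data. It comprises two modules, meta-parameterized adaptive modality fusion and consistency regularization, and aims to make maximal use of incomplete modality information to improve segmentation performance.
Details
Motivation: In clinical practice, multimodal MRI data are often incomplete, which makes it hard to fully exploit the available information. Using incomplete multimodal information effectively is therefore a key research challenge.
Result: Extensive experiments on BraTS2020 and BraTS2023 show better performance than multiple prior state-of-the-art methods. On BraTS2020, the average Dice scores over fifteen missing-modality combinations reach 87.55, 79.36, and 62.67 for the whole tumor, tumor core, and enhancing tumor, respectively.
Insight: The novelties are meta-parameterized adaptive modality fusion, which generates adaptive soft-label supervision signals from the available modalities and explicitly promotes more coherent multimodal fusion, and a consistency regularization module, which implicitly strengthens robustness and generalization. The method leaves the original model architecture unchanged and can be integrated into the training pipeline for end-to-end optimization.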
Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and a consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. On BraTS2020, averaged across fifteen missing-modality combinations and building upon the baseline, our method obtained Dice scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
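The "fifteen missing modality combinations" follow from BraTS providing four MRI sequences: every non-empty subset of available sequences gives 2^4 - 1 = 15 settings. The following is a minimal sketch, not taken from the MGML repository, that enumerates those settings and computes a Dice score on toy binary masks. The function names and the toy masks are illustrative assumptions.

```python
from itertools import combinations

# The four MRI sequences shipped with BraTS; used here only as labels.
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def available_modality_settings(modalities=MODALITIES):
    """Return every non-empty subset of modalities (2**n - 1 settings)."""
    settings = []
    for size in range(1, len(modalities) + 1):
        settings.extend(combinations(modalities, size))
    return settings


def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


settings = available_modality_settings()
print(len(settings))  # 15
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

A reported "average Dice across fifteen combinations" would then be the mean of the per-setting Dice scores, computed separately for each tumor region.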
[34] Kinematic-Based Assessment of Surgical Actions in Microanastomosis cs.CVPDF
Yan Meng, Daniel Donoho, Marcelle Altshuler, Omar Arnaout
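TL;DR: This paper proposes an AI framework for automated action segmentation and skill assessment in microvascular anastomosis. The system comprises three modules: object tip tracking, unsupervised action segmentation based on a self-similarity matrix, and supervised skill classification. Together they enable objective analysis and assessment of fine-grained actions in surgical videos.
Details
Motivation: Assessment of microanastomosis has traditionally relied on subjective expert ratings, which suffer from high inter-rater variability, inconsistency, and heavy time demands. An automated, scalable, and objective assessment approach is therefore needed.
Result: On a dataset of 58 expert-rated videos, the method achieves a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations.
Insight: The novelty lies in combining object tracking (YOLO + DeepSORT), unsupervised action boundary detection based on a self-similarity matrix, and supervised skill classification into an end-to-end assessment framework that runs efficiently on edge computing platforms and provides real-time, objective, quantitative feedback on surgical skill.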
Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
[35] U-Net-Like Spiking Neural Networks for Single Image Dehazing cs.CVPDF
Huibin Li, Haoran Liu, Mingzhe Liu, Yulong Xiao, Peng Li
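TL;DR: This paper proposes DehazeSNN, a new image dehazing architecture that combines a U-Net structure with spiking neural networks (SNNs) to capture multi-scale features efficiently and handle both local and long-range dependencies. An Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, and the method achieves dehazing performance comparable to state-of-the-art methods on multiple benchmark datasets at lower computational cost.
Details
Motivation: Existing CNN-based methods struggle to model long-range dependencies, while Transformer-based methods are computationally expensive. This work aims to overcome both limitations with a dehazing method that analyzes image features effectively and remains computationally efficient.
Result: Extensive experiments on benchmark datasets show that DehazeSNN is highly competitive with state-of-the-art (SOTA) methods, producing high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations (MACs).
Insight: The main novelty is combining a U-Net structure with SNNs for image dehazing and designing the OLIFBlock to improve feature interaction. Viewed objectively, the core contribution is exploring the potential of SNNs for dense prediction tasks such as image dehazing and designing an efficient architecture suited to their characteristics, which offers a new direction for low-power visual processing.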
Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.
[36] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models cs.CVPDF
Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan
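TL;DR: This paper proposes T2VAttack, an adversarial attack framework targeting text-to-video (T2V) diffusion models. The study evaluates model vulnerability along semantic and temporal dimensions and proposes two attack methods: T2VAttack-S, which uses greedy search to replace semantically or temporally critical words in the prompt with synonyms, and T2VAttack-I, which iteratively inserts optimized words with minimal perturbation. A comprehensive evaluation on several SOTA T2V models shows that even minor prompt modifications can substantially degrade the generated video.
Details
Motivation: Although T2V diffusion models have made remarkable progress in generating high-quality, temporally coherent videos, their vulnerability to adversarial attacks remains underexplored. This paper aims to fill that gap by systematically studying the adversarial robustness of T2V models along two key dimensions, semantic and temporal.
Result: A comprehensive evaluation was conducted on several state-of-the-art T2V models (ModelScope, CogVideoX, Open-Sora, and HunyuanVideo). The experiments show that even minor prompt modifications, such as substituting or inserting a single word, cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current models.
Insight: The novelty is what the summary describes as the first systematic study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives, together with two efficient attack strategies. Viewed objectively, it extends adversarial evaluation from static images or text to dynamic video, with particular attention to temporal coherence, a property specific to video, and offers a new perspective and methodology for understanding and improving the robustness of video generation models.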
Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.
[37] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation cs.CVPDF
Yuang Jia, Jinlong Wang, Jiayi Zhao, Chunlam Li, Shunzhou Wang
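TL;DR: This paper proposes DriveExplorer, an effective solution for view extrapolation in autonomous driving scenes. Using only images and optional camera poses, the method first estimates a global static point cloud and per-frame dynamic point clouds and fuses them into a unified representation, then reconstructs the scene with a deformable 4D Gaussian framework. By alternating 4D Gaussian rendering with iterative refinement by a video diffusion model, it progressively generates high-quality images at extrapolated viewpoints.
Details
Motivation: Existing methods typically rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which require expensive sensors or manual labeling and limit applicability in real-world deployment. This work aims to solve view extrapolation in driving scenes using only images and optional camera poses, reducing dependence on costly or annotated data.
Result: Compared with baseline methods, the approach produces higher-quality images at novel extrapolated viewpoints.
Insight: The novelty is an images-only decoupled 4D reconstruction method that combines a deformable 4D Gaussian framework with a progressive refinement strategy driven by a video diffusion model. An iterative render-and-enhance loop gradually improves image quality at extrapolated viewpoints while avoiding heavy reliance on external priors.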
Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.
[38] Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis cs.CVPDF
Hao Wu, Hui Li, Yiyun Su
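TL;DR: This paper proposes Hilbert-VLM, a novel two-stage fusion framework that addresses the challenges VLM-based automated medical diagnosis faces with complex 3D multimodal medical images. The framework first uses the HilbertMed-SAM module for precise lesion segmentation, then uses the resulting multimodal enhanced prompts to guide the VLM toward accurate disease classification.
Details
Motivation: Current visual language models struggle to integrate complementary information when processing complex 3D multimodal medical images and may overlook subtle yet critical pathological features. The work aims to improve the accuracy and reliability of medical diagnosis.
Result: On the BraTS2021 segmentation benchmark, the model achieves a Dice score of 82.35% and a disease classification accuracy (ACC) of 78.85%, demonstrating its effectiveness for medical image analysis.
Insight: The core novelty is a systematic redesign of the SAM2 architecture: Hilbert space-filling curves are integrated into the scanning mechanism of the Mamba state space model to preserve as much spatial locality in 3D data as possible, and a novel Hilbert-Mamba Cross-Attention mechanism and a scale-aware decoder capture fine-grained details. In addition, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference.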
Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
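The locality argument for Hilbert-curve scanning is easiest to see in 2D: consecutive indices along the curve always land on neighbouring cells, whereas a row-major (raster) scan jumps across the grid at the end of every row. The sketch below uses the classic 2D index-to-coordinate conversion. It is only an illustration: the paper works on 3D volumes, and its actual scanning implementation is not described in the abstract. The helper `max_step` is a hypothetical name introduced here.

```python
def hilbert_d2xy(order, d):
    """Map index d along a 2D Hilbert curve to (x, y) on a 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def max_step(path):
    """Largest Manhattan distance between consecutive cells of a scan path."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(path, path[1:]))


order = 3
side = 1 << order
hilbert_path = [hilbert_d2xy(order, d) for d in range(side * side)]
raster_path = [(x, y) for y in range(side) for x in range(side)]

print(max_step(hilbert_path))  # 1
print(max_step(raster_path))   # 8
```

When a sequence model such as an SSM consumes tokens in this order, spatial neighbours tend to stay close in the sequence, which is the property the abstract highlights for medical volumes.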
[39] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing cs.CV | cs.AIPDF
Yunkai Dang, Donghao Wang, Jiacheng Yang, Yifan Jiang, Meiyi Zhu
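TL;DR: This paper proposes MF-RSVLM, a multi-feature fusion remote sensing vision-language model, to address two problems in existing remote sensing VLMs: failure to extract fine-grained visual features and visual forgetting during deep language processing. The model extracts and fuses multi-scale visual representations, combines global context with local detail, and uses a recurrent visual feature injection scheme to improve understanding of complex structures in remote sensing scenes and reduce visual forgetting.
Details
Motivation: Existing large vision-language models face challenges in remote sensing because remote sensing images differ inherently from natural images. As a result, models struggle to extract fine-grained visual features and tend to forget visual information during language generation.
Result: Extensive experiments on multiple remote sensing benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance on remote sensing image classification, image captioning, and visual question answering.
Insight: The novelty is a multi-feature fusion architecture that, through multi-scale visual representation learning and recurrent visual feature injection, combines global and local information and mitigates visual forgetting, offering a new approach to vision-language modeling for remote sensing.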
Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision–Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.
[40] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations cs.CV | cs.AIPDF
Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong
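TL;DR: This paper proposes RSAgent, an agent built on a multimodal large language model (MLLM) that interleaves reasoning and action through multi-turn tool invocations. It addresses the limitation of one-shot grounding in text-guided object segmentation and enables target re-localization and iterative mask refinement.
Details
Motivation: Existing text-guided segmentation methods typically treat the task as one-shot grounding and cannot verify, refocus, or refine when the initial localization is wrong. A method capable of interactive reasoning and correction is therefore needed.
Result: RSAgent reaches a zero-shot 66.5% gIoU on the ReasonSeg test set, a 9% improvement over Seg-Zero-7B, and 81.5% cIoU on RefCOCOg, achieving state-of-the-art results on both in-domain and out-of-domain benchmarks.
Insight: The novelty is modeling text-guided segmentation as a multi-turn interactive process that combines visual feedback from a segmentation toolbox with historical observations for iterative refinement. Synthesized multi-turn reasoning trajectories and two-stage training (supervised fine-tuning followed by reinforcement learning) improve performance and illustrate how agents can self-correct through tool use in vision tasks.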
Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
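The abstract reports gIoU on ReasonSeg and cIoU on RefCOCOg without defining them. As these metrics are commonly used in reasoning-segmentation evaluation, gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection divided by the cumulative union over the whole set. The sketch below assumes those definitions and uses toy flat masks; the function names and the empty-union convention are illustrative assumptions, not taken from the paper.

```python
def intersection_and_union(pred, target):
    """Pixel counts of intersection and union for flat binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter, union


def giou_and_ciou(pairs):
    """Return (gIoU, cIoU) for a list of (pred, target) mask pairs."""
    per_image_ious = []
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: an empty union counts as a perfect match.
        per_image_ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_ious) / len(per_image_ious)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# One small object that is missed entirely, one large object that is perfect.
small_missed = ([0, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0])
large_perfect = ([1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0])

giou, ciou = giou_and_ciou([small_missed, large_perfect])
print(giou, ciou)  # 0.5 0.75
```

The toy example shows why the two numbers can diverge: cIoU is dominated by large objects, while gIoU weights every image equally.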
[41] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing cs.CV | cs.AIPDF
Mustafa Munir, Md Mostafijur Rahman, Kartikeya Bhardwaj, Paul Whatmough, Radu Marculescu
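TL;DR: PipeFlow is a scalable, pipelined method for long-form video editing. It uses motion analysis to skip low-motion frames, processes video segments in parallel, and applies neural-network-based interpolation, which substantially improves editing efficiency and makes editing time grow linearly with video length.
Details
Motivation: Joint editing and DDIM inversion make computational cost grow exponentially in long-form video editing. The work aims to remove this bottleneck and improve scalability and efficiency.
Result: In experiments, PipeFlow achieves up to a 9.6x speedup over TokenFlow and a 31.7x speedup over Diffusion Motion Transfer (DMT).
Insight: The novelties are motion analysis based on SSIM and optical flow to skip low-motion frames, a pipelined task scheduling algorithm for parallel processing, and neural-network interpolation that smooths segment boundaries and fills in skipped frames. Together these give editing cost that scales linearly with video length.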
Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow’s editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
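The first of the three ideas, skipping frames with low motion, can be sketched with SSIM alone. The code below is a simplified illustration under stated assumptions: it computes a single-window (global) SSIM rather than the usual sliding-window version, it ignores optical flow, and the `threshold` value and function names are hypothetical rather than PipeFlow's actual settings.

```python
def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale frames.

    Frames are flat lists of pixel intensities. Real pipelines usually
    average SSIM over local windows; this global form is a simplification.
    """
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def select_frames_to_edit(frames, threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame."""
    keep = [0]  # always edit the first frame
    for idx in range(1, len(frames)):
        if global_ssim(frames[keep[-1]], frames[idx]) < threshold:
            keep.append(idx)
    return keep


frames = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],   # identical to frame 0: low motion, skipped
    [200, 10, 150, 5],  # large change: edited
    [200, 10, 150, 5],  # identical to frame 2: skipped
]
print(select_frames_to_edit(frames))  # [0, 2]
```

In the method described by the abstract, the skipped frames are later filled in by neural-network-based interpolation rather than being edited directly.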
[42] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising cs.CVPDF
Xinran Qin, Yuhui Quan, Ruotao Xu, Hui Ji
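TL;DR: This paper proposes a trainable anisotropic diffusion framework based on reinforcement learning for image denoising. It models denoising as a sequence of simple diffusion actions whose order is learned by deep Q-learning, forming a stochastic anisotropic diffusion process that adapts to different image structures.
Details
Motivation: Traditional anisotropic diffusion methods use explicit diffusion operators that adapt poorly to complex image structures and underperform learning-based methods. A learnable diffusion framework is needed to improve denoising.
Result: Experiments show that the method outperforms existing diffusion-based methods at removing three common noise types and is competitive with representative deep CNN-based methods.
Insight: The novelty lies in bringing reinforcement learning into anisotropic diffusion: deep Q-learning dynamically selects the diffusion actions, yielding an adaptive stochastic diffusion process that improves the flexibility and performance of traditional diffusion methods.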
Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations together compose a stochastic anisotropic diffusion process that adapts strongly to different image structures and improves on traditional anisotropic diffusion. The proposed denoiser is applied to removing three commonly seen types of noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.
[43] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval cs.CV | cs.MMPDF
Yizhi Liu, Ruitao Pu, Shilin Xu, Yingke Chen, Quan-Hui Liu
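TL;DR: This paper proposes NIRNL, a robust cross-modal retrieval framework that addresses the impact of noisy labels in multimodal data on retrieval performance. It uses Cross-modal Margin Preserving (CMP) to enhance discrimination between sample pairs and Neighbor-aware Instance Refining (NIR) to partition the data into pure, hard, and noisy subsets, with a tailored optimization strategy for each subset to maximize data utilization and reduce error propagation.
Details
Motivation: Annotating large-scale multimodal data is time-consuming and labor-intensive, so the labels inevitably contain noise, which degrades cross-modal retrieval models. Existing robust methods (robust learning paradigms, label calibration strategies, and instance selection mechanisms) often fail to satisfy model performance ceilings, calibration reliability, and data utilization at the same time.
Result: Extensive experiments on three benchmark datasets show that NIRNL achieves state-of-the-art performance and remarkable robustness, especially under high noise rates.
Insight: The novelty lies in combining CMP, which adjusts the relative distance between positive and negative pairs, with NIR, which partitions the data at a fine granularity through cross-modal neighborhood consensus, and in designing a tailored optimization strategy for each subset. This improves robustness while making full use of all available data.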
Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure subset, hard subset, and noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.
[44] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention cs.CVPDF
Aiyue Chen, Yaofu Liu, Junjian Huang, Guang Lian, Yiwu Yao
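TL;DR: This paper proposes RainFusion2.0, an online-adaptive, hardware-efficient, low-overhead sparse attention mechanism for accelerating Diffusion Transformer (DiT) models for video and image generation. It reduces attention cost by using block-wise means as representative tokens for sparse mask prediction, applying spatiotemporal-aware token permutation, and introducing a first-frame sink mechanism for video generation.
Details
Motivation: DiT models incur very high computational cost from attention in video and image generation. Existing sparse attention methods are limited by the overhead of sparse pattern prediction and by poor hardware generality, since most are designed for GPUs. The goal is efficient inference across diverse hardware platforms, including ASICs.
Result: RainFusion2.0 reaches 80% sparsity and a 1.5x to 1.8x end-to-end speedup without loss of video quality. The method is effective across various generative models, and its generalization across diverse hardware platforms is validated.
Insight: The claimed novelties are low-overhead sparse mask prediction using block-wise means as representative tokens, a spatiotemporal-aware token permutation strategy, and a first-frame sink mechanism dedicated to video generation. Viewed objectively, the core contribution is tying sparse attention closely to hardware efficiency and cross-platform generality, using online adaptation and block-level operations to balance performance, overhead, and generalization.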
Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing units (GPUs), such as application-specific integrated circuits (ASICs), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPUs. To address these challenges, this study proposes RainFusion2.0, an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism that accelerates both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can reach 80% sparsity while delivering an end-to-end speedup of 1.5x to 1.8x without compromising video quality. Moreover, RainFusion2.0 is effective across various generative models and generalizes across diverse hardware platforms.
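Insight (1), block-wise means as representative tokens, can be sketched as follows: average each block of queries and keys, score block pairs with a dot product of the means, and keep only the highest-scoring key blocks for each query block. The per-row top-k selection rule, the `keep_ratio` parameter, and the function names are assumptions made for illustration; the abstract does not specify how RainFusion2.0 turns block scores into a mask.

```python
def block_means(tokens, block_size):
    """Average consecutive groups of block_size token vectors."""
    means = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        dim = len(block[0])
        means.append([sum(tok[d] for tok in block) / len(block)
                      for d in range(dim)])
    return means


def predict_block_mask(queries, keys, block_size, keep_ratio):
    """Boolean mask[i][j]: should query block i attend to key block j?"""
    q_means = block_means(queries, block_size)
    k_means = block_means(keys, block_size)
    keep = max(1, round(keep_ratio * len(k_means)))
    mask = []
    for q in q_means:
        # Cheap proxy for block importance: dot product of block means.
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_means]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j],
                        reverse=True)
        kept = set(ranked[:keep])
        mask.append([j in kept for j in range(len(k_means))])
    return mask


# Toy input: 20 two-dimensional tokens, blocks of 2 tokens -> 10 blocks.
tokens = [[float(i % 5), float((i * 7) % 3)] for i in range(20)]
mask = predict_block_mask(tokens, tokens, block_size=2, keep_ratio=0.2)

kept_blocks = sum(sum(row) for row in mask)
total_blocks = len(mask) * len(mask[0])
print(1 - kept_blocks / total_blocks)  # fraction of block pairs skipped
```

With `keep_ratio=0.2`, each query block attends to 2 of 10 key blocks, which corresponds to the 80% sparsity level quoted in the abstract. The prediction cost scales with the number of blocks rather than the number of tokens, which is what keeps the overhead low.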
[45] Factorized Learning for Temporally Grounded Video-Language Models cs.CV | cs.AI | cs.CL | cs.MMPDF
Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng
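TL;DR: This paper proposes D²VLM, a factorized learning approach that addresses inaccurate temporal grounding in event-level perception by video-language models. It decouples the learning of temporal grounding and textual response while emphasizing the hierarchical dependency between them, and introduces evidence tokens and a factorized preference optimization algorithm to improve performance.
Details
Motivation: Existing video-language models typically handle temporal grounding and textual response in a coupled manner without a clear logical hierarchy, which leads to sub-optimal objectives and makes accurate event-level temporal grounding difficult.
Result: Experiments on various tasks show a clear advantage for the method. The abstract does not say which benchmarks were used or what level of performance (e.g., SOTA) was reached.
Insight: The novelties are a "grounding then answering with evidence referencing" paradigm, evidence tokens that go beyond the timestamp-representation focus of existing work, and a factorized preference optimization (FPO) algorithm that explicitly incorporates probabilistic temporal grounding modeling into the optimization objective.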
Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a “grounding then answering with evidence referencing” paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.
[46] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation cs.CVPDF
Yijie Qian, Juncheng Wang, Yuxiang Feng, Chao Xu, Wang Lu
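TL;DR: This paper proposes Latent Motion Reasoning (LMR), a new method for text-to-motion (T2M) generation. It recasts generation as a two-stage "think-then-act" decision process and disentangles motion into a semantically rich reasoning latent space and a high-frequency execution latent space to resolve the impedance mismatch between semantics and kinematics.
Details
Motivation: Current T2M methods map text directly to continuous poses and face a theoretical bottleneck the authors call the "Semantic-Kinematic Impedance Mismatch": the difficulty of grounding semantically dense, discrete linguistic intent into high-frequency motion data in a single shot.
Result: LMR is implemented on two representative baselines, T2M-GPT and MotionStreamer. Experiments show non-trivial improvements in both semantic alignment and physical plausibility, supporting the claim that the best substrate for motion planning is a learned, motion-aligned concept space.
Insight: The novelty is drawing on hierarchical motor control from cognitive science: a dual-granularity tokenizer disentangles motion into two latent manifolds, and autoregressive reasoning plans a coarse trajectory before instantiating frames, which bridges the "ineffability gap" between language and physics.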
Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR) that reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR’s versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Code and demos can be found at https://chenhaoqcdyq.github.io/LMR/.
[47] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation cs.CVPDF
Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen
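TL;DR: This paper presents GeoBench, a hierarchical benchmark for evaluating the geometric problem-solving ability of vision-language models (VLMs). Through four reasoning levels (visual perception, goal-oriented planning, rigorous theorem application, and self-reflective backtracking) and six formally verified tasks, it systematically assesses capabilities from attribute extraction to logical error correction. Experiments show that reasoning models such as OpenAI-o3 outperform general multimodal large language models (MLLMs), but performance drops significantly as task complexity increases. The key findings are that sub-goal decomposition and irrelevant premise filtering are critical to final accuracy, while Chain-of-Thought prompting unexpectedly degrades performance on some tasks.
Details
Motivation: Current evaluations of geometric reasoning are limited by the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. The authors aim to build a more comprehensive, hierarchical benchmark to address these issues.
Result: On GeoBench, reasoning models such as OpenAI-o3 outperform general MLLMs, but all models degrade significantly as task complexity increases. A key finding is that sub-goal decomposition and irrelevant premise filtering strongly influence problem-solving accuracy.
Insight: The main novelty is a hierarchical, diagnostic benchmark for geometric problem solving that, through four explicit reasoning levels and formally verified tasks, offers a finer-grained evaluation than existing benchmarks. Viewed objectively, both the decomposition of geometric reasoning into distinct cognitive levels for systematic evaluation and the finding that Chain-of-Thought prompting can hurt on complex geometric tasks are worth borrowing.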
Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
[48] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design cs.CV | cs.AIPDF
Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte
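TL;DR: This paper addresses automated neural network architecture design for computer vision with an architecture generation approach based on large language models (LLMs). Its core contributions are the first systematic study of how the number of supporting examples (n = 1 to 6) in Few-Shot Architecture Prompting (FSAP) affects generation, which finds that n = 3 best balances architectural diversity and task focus, and a lightweight deduplication method, Whitespace-Normalized Hash Validation, that is 100x faster than AST parsing. The work generates 1,900 unique architectures across seven vision benchmarks and introduces a dataset-balanced evaluation methodology for comparing across heterogeneous tasks.
Details
Motivation: Automated neural network architecture design faces the twin challenges of task diversity and limited compute. Traditional neural architecture search (NAS) is computationally expensive, and LLMs are a promising alternative, but their use for architecture generation in computer vision, particularly prompt engineering and validation strategies, has not been systematically studied.
Result: Large-scale experiments on seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365) generated 1,900 unique architectures. The lightweight deduplication method validates in under 1 ms, 100x faster than AST parsing. The few-shot prompting study shows that n = 3 examples best balances architectural diversity and context focus for vision tasks.
Insight: The novelties are the first systematic study of how the number of few-shot examples affects LLM-based vision architecture generation, which gives concrete prompt-engineering guidance (n = 3); a highly efficient deduplication method that sharply reduces computational overhead; and a dataset-balanced evaluation methodology that establishes rigorous practice for fair comparison across heterogeneous vision tasks. Together these make automated design more accessible to researchers with limited compute.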
Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
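The deduplication idea in this abstract can be illustrated with a short sketch. The abstract does not give the exact normalization rules, so the version below is an assumption: it collapses every run of whitespace to a single space and hashes the result with SHA-256. The function names `normalized_hash` and `deduplicate` are hypothetical, not taken from the paper. Note that collapsing indentation can, in principle, merge Python programs that differ only in block structure, which is the kind of trade-off a string-level check accepts in exchange for skipping AST parsing.

```python
import hashlib
import re


def normalized_hash(source: str) -> str:
    """Hash source code after collapsing each whitespace run to one space.

    Assumption: this normalization rule is illustrative; the paper's exact
    rule is not stated in the abstract.
    """
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(candidates):
    """Keep the first occurrence of each whitespace-normalized candidate."""
    seen = set()
    unique = []
    for code in candidates:
        key = normalized_hash(code)
        if key not in seen:
            seen.add(key)
            unique.append(code)
    return unique


# Two candidates that differ only in whitespace, plus one real variant.
a = "class Net:\n    def forward(self, x):\n        return x\n"
b = "class Net:\n  def forward(self, x):\n\n      return x"
c = "class Net:\n    def forward(self, x):\n        return x * 2\n"

unique = deduplicate([a, b, c])
print(len(unique))  # 2: `b` is dropped as a duplicate of `a`
```

A set lookup on a fixed-length digest keeps the per-candidate cost low, so duplicates can be rejected before any training is scheduled.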
[49] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning cs.CVPDF
Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen
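TL;DR: This paper targets Preference Mode Collapse (PMC), which arises when text-to-image diffusion models are aligned with Reinforcement Learning from Human Feedback (RLHF), and proposes a new framework called D^2-Align. PMC occurs when a model over-optimizes along the reward model's inherent biases, producing images with a uniform style and severely reduced diversity. The authors first quantify PMC and propose DivGenBench, a benchmark that measures its extent, then mitigate PMC by directionally correcting the reward signal during learning, improving diversity while preserving generation quality.
Details
Motivation: Existing methods score highly on automated reward metrics but lead to Preference Mode Collapse (PMC), in which the model converges on narrow, high-scoring outputs (for example, a single style or pervasive overexposure), severely harming generative diversity. The paper aims to identify, quantify, and address this problem.
Result: In a comprehensive evaluation on the proposed DivGenBench benchmark, combining qualitative and quantitative measures of quality and diversity, D^2-Align achieves better alignment with human preference while maintaining diversity.
Insight: The innovation is the first systematic identification and quantification of preference mode collapse in RLHF for diffusion models, together with Directional Decoupling Alignment, which learns a directional correction in the reward model's embedding space and applies it during optimization to balance reward optimization against generative diversity. This suggests a new direction for RLHF methods that avoid reward hacking and mode collapse.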
Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking in which models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model’s inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model’s embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.
[50] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset cs.CVPDF
TsaiChing Ni, ZhenQi Chen, YuanFu Yang
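TL;DR: This paper presents IMDD-1M, the first large-scale industrial multimodal defect dataset, with 1 million aligned image-text pairs covering more than 60 material categories and more than 400 defect types. Building on it, the authors train from scratch a diffusion-based vision-language foundation model tailored to industrial scenarios. The model adapts efficiently to specific domains through lightweight fine-tuning and reaches comparable performance with less than 5% of the task data required by expert models, paving the way for scalable, domain-adaptive, knowledge-grounded manufacturing intelligence.
Details
Motivation: Address the lack of large-scale, high-quality multimodal datasets for industrial defect understanding, and advance multimodal learning for manufacturing and quality inspection.
Result: On industrial inspection and generation tasks, the foundation model reaches comparable performance with less than 5% of the task-specific data required by expert models, demonstrating the potential of data-efficient foundation model adaptation.
Insight: The innovation is building IMDD-1M, the first large-scale industrial multimodal defect dataset, and training a diffusion-based vision-language foundation model tailored to industrial scenarios. Lightweight fine-tuning then adapts the model efficiently to different industrial domains, lowering data requirements and improving generalization.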
Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
[51] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models cs.CVPDF
Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang
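TL;DR: This paper proposes a new generative multimodal reasoning paradigm and introduces DiffThinker, a diffusion-based reasoning framework. It reformulates multimodal reasoning as a native generative image-to-image task, aiming to address the poor performance of existing multimodal large language models (MLLMs) on complex, long-horizon, vision-centric tasks.
Details
Motivation: The reasoning process of current MLLMs remains text-centric, which leads to suboptimal performance on complex, long-horizon, vision-centric tasks. The paper aims to establish a new generative multimodal reasoning paradigm that overcomes this limitation.
Result: Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) show that DiffThinker significantly outperforms leading closed-source models such as GPT-5 and Gemini-3-Flash, as well as the fine-tuned Qwen3-VL-32B baseline, by large margins (for example, +314.2% over GPT-5).
Insight: The core innovation is reformulating multimodal reasoning as a native image generation task, using diffusion models to achieve better logical consistency and spatial precision. The paradigm exhibits four core properties (efficiency, controllability, native parallelism, and collaboration) and offers a promising new approach to vision-centric reasoning.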
Abstract: While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
[52] Guiding a Diffusion Transformer with the Internal Dynamics of Itself cs.CV | cs.LGPDF
Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu
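TL;DR: This paper proposes Internal Guidance (IG), a new strategy for improving the generation quality of Diffusion Transformers. It adds auxiliary supervision on an intermediate layer during training and, during sampling, extrapolates the outputs of the intermediate and deep layers to guide generation, significantly improving training efficiency and generation performance without extra training or additional sampling steps.
Details
Motivation: Methods such as standard classifier-free guidance (CFG) often produce over-simplified or distorted samples, while guiding a model with a "bad version" of itself is limited by carefully designed degradation strategies, extra training, and additional sampling steps. The paper aims for a simpler and more effective guidance strategy that overcomes these limitations.
Result: On the ImageNet 256x256 benchmark, SiT-XL/2+IG reaches FID=5.31 and FID=1.75 after 80 and 800 training epochs, respectively; LightningDiT-XL/1+IG reaches FID=1.34, significantly better than the other methods. Combined with CFG, LightningDiT-XL/1+IG achieves a state-of-the-art (SOTA) FID of 1.19.
Insight: The innovation is guiding generation with the model's own internal dynamics (its intermediate-layer outputs), with no extra training or elaborate degradation strategy. Auxiliary supervision plus an extrapolation mechanism improves generation quality and suggests a new approach to efficient training and sampling of diffusion models.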
Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with a bad version of itself is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during training and extrapolates the intermediate and deep layers’ outputs to obtain generative results during sampling. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
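To make the sampling-time extrapolation concrete, here is a minimal sketch. The abstract does not state the exact formula, so the rule below is an assumption modeled on the familiar CFG form: start from the intermediate-layer prediction and move past the deep-layer prediction by a guidance scale. The function name `internal_guidance` and the parameter `scale` are hypothetical.

```python
def internal_guidance(intermediate, deep, scale):
    """Extrapolate from the intermediate-layer output through the deep one.

    Assumed rule (CFG-style, not confirmed by the abstract):
        guided = intermediate + scale * (deep - intermediate)
    scale == 1 returns the deep output unchanged; scale > 1 extrapolates
    beyond it, away from the weaker intermediate prediction.
    """
    return [i + scale * (d - i) for i, d in zip(intermediate, deep)]


# Toy per-element predictions from the two heads at one sampling step.
intermediate_out = [0.0, 1.0]
deep_out = [1.0, 3.0]

print(internal_guidance(intermediate_out, deep_out, 1.0))  # [1.0, 3.0]
print(internal_guidance(intermediate_out, deep_out, 1.5))  # [1.5, 4.0]
```

Under this assumed form, both predictions come from a single forward pass of the same network, which is consistent with the abstract's claim that no additional sampling steps are needed.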
[53] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes cs.CVPDF
Shuyun Wang, Haiyang Sun, Bing Wang, Hangjun Ye, Xin Yu
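TL;DR: This paper proposes Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. It builds on a text-to-video diffusion prior to ensure temporal consistency, restores detail by injecting temporally agnostic latents from a pretrained 2D encoder, and uses a two-stage data alignment strategy to resolve distribution mismatch.
Details
Motivation: Existing video object editing methods struggle to maintain both high visual fidelity and temporal coherence, while vision-centric autonomous driving systems need diverse and scalable training data. A method for photorealistic, coherent asset editing is therefore needed.
Result: Extensive experiments show that Mirage achieves high realism and temporal consistency across diverse editing scenarios and can serve as a reliable baseline for future research.
Insight: The innovations are: using a text-to-video diffusion prior to ensure temporal consistency; injecting 2D encoder latents to restore spatial detail; and a two-stage data alignment strategy that resolves the distribution mismatch between inserted assets and scene objects, improving alignment quality.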
Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.
[54] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model cs.CV | cs.LGPDF
Rahul Medicharla, Alper Yilmaz
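TL;DR: This paper presents MotivNet, a facial emotion recognition (FER) model for robust real-world application, built on the human vision foundation model Meta-Sapiens. Without cross-domain training, it adds a downstream task on the Sapiens backbone, achieves competitive performance across multiple datasets, and is validated as a viable Sapiens downstream task.
Details
Motivation: Current state-of-the-art FER models generalize poorly to diverse data, which degrades real-world performance and hinders the field. The complex architectures proposed to fix generalization usually require cross-domain training, which conflicts with the needs of real-world applications.
Result: MotivNet achieves competitive performance across multiple datasets without cross-domain training. Evaluated on three criteria (benchmark performance, model similarity, and data similarity), it is validated as a viable Sapiens downstream task and can be benchmarked against existing SOTA models.
Insight: The innovation is using Meta-Sapiens, a human vision foundation model built by large-scale pretraining of a Masked Autoencoder and offering strong real-world generalization, as the backbone. Adding FER as a downstream task resolves the tension between poor FER generalization and reliance on cross-domain training, giving a more practical route to real-world application.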
Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require cross-domain training to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet’s viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more attractive for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.
[55] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation cs.CVPDF
Fuqiang Gu, Yuanke Li, Xianlei Long, Kangping Ji, Chao Chen
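TL;DR: This paper proposes MambaSeg, a novel dual-branch semantic segmentation framework that uses parallel Mamba encoders to model RGB images and event streams efficiently. To reduce cross-modal ambiguity, it introduces the Dual-Dimensional Interaction Module (DDIM), which comprises a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM) and performs fine-grained fusion along both spatial and temporal dimensions to exploit the complementary properties of the two modalities.
Details
Motivation: Existing RGB methods degrade under fast motion, low light, or high dynamic range, while event cameras offer high temporal resolution and low latency but lack color and texture. Existing RGB-event fusion methods are usually computationally expensive and focus mainly on spatial fusion, neglecting the temporal dynamics inherent in event streams.
Result: Extensive experiments on the DDD17 and DSEC datasets show that MambaSeg achieves state-of-the-art (SOTA) segmentation performance while significantly reducing computational cost.
Insight: The main innovations are the parallel Mamba encoders for efficient dual-modality processing and the new DDIM, which jointly performs fine-grained spatial and temporal fusion to improve cross-modal alignment and reduce ambiguity. Viewed objectively, bringing the Mamba architecture to multimodal vision tasks and systematically modeling the temporal dynamics of event streams are ideas worth borrowing.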
Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.
[56] Taming Hallucinations: Boosting MLLMs’ Video Understanding via Counterfactual Video Generation cs.CV | cs.AIPDF
Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu
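TL;DR: This paper addresses visual hallucinations in video understanding that arise when multimodal large language models over-rely on language priors. It proposes DualityForge, a counterfactual video generation framework that uses controllable diffusion-based video editing to turn real videos into counterfactual scenarios and automatically generates high-quality QA pairs, yielding DualityVidQA, a large-scale video dataset for contrastive training. It also proposes Duality-Normalized Advantage Training, a two-stage regime of supervised fine-tuning followed by reinforcement learning. Experiments show that the method substantially reduces hallucinations on counterfactual videos and improves results on multiple benchmarks.
Details
Motivation: Address the visual hallucinations that multimodal large language models exhibit in video understanding because of the imbalance between text and video data, in particular their over-reliance on language priors when handling counterfactual videos that defy common sense.
Result: On DualityVidQA-Test, the method achieves a 24.0% relative improvement in hallucination reduction over the Qwen2.5-VL-7B baseline, with significant gains on both hallucination and general-purpose benchmarks, indicating strong generalization.
Insight: The innovations are DualityForge, a counterfactual video data synthesis framework that uses controllable diffusion editing to generate high-quality contrastive training data automatically, and the DNA-Train strategy, which exploits the contrastive nature of the paired data for stable policy optimization. Together they ease the high cost of data collection and improve model robustness.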
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
[57] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots cs.CV | cs.ROPDF
Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li
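TL;DR: UniAct is a two-stage framework that combines a fine-tuned multimodal large language model (MLLM) with a causal streaming pipeline, enabling humanoid robots to execute multimodal instructions (such as language, music, and trajectories) with sub-500 ms latency and advancing unified perception and control.
Details
Motivation: Address the bottleneck of bridging high-level multimodal perception with whole-body execution in humanoid robots; existing methods struggle to turn heterogeneous instructions into stable, real-time actions.
Result: Validated on UniMoCap, the authors' 20-hour humanoid motion benchmark, the approach improves the success rate of zero-shot tracking of imperfect reference motions by 19% and shows robust generalization across diverse real-world scenarios.
Insight: Unifying inputs through a shared discrete codebook via FSQ ensures cross-modal alignment and constrains motions to a physically feasible manifold; combining an MLLM with streaming yields low-latency, general-purpose motion generation.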
Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions – such as language, music, and trajectories – into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
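The shared discrete codebook mentioned in this abstract can be pictured with a small sketch, assuming FSQ refers to finite scalar quantization: each latent channel is bounded with `tanh`, scaled, and rounded to a small set of integer levels, and the per-channel codes are combined into one token index. The level counts, the restriction to odd level counts, and the names `fsq_quantize` and `fsq_index` are illustrative assumptions, not details from the paper.

```python
import math


def fsq_quantize(z, levels):
    """Round each channel of z to one of levels[i] integers centered on 0.

    Sketch only: handles odd level counts, where the codes are
    -half..half with half = (levels[i] - 1) // 2.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (num_levels - 1) // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(int(round(bounded)))
    return codes


def fsq_index(codes, levels):
    """Combine per-channel codes into a single token id (mixed radix)."""
    index, base = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index += (code + half) * base  # shift code to 0..num_levels-1
        base *= num_levels
    return index


levels = [5, 5, 5]  # implicit codebook of 5 * 5 * 5 = 125 entries
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = fsq_index(codes, levels)
print(codes, token)  # [0, 2, -2] 22
```

Because the codebook is implicit in the rounding grid, every modality that is projected into the same latent channels lands in the same token vocabulary, which is one way to read the abstract's "shared discrete codebook".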
[58] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention cs.CVPDF
Haijing Liu, Zhiyuan Song, Hefeng Wu, Tao Pu, Keze Wang
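TL;DR: This paper proposes CERES, a causal intervention framework for referring video object segmentation in first-person (egocentric) video. It applies dual-modal causal intervention: backdoor adjustment to correct language representation bias, and front-door adjustment that integrates semantic visual features with geometric depth information to handle the visual confounding of egocentric video. Experiments show that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks.
Details
Motivation: Address the robustness challenges in egocentric referring video object segmentation caused by the inherent ambiguity of the videos and by training-data bias, so that models avoid learning spurious correlations from the data and are less affected by egocentric visual confounders such as rapid motion and frequent occlusion.
Result: Achieves state-of-the-art performance on Ego-RVOS benchmarks.
Insight: The innovation is applying causal reasoning to referring video object segmentation: dual-modal causal intervention (backdoor and front-door adjustment) corrects language bias and visual confounding, improving robustness and reliability on egocentric video. Viewed objectively, integrating depth information and guiding feature fusion by causal principles gives a framework that broader egocentric video understanding tasks can borrow.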
Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
[59] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning cs.CVPDF
Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen
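TL;DR: This paper proposes SenseNova-MARS, a framework that strengthens multimodal agentic reasoning and search through reinforcement learning. It lets vision-language models dynamically interleave tools such as image search, text search, and image cropping to solve fine-grained, knowledge-intensive visual understanding tasks.
Details
Motivation: In agentic reasoning, existing vision-language models are largely limited to text-oriented chain-of-thought or isolated tool calls, and cannot dynamically interleave tool use with continuous reasoning in knowledge-intensive, visually complex scenarios.
Result: On search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on the authors' new HR-MMSearch benchmark, surpassing proprietary models such as Gemini-3-Flash and GPT-5 and reaching SOTA.
Insight: The core innovation is using reinforcement learning (specifically the proposed BN-GSPO algorithm) to give the model the ability to interleave multiple tools dynamically during visual reasoning, together with HR-MMSearch, the first search-oriented benchmark built from high-resolution images, to advance evaluation in this area.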
Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model’s ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
[60] Spatial-aware Vision Language Model for Autonomous Driving cs.CVPDF
Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong
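TL;DR: This paper proposes LVLDrive, a framework that adds LiDAR point clouds as an extra input modality to strengthen the 3D metric spatial understanding of existing vision-language models for autonomous driving, addressing the reliability bottleneck of vision-only methods in complex scene understanding and decision-making.
Details
Motivation: Existing image-based vision-language models rely on 2D image cues for complex scene understanding and decision-making and lack accurate metric spatial reasoning and geometric inference, which leads to unreliable driving policies and limits the safety and reliability of autonomous driving.
Result: Extensive experiments on driving benchmarks show that LVLDrive outperforms vision-only models in scene understanding, metric spatial perception, and reliable driving decision-making.
Insight: The innovations are a Gradual Fusion Q-Former that injects LiDAR features incrementally to keep the pretrained model's knowledge stable, and a spatial-aware question-answering dataset that explicitly teaches the model advanced 3D perception and reasoning. The work underlines the need for explicit 3D metric data in trustworthy VLM-based driving systems.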
Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM’s existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.
[61] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images cs.CV | cs.AI | cs.CLPDF
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella
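TL;DR: This paper presents DermaVQA-DAS, an extended dermatology visual question answering dataset, and introduces the Dermatology Assessment Schema (DAS), an expert-developed standardized framework for systematically capturing clinically meaningful dermatological features. The work supports two tasks, closed-ended question answering and dermatological lesion segmentation, provides expert-annotated datasets, and benchmarks current state-of-the-art multimodal models.
Details
Motivation: Most existing dermatology image analysis datasets focus on dermatoscopic images and lack patient-authored queries and clinical context, which limits their use in patient-centered care. The paper aims to fill this gap.
Result: For segmentation, BiomedParse performs best under majority-vote-based microscore evaluation when given an augmented prompt that combines the patient query title and content (Jaccard index 0.395, Dice score 0.566). For closed-ended QA, o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), and Gemini-1.5-Pro (0.783) is competitive within the Gemini family.
Insight: The innovation is a structured, expert-developed framework (DAS) that standardizes the assessment of dermatological features, plus datasets of patient-generated images and queries that support research on patient-centered care. Viewed objectively, combining a systematic clinical assessment schema with multimodal tasks (QA and segmentation) and studying how prompt design affects performance are useful lessons for medical vision-language modeling.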
Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
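For readers less familiar with the two segmentation metrics reported here, the sketch below computes the Jaccard index and Dice score for a pair of binary masks. It is a generic illustration, not the paper's evaluation code, and the function name `jaccard_and_dice` is hypothetical. When both metrics are computed from the same pooled counts they are linked by Dice = 2J / (1 + J), and the reported values are consistent with that identity.

```python
def jaccard_and_dice(pred, target):
    """Return (Jaccard, Dice) for two equal-length binary masks.

    Jaccard = |A and B| / |A or B|;  Dice = 2 |A and B| / (|A| + |B|).
    Two empty masks are treated as a perfect match (score 1.0).
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    jaccard = intersection / union if union else 1.0
    total = pred_area + target_area
    dice = 2 * intersection / total if total else 1.0
    return jaccard, dice


# Flattened toy masks: 1 marks a lesion pixel.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
jaccard, dice = jaccard_and_dice(pred, target)
print(jaccard, round(dice, 4))  # 0.5 0.6667

# The reported pair (0.395, 0.566) satisfies Dice = 2J / (1 + J).
print(round(2 * 0.395 / (1 + 0.395), 3))  # 0.566
```

Dice is never lower than Jaccard for the same masks, so comparisons across papers need to use the same metric.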
[62] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems cs.CV | cs.ROPDF
Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li
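TL;DR: This paper presents a comprehensive framework for multi-modal data pre-training in autonomous systems, aiming to build unified spatial intelligence from multi-sensor data such as cameras and LiDAR. It analyzes the interplay between sensor characteristics and learning strategies and establishes a taxonomy of pre-training paradigms, from single-modality baselines to unified frameworks, that supports advanced tasks such as 3D object detection and semantic occupancy prediction.
Details
Motivation: Address the core challenge, in autonomous systems such as self-driving vehicles, of integrating data from different sensor modalities (such as cameras and LiDAR) into a unified, robust spatial understanding.
Result: The abstract reports no quantitative experimental results or benchmark comparisons; the main contribution is a systematic taxonomy and a roadmap for future work.
Insight: The innovation is a unified taxonomy of multi-modal pre-training paradigms, together with an exploration of combining textual inputs and occupancy representations for open-world perception and planning, pointing toward general-purpose multi-modal foundation models.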
Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
[63] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics cs.CV | cs.DCPDF
Gur-Eyal Sela, Kumar Krishna Agrawal, Bharathan Balaji, Joseph Gonzalez, Ion Stoica
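TL;DR: RedunCut is a dynamic model size selection (DMSS) system for low-cost live video analytics (LVA). It uses a measurement-driven planner to estimate the cost-benefit tradeoff of sampling, together with a lightweight, data-driven performance model to improve accuracy prediction, substantially reducing compute cost at fixed accuracy.
Details
Motivation: Existing DMSS methods generalize poorly to diverse workloads, particularly mobile videos and lower accuracy targets. The root causes are inefficient sampling and inaccurate per-segment accuracy prediction.
Result: Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.
Insight: The novelty lies in a measurement-driven planner that optimizes sampling decisions and in a data-driven performance model that replaces conventional black-box statistical approaches for more accurate accuracy prediction, yielding more efficient adaptive model selection.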
Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: It uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.
[64] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model cs.CVPDF
Bohong Chen, Haiyang Liu
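TL;DR: DyStream is a flow matching-based autoregressive model for real-time generation of dyadic talking-head video. It combines a stream-friendly autoregressive framework with a causal encoder and a lookahead module, achieving high-quality lip sync at low latency (34 ms per frame).
Details
Motivation: Existing chunk-based methods require full non-causal context windows, which introduces high latency and prevents the immediate non-verbal feedback needed in real conversation. An ultra-low-latency generation method is therefore required.
Result: DyStream achieves state-of-the-art lip-sync quality on the HDTF dataset, with offline and online LipSync Confidence scores of 8.13 and 7.61, respectively. It generates each frame within 34 ms, and total system latency stays under 100 ms.
Insight: The novelty lies in a flow-matching autoregressive framework for probabilistic modeling and in a causal encoder design that incorporates short future context (e.g., 60 ms), improving generation quality while keeping latency low.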
Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module that incorporates short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple yet effective method significantly surpasses alternative causal strategies, including distillation and a generative encoder. Extensive experiments show that DyStream can generate video within 34 ms per frame, keeping the entire system latency under 100 ms. In addition, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.
[65] AI-Driven Evaluation of Surgical Skill via Action Recognition cs.CVPDF
Yan Meng, Daniel A. Donoho, Marcelle Altshuler, Omar Arnaout
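TL;DR: This paper proposes an AI-based framework for automated surgical skill assessment. It performs action recognition with a video transformer architecture (an improved TimeSformer) and extracts fine-grained motion features with YOLO-based object detection and tracking, in order to assess five aspects of microanastomosis skill. Experiments on a dataset of 58 expert-annotated videos validate the approach: frame-level accuracy for action segmentation is 87.7% (93.62% with post-processing), and average skill classification accuracy is 76%.
Details
Motivation: Conventional surgical skill assessment relies on expert supervision and suffers from subjectivity, high inter-rater variability, and heavy time and effort demands, which makes it especially hard to scale in low- and middle-income countries. The paper aims to develop an automated, objective, AI-driven assessment method that addresses these limitations.
Result: On a dataset of 58 expert-annotated videos, the system reaches 87.7% frame-level accuracy in action segmentation (93.62% after post-processing) and an average classification accuracy of 76% in replicating expert assessments across all skill aspects, indicating potential for objective evaluation.
Insight: The novelty includes an improved TimeSformer (with hierarchical temporal attention and weighted spatial attention) for action recognition in surgical video, and the integration of YOLO-based detection and tracking to extract instrument kinematics for multi-aspect skill assessment. Viewed objectively, the method combines video understanding with motion analysis to offer an interpretable, data-driven, standardized assessment scheme for surgical education.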
Abstract: The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system’s potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.
[66] Exploring Compositionality in Vision Transformers using Wavelet Representations cs.CVPDF
Akshad Shyam Purushottamdas, Pranav K Nayak, Divya Mehul Rajparia, Deekshith Patel, Yashmitha Gogineni
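TL;DR: This paper introduces the Discrete Wavelet Transform (DWT) as a way to obtain visual primitives and proposes a framework for probing compositionality in the encoder representations of Vision Transformers (ViTs). It finds empirically that primitives from a one-level DWT decomposition approximately compose in latent space, offering a new perspective on how ViTs structure information.
Details
Motivation: Drawing on methods used to analyze transformer compositionality in language tasks, the paper investigates whether the representations learned by the ViT encoder are compositional, in order to better understand how the model works.
Result: Experiments show that primitives from a one-level DWT decomposition approximately compose in latent space and thereby reproduce the original image representation, providing empirical support for compositionality in ViT representations.
Insight: The novelty lies in bringing the DWT into the vision setting to obtain input-dependent primitives and in establishing a framework for measuring compositionality in the ViT encoder, which provides a new tool and perspective for analyzing the internal representations of vision transformers.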
Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
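As a rough illustration of the idea in this abstract, the sketch below builds image-space primitives from a one-level 2D Haar DWT and measures how far an encoder is from composing them. It is not the authors' code: the Haar wavelet, the additive composition rule, the sub-band naming, and the `encode` callable are all assumptions made for illustration, and the paper may use a different wavelet or composition function.

```python
def haar_dwt2(img):
    """One-level 2D Haar DWT of an image given as a list of rows (even height/width)."""
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for i in range(0, len(img), 2):
        rows = {name: [] for name in bands}
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows["LL"].append((a + b + c + d) / 2)  # local average
            rows["LH"].append((a - b + c - d) / 2)  # horizontal difference
            rows["HL"].append((a + b - c - d) / 2)  # vertical difference
            rows["HH"].append((a - b - c + d) / 2)  # diagonal difference
        for name in bands:
            bands[name].append(rows[name])
    return bands


def haar_idwt2(bands):
    """Inverse of haar_dwt2."""
    ll, lh, hl, hh = bands["LL"], bands["LH"], bands["HL"], bands["HH"]
    out = []
    for i in range(len(ll)):
        top, bottom = [], []
        for j in range(len(ll[0])):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            top += [(s + x + y + z) / 2, (s - x + y - z) / 2]
            bottom += [(s + x - y - z) / 2, (s - x - y + z) / 2]
        out += [top, bottom]
    return out


def primitives(img):
    """Image-space primitives: invert each sub-band with the other three zeroed."""
    bands = haar_dwt2(img)
    zero = [[0.0] * len(bands["LL"][0]) for _ in bands["LL"]]
    return [haar_idwt2({n: (bands[n] if n == keep else zero) for n in bands})
            for keep in bands]


def add_images(images):
    """Element-wise sum of equally sized images."""
    return [[sum(px) for px in zip(*rows)] for rows in zip(*images)]


def composition_error(encode, img):
    """Squared distance between encode(img) and the sum of encode(primitive).

    `encode` is a hypothetical stand-in for a ViT encoder that returns a flat
    list of floats; additive composition is an assumption of this sketch.
    """
    target = encode(img)
    composed = [sum(v) for v in zip(*(encode(p) for p in primitives(img)))]
    return sum((t - c) ** 2 for t, c in zip(target, composed))


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
flatten = lambda im: [v for row in im for v in row]  # toy linear "encoder"
```

Because the DWT is linear, the four primitives always sum back to the input image. A linear toy encoder such as `flatten` therefore gives zero composition error, while a nonlinear encoder generally does not, which is why approximate composition in a real ViT is an empirical finding rather than a given.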
[67] Using Large Language Models To Translate Machine Results To Human Results cs.CVPDF
Trishna Niraula, Jonathan Stubblefield
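TL;DR: This study proposes a pipeline that combines YOLOv5/YOLOv8 object detection models with a large language model (LLM) to detect anomalies in chest X-ray images and automatically generate radiology reports.
Details
Motivation: Medical imaging AI systems typically output only structured predictions, leaving radiologists to turn them into full narrative reports. The study uses large language models such as GPT-4 to bridge this gap.
Result: YOLOv5 and YOLOv8 are compared on detection accuracy, inference latency, and generated text quality (measured by cosine similarity). AI-generated and human reports show strong semantic similarity, but human evaluation indicates that GPT-4 scores well on clarity (4.88/5) and lower on natural writing flow (2.81/5).
Insight: The novelty lies in combining the structured output of object detection models with the text generation ability of an LLM to build an end-to-end automated reporting pipeline. Viewed objectively, the study explores a practical use of multi-modal AI in clinical workflows and shows that current systems are viable in terms of clinical accuracy but still differ stylistically from professionally written reports.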
Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
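The abstract scores generated text by cosine similarity to ground-truth reports but does not say which text representation is used. The sketch below shows the metric itself on simple bag-of-words counts; the whitespace tokenization and the count vectors are assumptions for illustration, and an embedding model could be substituted without changing the formula.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Scores near 1 mean the two reports use largely the same vocabulary in similar proportions. This says nothing about writing style, which is consistent with the gap between similarity scores and the human ratings of writing flow reported above.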
[68] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression cs.CVPDF
Manikanta Kotthapalli, Banafsheh Rekabdar
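TL;DR: This paper proposes a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) for perceptual compression of low-resolution video. The method extends the VQ-VAE-2 framework with a two-level hierarchical latent structure built from 3D residual convolutions, aiming to produce compact, high-fidelity latent representations of video suitable for efficient storage, transmission, and decoding on edge devices.
Details
Motivation: The exponential growth of video traffic places increasing demands on bandwidth and storage infrastructure, especially for content delivery networks (CDNs) and edge devices. Traditional video codecs such as H.264 and HEVC are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, which limits their integration into deep learning pipelines.
Result: Trained on the UCF101 dataset with 2-second video clips (32 frames at 16 FPS), the model reaches 25.96 dB PSNR and 0.8375 SSIM on the test set. On the validation set, it improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM.
Insight: The novelty includes extending VQ-VAE-2 to a spatiotemporal setting, introducing a two-level hierarchical latent structure based on 3D residual convolutions, and adding a perceptual loss from a pre-trained VGG16 network to improve reconstruction quality. The model is lightweight (about 18.5M parameters) and optimized for 64x64 video, which suits deployment on edge devices with constrained compute and memory and bandwidth-sensitive scenarios such as real-time streaming, mobile video analytics, and CDN-level storage optimization.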
Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), on the test set we achieve 25.96 dB PSNR and 0.8375 SSIM. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
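For readers less familiar with the PSNR figures quoted above, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE). The 8-bit peak value of 255 and the flat pixel-list inputs are assumptions; the abstract does not state the pixel range or how scores are averaged over frames and clips.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Clip length from the abstract: 2-second clips at 16 FPS.
frames_per_clip = 2 * 16
```

Because PSNR is logarithmic, the reported 1.41 dB gain over the single-scale baseline corresponds to roughly a 28% reduction in mean squared error (10^(-0.141) is about 0.72), assuming both scores use the same peak value.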
[69] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation cs.CVPDF
Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun
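TL;DR: This paper proposes PhyGDPO, a physics-aware groupwise direct preference optimization framework for improving the physical consistency of text-to-video generation. It first introduces PhyAugPipe, a physics-augmented video data construction pipeline that uses a vision-language model with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. It then builds on the groupwise Plackett-Luce probabilistic model and designs a Physics-Guided Rewarding scheme and a LoRA-Switch Reference scheme to optimize the model efficiently so that generated videos better follow physical laws.
Details
Motivation: Text-to-video generation has advanced in visual quality, but generated videos often fail to follow physical laws faithfully. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments and cannot learn implicit physical reasoning. Training data with rich physical interactions and phenomena is also scarce.
Result: Experiments show that the method significantly outperforms state-of-the-art open-source methods on the PhyGenBench and VideoPhy2 benchmarks.
Insight: The main contributions are: 1) automatic construction of a large-scale physics-augmented video dataset using a vision-language model with chain-of-thought reasoning; 2) a physics-aware direct preference optimization framework built on the groupwise Plackett-Luce model, which goes beyond pairwise comparisons; 3) a Physics-Guided Rewarding scheme that embeds VLM-based physics rewards into optimization; and 4) a LoRA-Switch Reference scheme that eliminates memory-heavy reference duplication for efficient training.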
Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
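To make the "groupwise Plackett-Luce" idea above concrete, the sketch below computes the log-probability of a full ranking under the standard Plackett-Luce model, where the top item is chosen with softmax probability, then the next from the remainder, and so on. The scores here are plain placeholder numbers; how PhyGDPO derives them from the policy, the reference model, and the physics rewards is not specified in the abstract and is not modeled.

```python
import math


def plackett_luce_log_prob(scores_in_rank_order):
    """Log-probability of a ranking under Plackett-Luce; scores listed best first."""
    log_prob = 0.0
    for k in range(len(scores_in_rank_order)):
        remaining = scores_in_rank_order[k:]
        # Numerically stable log-sum-exp over the items not yet placed.
        m = max(remaining)
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        log_prob += scores_in_rank_order[k] - log_norm
    return log_prob
```

With a group of two items this reduces to the pairwise Bradley-Terry form, `sigmoid(s_winner - s_loser)`, which is the comparison used in standard DPO. Larger groups let one objective reflect the whole ordering of candidate videos instead of a single pair.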
[70] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios cs.CVPDF
Tianyi Zhao, Jiawen Xi, Linhui Xiao, Junnan Li, Xue Yang
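TL;DR: This paper presents RGBT-Ground, the first large-scale visual grounding benchmark for complex real-world scenarios. It contains spatially aligned RGB and thermal infrared image pairs, high-quality referring expressions, object bounding boxes, and fine-grained annotations. The paper also builds a unified visual grounding framework that supports uni-modal and multi-modal inputs and proposes RGBT-VGNet, a simple yet effective baseline that significantly outperforms existing methods in nighttime and long-distance scenarios.
Details
Motivation: Existing visual grounding benchmarks are mostly derived from datasets collected in clean environments, such as COCO, with limited scene diversity. They do not reflect real-world conditions such as varying illumination and weather, which makes it hard to evaluate model robustness and generalization in safety-critical applications.
Result: On the RGBT-Ground benchmark, the proposed RGBT-VGNet significantly outperforms existing adapted methods, particularly in nighttime and long-distance scenarios.
Insight: The novelty includes building the first RGB-thermal aligned visual grounding benchmark for complex real-world scenes and proposing a simple yet effective baseline that fuses complementary visual modalities, providing a new direction and data support for research on robust visual grounding.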
Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We conduct extensive adaptations to the existing methods on RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.
[71] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning cs.CVPDF
Fuyu Dong, Ke Li, Di Wang, Nan Luo, Yiming Zhang
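TL;DR: This paper proposes DARFT, a decision-ambiguity-guided reinforcement fine-tuning framework that targets failures caused by decision ambiguity in change detection visual question answering (CDVQA). The method mines decision-ambiguous samples and optimizes on them using intra-group relative advantages to improve performance in few-shot settings.
Details
Motivation: Existing CDVQA methods improve performance through supervised fine-tuning, but many failures stem from decision ambiguity, where the model assigns similar confidence to the correct answer and to strong distractors. This limits the discriminability and robustness of the model.
Result: Extensive experiments show that DARFT yields consistent gains over supervised fine-tuning baselines in few-shot settings and, in particular, sharpens decision boundaries on decision-ambiguous samples.
Insight: The novelty lies in formally defining Decision-Ambiguous Samples (DAS) and in proposing a reinforcement learning approach based on group-relative policy optimization that suppresses strong distractors and improves discriminability without additional supervision.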
Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
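The two mechanisms named in this abstract can be sketched in a few lines: mining samples by the probability margin between the ground-truth answer and its strongest competitor, and computing group-relative advantages from several sampled responses. The threshold value, the use of the absolute margin, and the population standard deviation in the normalization are assumptions for illustration; the paper's exact criteria may differ.

```python
import statistics


def probability_margin(probs, truth):
    """Ground-truth probability minus the probability of the best competing answer."""
    best_other = max(p for answer, p in probs.items() if answer != truth)
    return probs[truth] - best_other


def is_decision_ambiguous(probs, truth, threshold=0.1):
    """Flag a sample whose margin is small in magnitude (hypothetical threshold)."""
    return abs(probability_margin(probs, truth)) < threshold


def group_relative_advantages(rewards):
    """Normalize rewards within one group of sampled responses to the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all responses equally good: no learning signal
    return [(r - mean) / std for r in rewards]
```

For example, a sample with answer probabilities `{"yes": 0.48, "no": 0.45, "maybe": 0.07}` and ground truth `"yes"` has a margin of about 0.03 and would be mined, while one with `{"yes": 0.9, "no": 0.1}` would not. Note that a group in which every response earns the same reward yields zero advantages, so ambiguous samples with mixed outcomes are the ones that contribute gradient signal.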
[72] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks cs.CVPDF
Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao
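TL;DR: This paper proposes SliceLens, a fine-grained, interpretable error slice discovery framework for multi-instance vision tasks such as detection, segmentation, and pose estimation. The framework uses large language models (LLMs) and vision-language models (VLMs) for hypothesis-driven visual reasoning to reliably identify fine-grained error slices. The authors also build FeSD, the first benchmark for fine-grained error slice discovery, and validate SliceLens on it.
Details
Motivation: Existing error slice discovery methods mainly target image classification, are hard to apply to multi-instance vision tasks, and lack fine-grained reasoning. Existing benchmarks are also biased and do not reflect real model failures.
Result: On the new FeSD benchmark, SliceLens achieves state-of-the-art performance with a Precision@10 of 0.73, an improvement of 0.42 over the baseline (0.31). Extensive experiments on existing benchmarks also confirm its effectiveness, and model repair experiments further show that the discovered slices can guide model improvement.
Insight: The novelty is twofold: 1) a hypothesis-driven framework combining LLMs and VLMs that generates and verifies diverse failure hypotheses through grounded visual reasoning, enabling fine-grained, interpretable error slice discovery for multi-instance tasks; and 2) FeSD, the first expert-annotated benchmark dedicated to fine-grained error slice discovery, which provides a more realistic and precise basis for evaluation. Viewed objectively, systematically applying the reasoning ability of large models (LLMs/VLMs) to model error diagnosis is a promising direction, and high-quality, fine-grained benchmarks are key to advancing the field.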
Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
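The headline metric above, Precision@10, can be sketched with its generic retrieval definition: the fraction of the top-k returned items that belong to the ground-truth set. How FeSD matches retrieved samples to ground-truth slices and averages across slices is not described in the abstract, so the function below is only the basic building block under that assumption.

```python
def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k ranked items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k  # missing positions (fewer than k items) count as misses


# Reported gain on FeSD from the abstract: 0.73 vs. 0.31.
reported_gain = round(0.73 - 0.31, 2)
```

A single query can only score multiples of 0.1 at k = 10, so figures such as 0.73 presumably come from averaging over many slices or models.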
[73] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers cs.CVPDF
Zheng Liu, Jinchao Zhu, Gao Huang
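TL;DR: This paper proposes collaborative low-rank adaptation (CLoRA), a new method for fine-tuning pre-trained vision transformers. Through base-space sharing and a sample-agnostic diversity enhancement (SADE) component, CLoRA increases learning capacity while staying parameter-efficient, aiming to balance fine-tuning performance against parameter efficiency.
Details
Motivation: Existing low-rank adaptation methods either sacrifice performance in pursuit of parameter efficiency or introduce too many trainable parameters, failing to balance learning performance and parameter efficiency. A new method is needed to improve this balance.
Result: Experiments on widely used image and point cloud datasets show that CLoRA strikes a better balance between learning performance and parameter efficiency than state-of-the-art methods, while requiring the fewest GFLOPs for point cloud analysis.
Insight: The novelty includes a base-space sharing mechanism that expands the learning capacity of low-rank modules without adding parameters, and the SADE component, which regularizes similarities among the low-rank matrices to encourage diverse representations and thus improves fine-tuning effectiveness and efficiency.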
Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.
[74] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding cs.CVPDF
Panquan Yang, Junfei Huang, Zongzhangbao Yin, Yingsong Hu, Anni Xu
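TL;DR: This paper introduces a new 3D visual grounding task for roadside monitoring scenarios and builds MoniRefer, the first real-world large-scale multi-modal dataset for it, containing about 136,000 objects and 411,000 natural language expressions. The authors also propose Moni3DVG, an end-to-end method that uses appearance information from images together with geometric and optical information from point clouds for multi-modal feature learning and 3D object localization.
Details
Motivation: Existing datasets and methods for 3D visual grounding focus mainly on indoor and in-vehicle driving scenes. Outdoor monitoring scenarios remain unexplored because paired point cloud-text data captured by roadside infrastructure sensors is scarce. The paper aims to fill this gap so that infrastructure can understand traffic scenes.
Result: Extensive experiments and ablation studies on the proposed benchmark demonstrate the superiority and effectiveness of the method, but the abstract does not give quantitative results or state whether it reaches state-of-the-art performance.
Insight: The contributions are: 1) the first definition of a 3D visual grounding task for roadside monitoring scenarios; 2) MoniRefer, the first real-world large-scale multi-modal dataset for this task; and 3) the Moni3DVG method, which fuses image appearance with point cloud geometric and optical information for multi-modal learning.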
Abstract: 3D visual grounding aims to localize the object in a 3D point cloud scene that semantically corresponds to a given natural language sentence. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes, while outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. We also propose a new end-to-end method, named Moni3DVG, which uses the rich appearance information provided by images and the geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.
[75] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning cs.CVPDF
Shuyuan Lin, Yu Guo, Xiao Chen, Yanjie Liang, Guobao Xiao
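TL;DR: This paper proposes LLHA-Net, a layer-by-layer hierarchical attention network that addresses the large number of outliers encountered in feature point matching in computer vision. Through stage fusion, hierarchical extraction, and an attention mechanism, the method strengthens the network's ability to represent the rich semantic information of feature points, improving matching precision and robustness.
Details
Motivation: Feature point matching is a fundamental task in computer vision, but a large number of outliers can severely degrade the accuracy and robustness of matching results. The core challenge is to extract high-quality information and reduce errors caused by negative samples when the outlier ratio is high.
Result: Experiments on two public datasets, YFCC100M and SUN3D, show that the method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation.
Insight: The contributions are: 1) a layer-by-layer channel fusion module that preserves and fuses the feature semantic information from each stage; 2) a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information; and 3) two architectures for extracting and integrating features, which improve the network's adaptability. Viewed objectively, the combination of hierarchical attention and stage fusion provides an effective end-to-end learning framework for matching problems with high outlier ratios.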
Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network’s representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.
[76] Renormalization Group Guided Tensor Network Structure Search cs.CV | cs.AIPDF
Maolin Wang, Bowen Yu, Sheng Zhang, Linjie Mi, Wanyu Wang
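TL;DR: This paper proposes RGTN, a physics-inspired framework for tensor network structure search that uses multi-scale renormalization group flows to address the limitations of conventional methods in computational tractability, structure adaptivity, and optimization robustness. The method uses dynamic scale transformation for continuous structure evolution and optimizes topology with learnable edge gates and intelligent proposals based on physical quantities, achieving state-of-the-art compression ratios and large speedups on light field data, high-order synthetic tensors, and video completion tasks.
Details
Motivation: Existing tensor network structure search methods suffer from computational intractability, poor structure adaptivity, and insufficient optimization robustness. In particular, single-scale optimization misses multi-scale structure, discrete search spaces hinder smooth structure evolution, and separating structure optimization from parameter optimization is inefficient.
Result: Experiments on light field data, high-order synthetic tensors, and video completion tasks show that RGTN achieves state-of-the-art compression ratios and runs 4 to 600 times faster than existing methods.
Insight: The novelty includes multi-scale renormalization group flows for continuous structure evolution, learnable edge gates for topology modification, and intelligent proposals based on physical quantities such as node tension and edge information flow. These physics-inspired designs help escape local minima and find compact structures.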
Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600x faster than existing methods, validating the effectiveness of our physics-inspired approach.
[77] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting cs.CV | cs.AIPDF
Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai
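TL;DR: This paper proposes EVOL-SAM3, a new zero-shot framework for reasoning segmentation. It reformulates the task as an inference-time evolutionary search process that iteratively refines prompt hypotheses through a "Generate-Evaluate-Evolve" loop, avoiding the problems of conventional supervised fine-tuning and reinforcement learning and significantly outperforming existing methods in the zero-shot setting.
Details
Motivation: Current mainstream reasoning segmentation methods rely heavily on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT suffers from catastrophic forgetting and domain dependency, while RL is limited by training instability and rigid reliance on predefined reward functions. Recent training-free methods avoid the training burden, but their static inference paradigm has fundamental limitations: insufficient reasoning depth and no ability to self-correct linguistic hallucinations or spatial misinterpretations.
Result: Extensive experiments on the challenging ReasonSeg benchmark show that EVOL-SAM3 not only substantially outperforms static baselines but also, in a zero-shot setting, significantly surpasses fully supervised state-of-the-art methods.
Insight: The core novelty is recasting reasoning segmentation as an inference-time evolutionary search process. The paper introduces a "Generate-Evaluate-Evolve" loop, a Visual Arena that assesses prompt fitness through reference-free pairwise tournaments, a Semantic Mutation operator that injects diversity and corrects semantic errors, and a Heterogeneous Arena module that integrates geometric priors with semantic reasoning, enabling dynamic, iterative, self-correcting reasoning.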
Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass “generate-then-segment” chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a “Generate-Evaluate-Evolve” loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.
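[78] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement cs.CV | cs.CL
Wentao Zhang, Tao Fang, Lina Lu, Lifei Wang, Weihe Zhong
TL;DR: This paper proposes CPJ (Caption-Prompt-Judge), a training-free few-shot framework for explainable agricultural pest and disease diagnosis. It uses large vision-language models to generate multi-angle image captions, refines them iteratively with an LLM-as-Judge module, and feeds them to a dual-answer VQA process that outputs both recognition and management advice.
Details
Motivation: Existing agricultural pest and disease diagnosis methods typically depend on costly supervised fine-tuning, degrade under domain shift, and lack interpretability. The paper aims for a diagnosis framework that needs no fine-tuning and is robust and explainable.
Result: On CDDMBench, CPJ improves performance substantially: with captions generated by GPT-5-mini, GPT-5-Nano gains 22.7 percentage points in disease classification and 19.5 points in QA score over the no-caption baseline.
Insight: The novelty is a training-free, structured framework built on iteratively refined captions (LLM-as-Judge). It decouples caption generation from the VQA step and strengthens both, giving transparent, evidence-based reasoning and improving diagnostic robustness and interpretability.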
Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption–Prompt–Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
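[79] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation cs.CV
Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
TL;DR: This paper proposes FlowBlending, a stage-aware multi-model sampling strategy for accelerating video generation. The authors find that the effect of model capacity varies across timesteps: it is crucial in the early and late stages and largely negligible in the intermediate stage. FlowBlending therefore uses a large model in the capacity-sensitive stages and a small model in the intermediate stage, which markedly speeds up inference and reduces compute while preserving the large model's visual fidelity, temporal coherence, and semantic alignment.
Details
Motivation: Video generation models are slow and expensive at inference time. The observation that capacity requirements differ across stages of the diffusion sampling process opens the door to mixing models of different sizes for efficiency.
Result: On LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference and 57.35% fewer FLOPs while keeping visual quality, temporal coherence, and semantic alignment on par with the large models. It is also compatible with existing sampling-acceleration techniques, giving up to 2x additional speedup.
Insight: The key contribution is showing that capacity requirements vary by stage during diffusion sampling and building a stage-aware mixed-model sampling strategy on that finding. The velocity-divergence analysis, used as a proxy for locating capacity-sensitive regions, and the simple criteria for choosing stage boundaries are useful engineering insights. The method is a model-level optimization that can be stacked with existing techniques.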
Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
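The stage-aware schedule in this abstract (large model early and late, small model in between) can be sketched with a toy Euler sampler. The boundary fractions `early_frac` and `late_frac`, the scalar state, and the two velocity functions are assumptions made for illustration; the paper selects boundaries with its own criteria and velocity-divergence analysis, which are not reproduced here.

```python
def stage_aware_sample(x0, large_v, small_v, num_steps=10,
                       early_frac=0.3, late_frac=0.3):
    """Euler integration that switches velocity models by sampling stage.

    large_v / small_v: callables (x, t) -> velocity, standing in for a large
    and a small flow model. The boundary fractions are hypothetical values.
    """
    early_end = int(round(num_steps * early_frac))
    late_start = num_steps - int(round(num_steps * late_frac))
    dt = 1.0 / num_steps
    x, schedule = x0, []
    for step in range(num_steps):
        t = step * dt
        # Capacity-sensitive stages (early and late) use the large model.
        use_large = step < early_end or step >= late_start
        velocity = (large_v if use_large else small_v)(x, t)
        x = x + dt * velocity
        schedule.append("large" if use_large else "small")
    return x, schedule

# Toy velocity fields: the "small" model is a slightly less accurate copy.
def toy_large(x, t):
    return 1.0 - x

def toy_small(x, t):
    return 0.9 * (1.0 - x)

final_x, schedule = stage_aware_sample(0.0, toy_large, toy_small)
print(schedule)
```

With these defaults, 4 of the 10 steps run on the small model, which is where the compute saving would come from in a real sampler.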
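[80] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation cs.CV
Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang
TL;DR: This paper introduces the EchoFoley task to address visual dominance, insufficient fine-grained control, and weak instruction understanding in video sound generation. It adds event-level and hierarchical semantic control, builds EchoFoley-6k, a large-scale benchmark of more than 6,000 video-instruction-annotation triplets, and proposes EchoVidia, an event-centric generation framework with a slow-fast thinking strategy. Experiments show that the framework substantially outperforms existing VT2A models in controllability and perceptual quality.
Details
Motivation: Current video-text-to-audio (VT2A) methods have three limitations: an imbalance between visual and textual conditioning that leads to visual dominance, no concrete definition of fine-grained controllable generation, and weak instruction understanding because datasets rely on brief categorical tags. More finely controllable video sound generation is needed.
Result: On the EchoFoley-6k benchmark, EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality, achieving state-of-the-art performance.
Insight: Contributions include: (1) a symbolic representation of sounding events that specifies when, what, and how each sound is produced within a video or instruction, supporting fine-grained control such as generation, insertion, and editing; (2) an event-centric agentic generation framework with a slow-fast thinking strategy that strengthens hierarchical semantic control; (3) a large-scale, expert-annotated dataset that provides a high-quality basis for training and evaluating controllable sound generation.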
Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
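[81] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control cs.CV | cs.AI
Jason Armitage, Rico Sennrich
TL;DR: This paper proposes a method that improves multivariate mutual information estimates through regret minimisation with derivative-free optimisation. It lets off-the-shelf cross-modal systems trained on 2D visual inputs adapt online to object occlusions and differentiate features in multi-object 3D scenes, improving cross-modal task performance without pretraining or finetuning.
Details
Motivation: Cross-modal systems face a dimensional shift when processing 3D scenes. An in-scene camera can bridge the dimensionality gap, but it requires learning a control module.
Result: The method improves performance on cross-modal tasks in multi-object 3D scenes without resorting to pretraining or finetuning.
Insight: The novelty lies in pairing expressive measures with value-based optimisation to control an in-scene camera that learns directly from the noisy outputs of vision-language models, and in using derivative-free optimisation to improve mutual information estimates for online adaptation.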
Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.
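[82] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents cs.CV | cs.RO
Xunyi Zhao, Gengze Zhou, Qi Wu
TL;DR: This paper presents VLN-MME, a unified and extensible evaluation framework for systematically assessing multimodal large language models (MLLMs) as zero-shot agents on Vision-and-Language Navigation (VLN) tasks. The framework standardizes traditional navigation datasets. The authors observe that enhancing the baseline agent (for example with Chain-of-Thought and self-reflection) actually lowers performance, exposing the limited 3D spatial reasoning of MLLMs in embodied navigation.
Details
Motivation: MLLMs perform well across many vision-language tasks, but their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, remains underexplored. The paper uses a standardized benchmark to diagnose MLLM capabilities in language-guided visual navigation.
Result: Experiments on VLN-MME show that enhancing the baseline agent (for example by adding Chain-of-Thought reasoning and self-reflection) leads to an unexpected performance drop, indicating poor context awareness and low 3D spatial reasoning fidelity in embodied navigation.
Insight: The main contribution is VLN-MME, a highly modular and accessible unified evaluation framework for systematically comparing MLLM architectures, agent designs, and navigation tasks. A key finding is that MLLMs follow instructions and structure their output well but have fundamental limitations in 3D spatial reasoning for embodied navigation, which offers guidance for post-training MLLMs as embodied agents.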
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.
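[83] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation cs.CV
Meng Lan, Lefei Zhang, Xiaomeng Li
TL;DR: This paper proposes OFL-SAM2, a prompt-free SAM2 framework for label-efficient medical image segmentation. Its core idea is to use a small number of annotated samples to train a lightweight mapping network online; the network captures medical knowledge and transforms generic image features into target features, providing additional discriminative target representations for each frame and removing the need for manual prompts.
Details
Motivation: Adapting SAM2 to medical image segmentation is challenging: it requires extensive annotated data for fine-tuning and high-quality manual prompts, both of which are labor-intensive and need medical experts.
Result: Extensive experiments on three different medical image segmentation datasets show that OFL-SAM2 achieves state-of-the-art performance with limited training data.
Insight: The novelty is an online few-shot learner that supports online parameter updates and generates target features, plus an adaptive fusion module that dynamically integrates those target features with the memory-attention features produced by the frozen SAM2. Together they give accurate, robust target representations and enable efficient, prompt-free medical image segmentation.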
Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter update during inference, enhancing the model’s generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.
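[84] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation cs.CV | cs.CE
Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia
TL;DR: This paper introduces FinMMDocR, a bilingual multimodal benchmark for evaluating multimodal large language models on real-world financial numerical reasoning. Compared with existing benchmarks, it advances in three respects: scenario awareness, document understanding, and multi-step computation.
Details
Motivation: Existing benchmarks fall short in evaluating MLLMs on complex, realistic financial multimodal reasoning, particularly in scenario awareness, deep document understanding, and multi-step computation.
Result: On FinMMDocR, the best-performing MLLM reaches only 58.0% accuracy, and different retrieval-augmented generation methods show significant performance differences on the task.
Insight: The novelty is a comprehensive financial multimodal reasoning benchmark that integrates scenario awareness, deep document understanding, and multi-step computation. Its problem design (implicit scenarios, long documents, cross-page evidence, multi-step reasoning) is closer to real financial analysis and effectively exposes the limitations of current MLLMs and RAG methods.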
Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
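[85] VIPER: Process-aware Evaluation for Generative Video Reasoning cs.CV
Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du
TL;DR: The paper presents VIPER, a process-aware evaluation benchmark and paradigm for Generative Video Reasoning (GVR), designed to address the "outcome-hacking" that results from existing single-frame evaluation. VIPER contains 16 tasks spanning temporal, structural, symbolic, spatial, physics, and planning reasoning, and introduces Process-outcome Consistency (POC@r), a new metric that uses VLM-as-Judge with a hierarchical rubric. Experiments show that state-of-the-art video models score only about 20% POC@1.0 and exhibit significant outcome-hacking, revealing a large gap between current video generation and true generalized visual reasoning.
Details
Motivation: Video generation models show Chain-of-Frames reasoning ability, but evaluation frameworks mostly rely on single-frame assessment, which invites outcome-hacking, where a model reaches a correct conclusion through an erroneous process. A process-aware evaluation paradigm is needed to measure the quality of generative video reasoning more accurately.
Result: State-of-the-art video models achieve only about 20% POC@1.0, indicating severe outcome-hacking. Analyses of test-time scaling and sampling robustness further highlight the large gap between current models and true generalized visual reasoning.
Insight: The novelty is the first comprehensive benchmark (VIPER) and quantitative metric (POC@r) focused on evaluating the process of generative video reasoning. Using VLM-as-Judge with a hierarchical rubric, it systematically assesses both the validity of intermediate steps and the final result, providing a more rigorous evaluation framework for future model development and showing that process consistency is a key weakness of current video generation models.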
Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
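[86] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions cs.CV | cs.AI
Itallo Patrick Castro Alves Da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza, Baldoino Fonseca dos Santos Neto, Marcio de Medeiros Ribeiro
TL;DR: This paper comprehensively evaluates how compression techniques (quantization, pruning, and weight clustering, applied individually or in combination) affect the robustness of convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2) under natural corruptions. Using the CIFAR-10-C and CIFAR-100-C datasets, it analyzes the trade-offs between robustness, accuracy, and compression ratio. The results show that certain compression strategies not only preserve but can improve robustness, particularly on more complex architectures. A multi-objective assessment identifies the best configurations and shows that customized combinations of techniques yield beneficial multi-objective results.
Details
Motivation: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices, but compression may affect robustness, especially under natural corruptions. Robustness evaluation should therefore be part of validating computer vision systems.
Result: Experiments on the CIFAR-10-C and CIFAR-100-C benchmarks show that certain compression strategies (such as customized combinations) not only preserve but can improve robustness, particularly on networks with complex architectures, while balancing compression ratio and accuracy in a multi-objective sense.
Insight: The novelty is a systematic evaluation of multiple compression techniques (quantization, pruning, weight clustering) and their combinations with respect to robustness under natural corruptions, together with a multi-objective assessment for identifying the best configurations. The study reveals a non-intuitive relationship between compression and robustness (some compression can enhance robustness) and offers guidance on choosing compression methods for robust, efficient deployment in corrupted real-world environments.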
Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruptions. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques (quantization, pruning, and weight clustering), applied individually and in combination, to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multi-objective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.
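To make "applied individually and in combination" concrete, here is a minimal sketch of two of the named techniques, magnitude pruning and symmetric uniform quantization, chained on a flat list of weights. The abstract does not say which pruning or quantization variants the authors use, so both functions and the toy weights are illustrative assumptions; weight clustering is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def quantize_uniform(weights, num_bits=8):
    """Symmetric uniform quantization followed by dequantization to floats."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    levels = 2 ** (num_bits - 1) - 1
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

# Toy weights (hypothetical); real layers would hold thousands of values.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
compressed = quantize_uniform(pruned, num_bits=4)  # prune, then quantize
print(pruned)
print(compressed)
```

Pruned zeros stay exactly zero after quantization, so the two steps compose without undoing each other; the order and settings that work best for robustness are what a study like this one measures empirically.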
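[87] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments cs.CV | cs.AI | cs.LG | cs.RO
Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
TL;DR: This paper presents DarkEQA, an open-source benchmark for evaluating perceptual capabilities relevant to Embodied Question Answering (EQA) under multiple low-light conditions. By simulating physically realistic low-light degradation (modeling illumination drop and sensor noise in RAW space), the benchmark isolates the perception bottleneck to assess vision-language models (VLMs) and Low-Light Image Enhancement (LLIE) models in dark indoor environments.
Details
Motivation: Existing benchmarks mainly evaluate VLMs under ideal lighting and overlook robustness in dark environments such as at night, which is essential for embodied agents that operate around the clock.
Result: Extensive evaluation of state-of-the-art VLMs and LLIE models on DarkEQA systematically reveals their limitations under challenging low-light visual conditions.
Insight: The main contribution is a physically realistic low-light EQA benchmark that ensures fidelity by simulating physics-based degradation in RAW space, enabling attributable analysis of model robustness. It offers a new tool and perspective for evaluating and improving perception under real-world degradation.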
Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs’ limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
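The idea of degrading images in linear space rather than on display-encoded values can be sketched as follows. This is a simplified illustration, not the DarkEQA pipeline: it treats grayscale pixel values in [0, 1], uses the standard sRGB transfer function as a stand-in for the ISP, approximates shot noise with a signal-dependent Gaussian, and the parameters `illumination`, `shot_gain`, and `read_std` are hypothetical.

```python
import random

def srgb_to_linear(v):
    """Invert the standard sRGB transfer function for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    """Apply the standard sRGB transfer function to a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055

def degrade_low_light(pixels, illumination=0.1, shot_gain=0.01,
                      read_std=0.002, seed=0):
    """Darken and add noise in linear space, then re-encode for display."""
    rng = random.Random(seed)
    out = []
    for p in pixels:
        linear = srgb_to_linear(p) * illumination           # illumination drop
        shot_std = (shot_gain * linear) ** 0.5              # variance grows with signal
        noisy = linear + rng.gauss(0.0, shot_std) + rng.gauss(0.0, read_std)
        noisy = min(1.0, max(0.0, noisy))                   # clip to valid range
        out.append(linear_to_srgb(noisy))                   # simplified "ISP" step
    return out

pixels = [0.2, 0.5, 0.8, 1.0]
dark = degrade_low_light(pixels)
print(dark)
```

Scaling in linear space matters because the display encoding is nonlinear: multiplying encoded values by 0.1 would not correspond to a tenfold drop in scene light.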
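[88] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM cs.CV
Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng
TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system. Guided by depth foundation models, it couples flow estimation with geometric reasoning to address the lack of geometric consistency in earlier flow-based methods, achieving accurate and robust tracking and mapping.
Details
Motivation: Earlier flow-based SLAM methods lack geometric consistency; the paper addresses this to achieve more accurate and robust dense visual SLAM.
Result: Across multiple challenging datasets, FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality while running in real time at 18 FPS, showing strong generalization and practical applicability.
Insight: The core novelty is using depth foundation models to guide geometric reasoning, specifically: (1) a Hybrid Flow Network that produces geometry-aware correspondences; (2) a Bi-Consistent Bundle Adjustment Layer for joint optimization; (3) a Reliability-Aware Refinement mechanism that closes the feedback loop. This suggests a new way to combine foundation models with SLAM.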
Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.
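[89] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing cs.CV
Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen
TL;DR: This paper proposes a self-bootstrapping framework that reframes visual dubbing from a mask-based inpainting paradigm into a well-conditioned video-to-video editing problem. It first uses a Diffusion Transformer to generate ideal training data (video pairs in which the lip movements differ but all other visual conditions are identical), then trains a DiT-based audio-driven editor end-to-end on these aligned pairs, achieving precise lip sync, identity preservation, and robustness.
Details
Motivation: Existing audio-driven visual dubbing methods lack ideal training data (paired videos that differ only in lip movements), so they usually adopt a mask-based inpainting paradigm. The model must then generate missing content and sync the lips at the same time, which causes visual artifacts, identity drift, and poor synchronization.
Result: Evaluated on the proposed ContextDubBench benchmark, the method achieves highly accurate lip sync, faithful identity preservation, and exceptional robustness in challenging in-the-wild scenarios.
Insight: The core novelty is reframing visual dubbing as a well-conditioned video editing problem and using a self-bootstrapping framework to generate ideal aligned training data, so the editor can focus on precise audio-driven lip modification. In addition, the timestep-adaptive multi-phase learning strategy disentangles conflicting editing objectives across diffusion timesteps, stabilizing training and improving synchronization and visual fidelity.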
Abstract: Audio-driven visual dubbing aims to synchronize a video’s lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject’s lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
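[90] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images cs.CV
Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang
TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single forward pass, directly from sparse, unposed, view-inconsistent, instruction-edited images. It requires no per-scene optimization or pose estimation and quickly produces instruction-aligned, photorealistic 3D edits.
Details
Motivation: Existing 3D scene editing methods rely on per-scene optimization, which is computationally expensive and slow. The goal is fast, realistic 3D editing without pose estimation or optimization.
Result: On the new DL3DV-Edit-Bench benchmark (20 scenes, 4 edit types, 100 edits in total), Edit3r outperforms recent baselines in semantic alignment and 3D consistency, with significantly faster inference.
Insight: Core contributions: (1) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision to compensate for missing training data; (2) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, guiding the network to fuse and align inconsistent observations. At inference, the model generalizes to images edited by 2D methods not seen during training, such as InstructPix2Pix.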
Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
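[91] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time cs.CV | cs.AI | cs.RO
Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang
TL;DR: SpaceTimePilot is a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, it can independently change the camera viewpoint and the motion sequence during generation, allowing continuous, arbitrary exploration across space and time.
Details
Motivation: Existing methods struggle to control space (camera viewpoint) and time (motion sequence) independently and precisely in video generation; the paper targets flexible re-rendering of dynamic scenes.
Result: Evaluated on real-world and synthetic data, the model shows clear space-time disentanglement and strong results compared with prior work.
Insight: Contributions include: an animation time-embedding mechanism for explicit control of the motion sequence; a temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences for supervision; an improved camera-conditioning mechanism; and CamxTime, the first synthetic space-and-time full-coverage rendering dataset, which improves the precision of dual control.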
Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video’s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
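cs.RO
[92] Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation cs.RO | cs.CV
Guo Ye, Zexi Zhang, Xu Zhao, Shang Wu, Haoran Lu
TL;DR: This paper proposes DreamTacVLA, a framework that addresses the limited awareness of physical contact in existing vision-language-action models on contact-rich manipulation tasks. It integrates high-resolution tactile images, wrist-camera local vision, and third-person macro vision, and trains with a Hierarchical Spatial Alignment loss and a tactile world model so that the model can predict future tactile signals and better understand contact physics.
Details
Motivation: Existing VLA models can map web-scale knowledge to robotic control but are blind to physical contact, so they perform poorly on contact-rich manipulation tasks that require reasoning about force, texture, and slip.
Result: On contact-rich manipulation tasks, DreamTacVLA outperforms state-of-the-art VLA baselines, with success rates of up to 95%.
Insight: Contributions include: a hierarchical perception scheme that integrates multi-scale sensory inputs; a Hierarchical Spatial Alignment loss for training a unified policy; a tactile world model that predicts future tactile signals to deepen understanding of contact dynamics; and a hybrid large-scale dataset that mitigates data scarcity and sensor wear. Together these strengthen the model's grasp of contact physics and enable more robust, touch-aware robot control.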
Abstract: Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model’s understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, highlighting the importance of understanding physical contact for robust, touch-aware robotic agents.
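[93] RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation cs.RO | cs.CV
Ming-Ming Yu, Yi Chen, Börje F. Karlsson, Wenjun Wu
TL;DR: This paper proposes RANGER, a zero-shot, open-vocabulary semantic navigation framework that uses only a monocular camera. By leveraging powerful 3D foundation models, it removes the dependency on depth and pose information and shows strong in-context learning: simply observing a short video of a new environment significantly improves task efficiency, with no architectural changes or fine-tuning.
Details
Motivation: Existing zero-shot object goal navigation methods have two key limitations: heavy reliance on precise depth and pose information from simulators, which restricts real-world applicability, and a lack of in-context learning capability, which makes quick adaptation to new environments difficult.
Result: Experiments on the HM3D benchmark and in real-world environments show that RANGER achieves competitive navigation success rate and exploration efficiency and superior in-context adaptability, without requiring a prior 3D map of the environment.
Insight: The novelty is a monocular zero-shot navigation framework that does not depend on depth or pose and integrates keyframe-based 3D reconstruction, semantic point cloud generation, VLM-driven exploration value estimation, and adaptive waypoint selection, enabling fast in-context adaptation from a short video.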
Abstract: Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, as in leveraging short videos. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can also significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no previous 3D mapping of the environment required.
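[94] Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow cs.RO | cs.AI | cs.CV
Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang
TL;DR: Dream2Flow is a framework that connects video generation models to robot control. Using 3D object flow as an intermediate representation, it reconstructs 3D object motion from generated videos and casts manipulation as object trajectory tracking, enabling zero-shot open-world object manipulation.
Details
Motivation: The challenge is translating the plausible, human-led object motions synthesized by video generation models into the low-level actions that robotic systems require, overcoming the embodiment gap.
Result: In simulation and real-world experiments, Dream2Flow manipulates objects of diverse categories zero-shot, including rigid, articulated, deformable, and granular objects, demonstrating that 3D object flow is an effective and scalable general interface.
Insight: The novelty is using 3D object flow as the intermediate representation between video generation and robot control. Separating state changes from the actuators that realize them allows zero-shot guidance from pre-trained video models, and trajectory optimization or reinforcement learning then produces executable commands without task-specific demonstrations.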
Abstract: Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories-including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at https://dream2flow.github.io/.
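cs.SD
[95] AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives cs.SD | cs.AI | cs.CL | cs.MM
Yanxi Chen, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Xin Li
TL;DR: This paper proposes the AHA framework, which builds a high-quality preference dataset through counterfactual hard negative mining to reduce hallucinations in large audio-language models, and establishes the AHA-Eval diagnostic benchmark. Aligning Qwen2.5-Omni with this framework yields Qwen-Audio-AHA, which improves by 13.7% on AHA-Eval and by 1.3% and 1.6% on the public benchmarks MMAU-Test and MMAR, respectively, outperforming recent SOTA methods.
Details
Motivation: Large audio-language models reach SOTA performance but often hallucinate, generating text that is not grounded in the audio input. The paper targets these grounding failures and identifies a fine-grained taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error.
Result: Qwen-Audio-AHA improves by 13.7% on the AHA-Eval diagnostic benchmark and by 1.3% on MMAU-Test and 1.6% on MMAR, outperforming recent SOTA methods.
Insight: Contributions include a fine-grained taxonomy of hallucinations in audio-language models; counterfactual hard negative mining to build a preference dataset that forces the model to distinguish strict acoustic evidence from linguistically plausible fabrications; the AHA-Eval diagnostic benchmark for rigorously testing temporal reasoning; and gains that generalize beyond the diagnostic set to public benchmarks.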
Abstract: Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods.
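eess.SP
[96] A multimodal Transformer for InSAR-based ground deformation forecasting with cross-site generalization across Europe eess.SP | cs.AI | cs.CV | cs.LG
Wendong Yao, Binhua Huang, Soumyabrata Dev
TL;DR: The paper proposes a multimodal Transformer that forecasts the next-epoch ground displacement map from European Ground Motion Service (EGMS) InSAR time series. The model fuses recent displacement snapshots, static kinematic indicators (mean velocity, acceleration, seasonal amplitude), and harmonic day-of-year encodings. On the eastern Ireland tile (E32N34), it clearly outperforms the CNN-LSTM, CNN-LSTM+Attn, and multimodal STGCN baselines when all models receive the same multimodal inputs.
Details
Motivation: Near-real-time, regional-scale monitoring of ground deformation matters for urban planning, critical infrastructure management, and natural hazard mitigation. InSAR and services such as EGMS provide dense observations of past motion, but predicting the next observation remains hard because long-term trends, seasonal cycles, and occasional abrupt discontinuities (e.g., co-seismic steps) are superimposed on strong spatial heterogeneity.
Result: On the test set for the eastern Ireland tile (E32N34), with all models receiving the same multimodal inputs, the multimodal Transformer achieves the best performance, RMSE = 0.90 mm and R² = 0.97, along with the highest threshold accuracies, clearly ahead of CNN-LSTM, CNN-LSTM+Attn, and multimodal STGCN.
Insight: The contribution is a patch-based Transformer for InSAR deformation forecasting that fuses multimodal inputs (displacement snapshots, static kinematic indicators, and harmonic time encodings). The paper also stresses computing the static indicators in a leakage-safe manner from the training window only, which helps generalization and predictive accuracy.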
Abstract: Near-real-time regional-scale monitoring of ground deformation is increasingly required to support urban planning, critical infrastructure management, and natural hazard mitigation. While Interferometric Synthetic Aperture Radar (InSAR) and continental-scale services such as the European Ground Motion Service (EGMS) provide dense observations of past motion, predicting the next observation remains challenging due to the superposition of long-term trends, seasonal cycles, and occasional abrupt discontinuities (e.g., co-seismic steps), together with strong spatial heterogeneity. In this study we propose a multimodal patch-based Transformer for single-step, fixed-interval next-epoch nowcasting of displacement maps from EGMS time series (resampled to a 64x64 grid over 100 km x 100 km tiles). The model ingests recent displacement snapshots together with (i) static kinematic indicators (mean velocity, acceleration, seasonal amplitude) computed in a leakage-safe manner from the training window only, and (ii) harmonic day-of-year encodings. On the eastern Ireland tile (E32N34), the STGCN is strongest in the displacement-only setting, whereas the multimodal Transformer clearly outperforms CNN-LSTM, CNN-LSTM+Attn, and multimodal STGCN when all models receive the same multimodal inputs, achieving RMSE = 0.90 mm and $R^2$ = 0.97 on the test set with the best threshold accuracies.
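The two auxiliary inputs named in the abstract, harmonic day-of-year encodings and leakage-safe static indicators, can be illustrated with a minimal sketch. The period of 365.25 days, the number of harmonics, the least-squares slope as a stand-in for "mean velocity", and the function names are all assumptions made for illustration; the paper's exact definitions are not given in the abstract.

```python
import math

def harmonic_doy(day_of_year, period=365.25, n_harmonics=1):
    """Return [sin, cos] pairs for the first n_harmonics of the annual cycle.

    period and n_harmonics are illustrative assumptions.
    """
    feats = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * math.pi * k * day_of_year / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

def mean_velocity(times, displacements, train_end):
    """Least-squares slope of displacement over time.

    Only samples with index < train_end are used, so nothing from the
    validation or test period leaks into the static indicator.
    """
    t = times[:train_end]
    d = displacements[:train_end]
    n = len(t)
    t_mean = sum(t) / n
    d_mean = sum(d) / n
    num = sum((ti - t_mean) * (di - d_mean) for ti, di in zip(t, d))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return num / den
```

In this sketch an abrupt jump after `train_end` does not change the velocity estimate, which is the property the leakage-safe computation is meant to guarantee.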
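cs.LG
[97] Trellis: Learning to Compress Key-Value Memory in Attention Models cs.LG | cs.CL
Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni
TL;DR: This paper proposes Trellis, a new Transformer architecture that addresses the quadratic computational complexity and unbounded memory growth caused by the KV cache in standard Transformers. Using a fixed-size memory module and a two-pass recurrent compression mechanism, Trellis learns to compress its key-value memory dynamically at inference time, which makes long sequences efficient to process.
Details
Motivation: The attention mechanism's key-value cache keeps growing, which drives up computation and memory consumption, especially on long sequences.
Result: Across language modeling, common-sense reasoning, recall-intensive tasks, and time series benchmarks, Trellis outperforms strong baselines, and its advantage grows with sequence length, indicating potential for long-context applications.
Insight: Trellis replaces the standard KV cache with a fixed-size memory module and introduces a two-pass recurrent compression mechanism based on online gradient descent with a forget gate. The model can thus learn and update its compressed memory at inference time while retaining important contextual information, offering a scalable approach to long-sequence processing.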
Abstract: Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
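The idea of a bounded memory updated by online gradient descent with a forget gate can be sketched with a generic linear associative memory. This is not the Trellis update rule, whose two-pass mechanism is not specified in the abstract. The squared-error loss, the constant `forget` value, the learning rate, and the function names are assumptions chosen only to show how a fixed-size state can absorb new key-value pairs.

```python
def update_memory(M, key, value, lr=0.5, forget=0.1):
    """One online gradient step on 0.5 * ||M @ key - value||^2 with decay.

    M is a fixed-size matrix (list of rows); its shape never grows with
    sequence length. `forget` is a constant here, an assumption; a learned
    gate would make it input-dependent.
    """
    # Prediction of the current memory for the incoming key.
    pred = [sum(m_ij * k_j for m_ij, k_j in zip(row, key)) for row in M]
    err = [p - v for p, v in zip(pred, value)]
    # Decay the old memory, then move against the gradient err * key^T.
    return [
        [(1.0 - forget) * m_ij - lr * e_i * k_j for m_ij, k_j in zip(row, key)]
        for row, e_i in zip(M, err)
    ]

def read_memory(M, query):
    """Retrieve a value estimate for a query key."""
    return [sum(m_ij * q_j for m_ij, q_j in zip(row, query)) for row in M]
```

With `lr=1.0`, `forget=0.0`, and a unit-norm key, a single update stores the pair exactly; a nonzero forget gate trades that exactness for the ability to overwrite stale associations.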
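[98] Many Minds from One Model: Bayesian Transformers for Population Intelligence cs.LG | cs.CL
Diji Yang, Yi Zhang
TL;DR: This paper proposes Population Bayesian Transformers (B-Trans), which convert a standard large language model into a Bayesian Transformer that can sample diverse yet coherent model instances from a single set of pre-trained weights. The method treats the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, giving a Bayesian-motivated posterior proxy that induces a distribution over model behavior without training a full Bayesian neural network. The sampled noise is frozen at the sequence level to keep each generation consistent, and predictions from multiple sampled individuals are aggregated for population-level decisions, which enhances exploration.
Details
Motivation: Modern Transformers are usually trained as single deterministic systems, whereas intelligence may emerge from a collection of many minds. The paper addresses the lack of behavioral diversity in a single model and explores how to efficiently sample instances from one pre-trained model that behave differently yet retain general competence, so as to exploit the wisdom of crowds.
Result: Experiments on zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels show that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance than deterministic baselines.
Insight: The novelty is a lightweight Bayesian posterior proxy that induces a distribution over model behavior by randomizing the bias-like offsets in normalization layers, avoiding the cost of training a full Bayesian network. Freezing the noise at the sequence level keeps generation coherent, and aggregating population decisions improves exploration, offering a new way to obtain diversity from a single model.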
Abstract: Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.
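A minimal sketch of the sampling scheme described in the abstract: Gaussian noise is added to the offset of a normalization layer once per sequence and then reused for every token. Centering the Gaussian on the pre-trained bias, using a single shared `sigma`, seeding per sequence, and the function names are all assumptions for illustration; the paper's variational approximation may be parameterized differently.

```python
import math
import random

def layer_norm(x, gain, bias, eps=1e-5):
    """Plain layer normalization over one token vector."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [g * (xi - mean) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gain, bias)]

def sample_individual(bias, sigma, rng):
    """Draw one model instance: pre-trained bias plus Gaussian noise."""
    return [b + rng.gauss(0.0, sigma) for b in bias]

def run_sequence(tokens, gain, bias, sigma, seed):
    """Apply the layer to every token of one sequence.

    The noisy bias is sampled once and frozen for the whole sequence,
    so every token sees the same sampled individual.
    """
    rng = random.Random(seed)
    noisy_bias = sample_individual(bias, sigma, rng)
    return [layer_norm(x, gain, noisy_bias) for x in tokens]
```

Calling `run_sequence` with several seeds produces a small population of instances; their outputs can then be aggregated, for example by voting, which is one way to read the population-level decision-making the abstract mentions.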
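[99] Scaling Open-Ended Reasoning to Predict the Future cs.LG | cs.CL
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
TL;DR: This paper presents OpenForecaster 8B, a language-model-based system for open-ended forecasting. It builds the OpenForesight dataset by automatically synthesizing forecasting questions from news and trains Qwen3 thinking models to predict future events. The system uses an offline news corpus to prevent information leakage and combines retrieval with an improved reinforcement learning reward function. In held-out testing from May to August 2025, the model matches much larger proprietary models, with training improving accuracy, calibration, and consistency, and the calibration gains generalize to other benchmarks.
Details
Motivation: High-stakes decision making requires open-ended reasoning under uncertainty about the future. The paper automates the synthesis of large-scale forecasting data and trains language models to predict future events, aiming to improve the reliability and scalability of forecasts.
Result: In held-out testing from May to August 2025, OpenForecaster 8B matches much larger proprietary models, with training improving the accuracy, calibration, and consistency of its predictions. The calibration improvements generalize across popular benchmarks.
Insight: Contributions include automatically synthesizing open-ended forecasting questions from news to build the dataset, using an offline corpus to prevent information leakage, and combining retrieval with an improved RL reward function. Viewed objectively, the data generation and training strategy appear scalable and likely to generalize.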
Abstract: High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
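[100] A Granular Grassmannian Clustering Framework via the Schubert Variety of Best Fit cs.LG | cs.CG | cs.CV | cs.DC
Karim Salta, Michael Kirby, Chris Peterson
TL;DR: This paper proposes a Grassmannian clustering framework based on the Schubert Variety of Best Fit (SVBF). It replaces subspace means with a trainable SVBF prototype, defined as a subspace that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrated into the Linde-Buzo-Grey (LBG) pipeline, the method achieves higher cluster purity on synthetic, image, spectral, and video action data while retaining the mathematical structure needed for downstream analysis.
Details
Motivation: In classification and clustering of subspace-valued data, conventional geometric representatives on the Grassmann or flag manifold, such as means or medians, may not be precise enough. The paper aims to improve clustering by introducing a more flexible Schubert variety prototype.
Result: The SVBF-LBG scheme yields higher cluster purity than conventional methods on synthetic, image, spectral, and video action data.
Insight: The novelty is integrating a Schubert variety as a trainable prototype in the LBG clustering pipeline, combining geometric structure with adaptivity to the data and providing a new mathematical framework for subspace clustering.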
Abstract: In many classification and clustering tasks, it is useful to compute a geometric representative for a dataset or a cluster, such as a mean or median. When datasets are represented by subspaces, these representatives become points on the Grassmann or flag manifold, with distances induced by their geometry, often via principal angles. We introduce a subspace clustering algorithm that replaces subspace means with a trainable prototype defined as a Schubert Variety of Best Fit (SVBF) - a subspace that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrated in the Linde-Buzo-Grey (LBG) pipeline, this SVBF-LBG scheme yields improved cluster purity on synthetic, image, spectral, and video action data, while retaining the mathematical structure required for downstream analysis.
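The abstract's notion of a prototype that "comes as close as possible to intersecting" a cluster member can be made concrete with the angle between a single direction and a subspace. The sketch below is not the SVBF optimization itself. It only computes, for one candidate direction, how far that direction is from lying inside a given subspace, using Gram-Schmidt and a projection. The function names and the choice of a single direction are assumptions for illustration.

```python
import math

def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt; returns an orthonormal basis of the span."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            dot = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - dot * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def angle_to_subspace(direction, spanning_vectors):
    """Angle (radians) between a direction and the span of the given vectors.

    The angle is 0 exactly when the direction lies in the subspace,
    i.e. when the line and the subspace intersect non-trivially.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    u = [d / norm for d in direction]
    basis = orthonormalize(spanning_vectors)
    proj_sq = sum(sum(ui * qi for ui, qi in zip(u, q)) ** 2 for q in basis)
    return math.acos(min(1.0, math.sqrt(proj_sq)))
```

Summing such angles over all members of a cluster gives one possible measure of how well a candidate direction fits that cluster, which is the kind of quantity a best-fit prototype would try to make small.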
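[101] GARDO: Reinforcing Diffusion Models without Reward Hacking cs.LG | cs.AI | cs.CV
Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang
TL;DR: This paper proposes GARDO, a framework that uses gated adaptive regularization and diversity-aware optimization to mitigate reward hacking when fine-tuning diffusion models with reinforcement learning, while preserving sample efficiency and exploration.
Details
Motivation: When diffusion models are fine-tuned with online RL, a mismatch between the proxy reward and the true objective leads to reward hacking: proxy scores rise while real image quality deteriorates and diversity collapses. Existing regularization against a reference policy sacrifices sample efficiency and impedes exploration.
Result: Extensive experiments across diverse proxy rewards and held-out unseen metrics show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration.
Insight: The novelty lies in selectively regularizing high-uncertainty samples, adaptively updating the reference model so the regularization target stays relevant, and amplifying rewards in a diversity-aware way to encourage mode coverage, thereby balancing the competing demands of preventing reward hacking, preserving sample efficiency, and promoting exploration.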
Abstract: Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
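The gating idea in the abstract, penalizing only the subset of samples with high uncertainty, can be sketched as follows. The abstract does not say how uncertainty is measured or how the subset is chosen, so the top-fraction rule, the per-sample KL values, the `beta` weight, and the function names are assumptions for illustration only.

```python
def gated_penalties(kl_values, uncertainties, beta=0.1, top_fraction=0.25):
    """Return a per-sample regularization penalty.

    Only the top_fraction of samples ranked by uncertainty receive the
    penalty beta * KL; all other samples are left unregularized.
    """
    n = len(uncertainties)
    k = max(1, int(round(top_fraction * n)))
    ranked = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    gated = set(ranked[:k])
    return [beta * kl_values[i] if i in gated else 0.0 for i in range(n)]

def shaped_rewards(proxy_rewards, kl_values, uncertainties, **kwargs):
    """Proxy reward minus the gated penalty, per sample."""
    penalties = gated_penalties(kl_values, uncertainties, **kwargs)
    return [r - p for r, p in zip(proxy_rewards, penalties)]
```

Compared with applying the penalty to every sample, this leaves confident samples free to move away from the reference policy, which is the sample-efficiency argument the abstract makes.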
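[102] Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning cs.LG | cs.CV
Soham Pahari, M. Srinivas
TL;DR: This paper proposes ViReLoc, a visual reasoning framework for localization that plans and localizes using only visual representations. It learns spatial dependencies and geometric relations to overcome the limits of text-based reasoning in spatial tasks such as visual navigation and geo-localization.
Details
Motivation: Current multimodal reasoning systems rely mainly on text for inference, which limits their effectiveness in spatial tasks such as visual navigation and geo-localization. A vision-based reasoning paradigm is needed to improve spatial understanding.
Result: Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance, supporting visual reasoning as a strong complementary approach for navigation and localization that does not require real-time GPS data.
Insight: The novelty is the Geo-Consistent Visual Planning paradigm, which optimizes step-by-step reasoning with reinforcement learning objectives and integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences, pointing toward more secure navigation.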
Abstract: Multimodal intelligence has recently shown strong progress in visual understanding and high-level reasoning. However, most reasoning systems still rely on textual information as the main medium for inference, which limits their effectiveness in spatial tasks such as visual navigation and geo-localization. This work discusses the potential scope of this field and proposes a visual reasoning paradigm, Geo-Consistent Visual Planning, realized in a framework called Visual Reasoning for Localization (ViReLoc), which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often struggles to capture. By encoding step-by-step inference in the visual domain and optimizing with reinforcement-based objectives, ViReLoc plans routes between two given ground images. The system also integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance. These results establish visual reasoning as a strong complementary approach for navigation and localization, and show that such tasks can be performed without real-time global positioning system data, leading to more secure navigation solutions.
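cs.IR
[103] RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment cs.IR | cs.AI | cs.CL | cs.LG
Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang
TL;DR: This paper proposes RAIR (Rule-Aware benchmark with Image for Relevance assessment), a Chinese benchmark dataset for e-commerce relevance assessment. Derived from real-world scenarios, it addresses the lack of sufficient complexity and standardized metrics in existing benchmarks. RAIR establishes a standardized evaluation framework with a set of universal rules and comprises three subsets, a general subset, a long-tail hard subset, and a visual salience subset, to assess fundamental competencies, performance limits, and multimodal understanding.
Details
Motivation: Existing relevance benchmarks lack sufficient complexity, leaving the industry without standardized evaluation metrics and unable to assess model performance comprehensively.
Result: Experiments on RAIR with 14 open- and closed-source models show that the benchmark remains sufficiently challenging even for the best-performing model, GPT-5.
Insight: The novelty is a comprehensive benchmark that combines rule awareness, long-tail challenges, and visual salience, providing a standardized framework for relevance assessment that can also be used to evaluate general LLMs and vision-language models (VLMs).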
Abstract: Search relevance plays a central role in web e-commerce. While large language models (LLMs) have shown significant results on relevance tasks, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics across the industry. To address this limitation, we propose Rule-Aware benchmark with Image for Relevance assessment (RAIR), a Chinese dataset derived from real-world scenarios. RAIR establishes a standardized framework for relevance assessment and provides a set of universal rules, which form the foundation for standardized evaluation. Additionally, RAIR analyzes essential capabilities required for current relevance models and introduces a comprehensive dataset consisting of three subsets: (1) a general subset with industry-balanced sampling to evaluate fundamental model competencies; (2) a long-tail hard subset focusing on challenging cases to assess performance limits; (3) a visual salience subset for evaluating multimodal understanding capabilities. We conducted experiments on RAIR using 14 open- and closed-source models. The results demonstrate that RAIR presents sufficient challenges even for GPT-5, which achieved the best performance. RAIR data are now available, serving as an industry benchmark for relevance assessment while providing new insights into general LLM and Visual Language Model (VLM) evaluation.
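eess.IV
[104] Automated Classification of First-Trimester Fetal Heart Views Using Ultrasound-Specific Self-Supervised Learning eess.IV | cs.AI | cs.CV
Youssef Megahed, Aylin Erman, Robin Ducharme, Mark C. Walker, Steven Hawken
TL;DR: This study evaluates USF-MAE, a self-supervised foundation model designed for ultrasound images, for automated classification of first-trimester fetal heart views. The model is pretrained with masked autoencoding on more than 370,000 unlabelled ultrasound images and fine-tuned on an open-source dataset of 6,720 images to classify five categories: aorta, atrioventricular flows, V sign, X sign, and Other. On an independent test set, USF-MAE achieves the best performance, with 90.57% accuracy, outperforming supervised convolutional neural network baselines and a Vision Transformer pretrained on natural images.
Details
Motivation: Congenital heart disease is the most common congenital anomaly and a leading cause of neonatal morbidity and mortality. First-trimester fetal echocardiography offers a chance for earlier detection, but automated analysis at this stage is difficult because of small cardiac structures, low signal-to-noise ratio, and substantial inter-operator variability.
Result: On an independent test set, USF-MAE achieves the highest performance on all evaluation metrics: 90.57% accuracy, 91.15% precision, 90.57% recall, and 90.71% F1-score. Compared with the strongest baseline, ResNet-18, accuracy improves by 2.03% and F1-score by 1.98%.
Insight: The contribution is the use of large-scale, ultrasound-specific self-supervised pretraining (USF-MAE) to improve the accuracy and robustness of first-trimester fetal heart view classification. The approach does not rely on aggressive image preprocessing or region-of-interest cropping and shows better discrimination of non-diagnostic frames. Viewed objectively, applying domain-adapted self-supervised pretraining to medical image analysis, particularly for challenging tasks with low signal-to-noise ratio and small structures, is an effective direction worth adopting.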
Abstract: Congenital heart disease remains the most common congenital anomaly and a leading cause of neonatal morbidity and mortality. Although first-trimester fetal echocardiography offers an opportunity for earlier detection, automated analysis at this stage is challenging due to small cardiac structures, low signal-to-noise ratio, and substantial inter-operator variability. In this work, we evaluate a self-supervised ultrasound foundation model, USF-MAE, for first-trimester fetal heart view classification. USF-MAE is pretrained using masked autoencoding modelling on more than 370,000 unlabelled ultrasound images spanning over 40 anatomical regions and is subsequently fine-tuned for downstream classification. As a proof of concept, the pretrained Vision Transformer encoder was fine-tuned on an open-source dataset of 6,720 first-trimester fetal echocardiography images to classify five categories: aorta, atrioventricular flows, V sign, X sign, and Other. Model performance was benchmarked against supervised convolutional neural network baselines (ResNet-18 and ResNet-50) and a Vision Transformer (ViT-B/16) model pretrained on natural images (ImageNet-1k). All models were trained and evaluated using identical preprocessing, data splits, and optimization protocols. On an independent test set, USF-MAE achieved the highest performance across all evaluation metrics, with 90.57% accuracy, 91.15% precision, 90.57% recall, and 90.71% F1-score. This represents an improvement of +2.03% in accuracy and +1.98% in F1-score compared with the strongest baseline, ResNet-18. The proposed approach demonstrated robust performance without reliance on aggressive image preprocessing or region-of-interest cropping and showed improved discrimination of non-diagnostic frames.
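cs.AI
[105] From Building Blocks to Planning: Multi-Step Spatial Reasoning in LLMs with Reinforcement Learning cs.AI | cs.CL
Amir Tahmasbi, Sadegh Majidi, Kazem Taram, Aniket Bera
TL;DR: This paper proposes a two-stage approach to improve multi-step spatial reasoning in large language models in structured environments. First, supervised fine-tuning teaches the model atomic building blocks of elementary spatial transformations (such as rotation, translation, and scaling). The model is then frozen, and lightweight LoRA adapters are trained within the GRPO framework to learn closed-loop policies that compose these building blocks for multi-step planning. To support the pipeline, the authors synthesize an ASCII-art dataset and build a corresponding ASCII-based reinforcement learning environment.
Details
Motivation: Despite strong general language capabilities, large language models still struggle with spatial transformations and multi-step planning in structured environments. The paper targets this bottleneck, especially for planning problems that require composing multiple operations.
Result: The method consistently outperforms baselines, including the generic backbone, the physics-aware model, and end-to-end RL models, in both Dynamic environments (with explicit state updates) and Static environments (where the model must rely on its internal state). It also converges faster and trains more stably than end-to-end reinforcement learning from scratch.
Insight: The novelty is decomposing spatial reasoning into two stages, learning atomic building blocks and learning a composition policy, and composing efficiently by freezing the base model and training lightweight adapters. This offers a modular, interpretable, and efficient training paradigm for LLMs on tasks that need multi-step physical reasoning. Viewed objectively, the ASCII-art environment also provides a novel and controllable benchmark for evaluating spatial reasoning.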
Abstract: Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework to learn policies that compose these building blocks for multi-step planning in puzzle-based environments, in a closed-loop manner. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, physics-aware model, and end-to-end RL models, under both Dynamic environments with explicit state updates and Static environments where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and exhibits more stable training compared to end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.
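The distinction between atomic transformations and their composition can be illustrated on a toy ASCII grid. The grid encoding (a list of equal-length strings), the `.` fill character, and the function names are assumptions for illustration; the paper's dataset and environment are not described at this level of detail in the abstract.

```python
def rotate_cw(grid):
    """Rotate an ASCII grid (list of equal-length strings) 90 degrees clockwise."""
    width = len(grid[0])
    return ["".join(row[c] for row in reversed(grid)) for c in range(width)]

def translate_right(grid, steps=1, fill="."):
    """Shift every row right by `steps` cells, padding with `fill`."""
    width = len(grid[0])
    steps = min(steps, width)
    return [fill * steps + row[:width - steps] for row in grid]

def apply_plan(grid, plan):
    """Compose atomic transformations in order; a plan is a list of functions."""
    for op in plan:
        grid = op(grid)
    return grid
```

A multi-step planning task in this toy setting amounts to finding a `plan` that maps a start grid to a goal grid, for example `apply_plan(start, [rotate_cw, translate_right])`.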
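[106] Iterative Deployment Improves Planning Skills in LLMs cs.AI | cs.CL | cs.LG
Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira
TL;DR: This paper shows that iteratively deploying large language models (LLMs), with each model fine-tuned on data carefully curated by users from the previous deployment, can significantly change model properties. Tested on several planning domains, the mechanism substantially improves planning skills, and later models generalize by discovering much longer plans than the initial models. Theoretical analysis shows that iterative deployment effectively implements reinforcement learning in the outer loop, with an implicit reward function.
Details
Motivation: The paper explores how iterative deployment with data curation affects LLM planning ability and reveals its implicit connection to reinforcement learning, pointing to an alternative training regime that does not rely on explicit rewards.
Result: In experiments across several planning domains, later models show substantial improvements in planning skills and generalize by producing much longer plans.
Insight: The novelty is viewing iterative deployment combined with user data curation as an implicit reinforcement learning regime that avoids explicit reward design. From a safety perspective, the implicit reward may cause unforeseen changes in model behavior and deserves attention.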
Abstract: We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models’ deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.