Table of Contents

cs.CL

[1] Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL cs.CL | cs.AI

Jiahui Chen, Lei Fu, Jian Cui, Yu Lei, Zhenning Dong

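TL;DR: This paper proposes Companion Agents, a new Text-to-SQL paradigm that addresses the limited performance of existing SOTA systems in industrial settings where database annotations are missing, incomplete, or erroneous. A group of agents is deployed on the database side in advance to proactively mine and consolidate intrinsic fine-grained information such as inter-table relations and value-domain distributions, so that query-relevant knowledge can be selectively activated at inference time to improve Text-to-SQL accuracy.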

Details

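Motivation: Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations and readily available external knowledge. This does not match common industrial settings, where annotations are missing, incomplete, or erroneous, and it limits the practical applicability of SOTA systems. The paper explores a database-centric approach that uses the fine-grained information intrinsic to relational databases to construct the missing evidence and improve Text-to-SQL accuracy under annotation-scarce conditions.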

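Result: On BIRD under the fully missing evidence setting, CA recovers +4.49, +4.37, and +14.13 execution accuracy points on RSL-SQL, CHESS, and DAIL-SQL, respectively, with larger gains of +9.65, +7.58, and +16.71 points on the Challenging subset.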

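Insight: The core innovation is the Companion Agents paradigm, which moves knowledge mining and evidence construction ahead of query time and automates them on the database side, rather than relying on query-time retrieval or manual annotation. This offers a practical path toward industrial-grade deployment without human-curated evidence. Viewed objectively, treating the database itself as a collection of agents that proactively mine and cache relevant knowledge is a novel architectural idea.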

Abstract: Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as readily available external knowledge, which fails to reflect common industrial settings where annotations are missing, incomplete, or erroneous. This mismatch substantially limits the real-world applicability of state-of-the-art (SOTA) Text-to-SQL systems. To bridge this gap, we explore a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence and improve Text-to-SQL accuracy under annotation-scarce conditions. Our key hypothesis is that when a query requires multi-step reasoning over extensive table information, existing methods often struggle to reliably identify and utilize the truly relevant knowledge. We therefore propose to “cache” query-relevant knowledge on the database side in advance, so that it can be selectively activated at inference time. Based on this idea, we introduce Companion Agents (CA), a new Text-to-SQL paradigm that incorporates a group of agents accompanying database schemas to proactively mine and consolidate hidden inter-table relations, value-domain distributions, statistical regularities, and latent semantic cues before query generation. Experiments on BIRD under the fully missing evidence setting show that CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL-SQL / CHESS / DAIL-SQL, respectively, with larger gains on the Challenging subset (+9.65 / +7.58 / +16.71). These improvements stem from CA’s automatic database-side mining and evidence construction, suggesting a practical path toward industrial-grade Text-to-SQL deployment without reliance on human-curated evidence.
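
Illustrative sketch (not from the paper): the abstract describes mining value-domain distributions on the database side before any query arrives. The snippet below shows one simple way such evidence could be gathered from a SQLite database. The function name `mine_value_domains`, the `max_values` cutoff, and the output format are all hypothetical; the paper's agents are presumably far richer than this.

```python
import sqlite3


def mine_value_domains(conn, max_values=5):
    """Collect the most frequent values of every column as reusable 'evidence'.

    Hypothetical helper: returns {"table.column": [(value, count), ...]}.
    """
    evidence = {}
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        ).fetchall()
    ]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for column in columns:
            name = column[1]
            rows = conn.execute(
                f'SELECT "{name}", COUNT(*) AS n FROM "{table}" '
                f'GROUP BY "{name}" ORDER BY n DESC, "{name}" LIMIT ?',
                (max_values,),
            ).fetchall()
            evidence[f"{table}.{name}"] = rows
    return evidence


# Toy database standing in for an unannotated production schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (status TEXT)")
demo.executemany("INSERT INTO orders VALUES (?)", [("paid",), ("paid",), ("open",)])
demo_evidence = mine_value_domains(demo)
```

The cached dictionary could then be filtered per question and passed to a Text-to-SQL model in place of the missing human-written evidence.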


[2] Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework cs.CL

Toshiyuki Shigemura

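TL;DR: This paper proposes a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The framework integrates three heterogeneous LLMs, responsible for semantic generation, analytical consistency checking, and transparency auditing, into a recursive interaction cycle that induces Recursive Knowledge Synthesis. System stability is evaluated in 47 controlled trials on public-access LLM deployments; in most trials the system converges while maintaining high transparency and reliability.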

Details

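Motivation: To address the stability, consistency, and explainability problems that model heterogeneity and interaction complexity cause in multi-LLM systems, with the goal of building a safe, human-supervisable multi-LLM architecture that achieves stable recursive knowledge synthesis.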

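Result: Across 47 controlled trials using public-access LLM deployments, the system achieved a mean Reflex Reliability Score of 0.78 ± 0.06, a Transparency Score ≥ 0.8 in about 68% of trials, and convergence in about 89% of trials. These results provide initial empirical evidence that stable recursive knowledge synthesis is achievable in realistic, publicly deployed environments.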

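Insight: The contributions are: (1) a structured tri-agent framework for coordinating reasoning across heterogeneous LLMs; (2) a formal model of recursive knowledge synthesis grounded in fixed-point theory; and (3) an empirical evaluation of inter-model stability under realistic, non-API public-access conditions. By treating transparency auditing as a contraction operator, the framework promotes convergence and stability, offering a new direction for building explainable, robust multi-LLM systems.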

Abstract: This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs, used for semantic generation, analytical consistency checking, and transparency auditing, into a recursive interaction cycle. This design induces Recursive Knowledge Synthesis (RKS), where intermediate representations are continuously refined through mutually constraining transformations irreducible to single-model behavior. Across 47 controlled trials using public-access LLM deployments (October 2025), we evaluated system stability via four metrics: Reflex Reliability Score (RRS), Transparency Score (TS), Deviation Detection Rate (DDR), and Correction Success Rate (CSR). The system achieved mean RRS = 0.78 ± 0.06 and maintained TS ≥ 0.8 in about 68% of trials. Approximately 89% of trials converged, supporting the theoretical prediction that transparency auditing acts as a contraction operator within the composite validation mapping. The contributions are threefold: (1) a structured tri-agent framework for coordinated reasoning across heterogeneous LLMs, (2) a formal RKS model grounded in fixed-point theory, and (3) empirical evaluation of inter-model stability under realistic, non-API public-access conditions. These results provide initial empirical evidence that a safety-preserving, human-supervised multi-LLM architecture can achieve stable recursive knowledge synthesis in realistic, publicly deployed environments.
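
Background note (not from the paper): the abstract does not spell out its formal model, but the "contraction operator" and "fixed-point theory" language most likely refers to the standard Banach fixed-point theorem. Stated under that assumption, with $T$ as hypothetical notation for the composite validation mapping on a complete metric space $(X, d)$:

```latex
\[
  d\bigl(T(x), T(y)\bigr) \le q \, d(x, y)
  \quad \text{for all } x, y \in X, \qquad 0 \le q < 1 .
\]
Then $T$ has a unique fixed point $x^{*}$, the iteration $x_{n+1} = T(x_n)$
converges to $x^{*}$ from any starting point $x_0$, and
\[
  d(x_n, x^{*}) \le \frac{q^{n}}{1 - q} \, d(x_1, x_0) .
\]
```

Under this reading, a convergent trial is one in which repeated passes through the three agents stop changing the intermediate representation.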


[3] Resisting Correction: How RLHF Makes Language Models Ignore External Safety Signals in Natural Conversation cs.CL | cs.AI

Felipe Biava Cataneo

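TL;DR: This study examines how well instruction-tuned language models respond to external safety signals in interactive settings. Base models comply almost perfectly with external confidence signals, whereas RLHF-optimized instruction-tuned models systematically ignore these corrective signals in natural conversational queries, even though they comply fully under explicit command prompts.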

Details

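Motivation: Safety architectures for language models increasingly rely on external monitors to detect errors and inject corrective signals at inference time. This study tests whether instruction-tuned models remain controllable by such external confidence information across different interaction modes.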

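Result: A causal intervention study with Llama-3.2-3B on GSM8K shows that base models are almost perfectly controllable (Spearman rho close to 1.0). Instruction-tuned models comply fully with corrections under explicit command prompts (bias approximately 0%, rho = 0.93) but systematically ignore the same signals in natural conversational queries (bias +40%, rho = 0.04).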

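Insight: The paper exposes a key side effect of RLHF optimization: in natural dialogue it prioritizes conversational fluency over external calibration cues, so the interaction style users expect is exactly where safety corrections are least effective. The study also shows that internal token-level confidence in small models is unreliable (r = 0.035), which underscores the need for external supervision.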

Abstract: Safety architectures for language models increasingly rely on external monitors to detect errors and inject corrective signals at inference time. For such systems to function in interactive settings, models must be able to incorporate externally provided confidence information into their verbal responses. In this work, we test whether instruction-tuned language models preserve this controllability across different interaction modes. Using Llama-3.2-3B on GSM8K, we perform a causal intervention study in which explicit external confidence signals are injected and model compliance is measured under multiple prompt strategies. We find that base models exhibit near-perfect controllability (Spearman rho close to 1.0), while instruction-tuned models display a striking context dependence: they fully comply with external corrections under explicit command prompts (bias approximately 0 percent, rho = 0.93), yet systematically ignore the same signals in natural conversational queries (bias +40 percent, rho = 0.04). This behavior is not a capability failure, since the model can process the signal, but an emergent property of RLHF optimization that prioritizes conversational fluency over external calibration cues in natural dialogue. We further show that internal token-level confidence in small models is uninformative (r = 0.035), underscoring the necessity of external supervision. Our findings highlight a deployment-critical failure mode: the interaction style users expect is precisely where safety corrections are least effective.
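
Illustrative sketch (not from the paper): compliance is reported as a Spearman rank correlation between the injected confidence signal and the confidence the model expresses. The snippet below computes that statistic in plain Python with average ranks for ties. The function names and the toy numbers are hypothetical; how the paper extracts a stated confidence from a response is not described in the abstract.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: injected confidence (%) vs. confidence stated by a compliant model.
injected = [10, 30, 50, 70, 90]
stated = [12, 28, 55, 71, 88]
rho = spearman_rho(injected, stated)
```

A model that tracks the injected signal yields rho near 1, while a model that ignores it yields rho near 0, which is the contrast the abstract reports between command prompts and conversational queries.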


[4] Emissions and Performance Trade-off Between Small and Large Language Models cs.CL | cs.AI | cs.CY | cs.LG

Anandita Garg, Uma Gaba, Deepan Muthirayan, Anish Roy Chowdhury

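TL;DR: This study investigates fine-tuned Small Language Models (SLMs) as a sustainable alternative to Large Language Models (LLMs) for specific tasks. Comparing the performance-emissions trade-off of the two across natural language processing, reasoning, and programming tasks, it finds that SLMs maintain comparable performance on most tasks while substantially reducing inference carbon emissions.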

Details

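Motivation: Given the enormous carbon footprint of training and running large language models, the study explores whether fine-tuned small language models can serve as a sustainable alternative for predefined tasks and thereby reduce the environmental impact of resource-intensive LLMs.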

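Result: On four of the six selected tasks, SLMs maintained comparable performance while significantly reducing carbon emissions during inference, demonstrating the viability of smaller models for reducing environmental impact.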

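Insight: The paper's contribution is a systematic quantification of the performance-emissions trade-off between LLMs and SLMs. It provides empirical support for green, sustainable AI: fine-tuning small models for specific tasks is an effective strategy for reducing AI's carbon footprint.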

Abstract: The advent of Large Language Models (LLMs) has raised concerns about their enormous carbon footprint, starting with energy-intensive training and continuing through repeated inference. This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Here, we present a comparative analysis of the performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks under Natural Language Processing, Reasoning and Programming. Our results show that in four out of the six selected tasks, SLMs maintained comparable performance with a significant reduction in carbon emissions during inference. Our findings demonstrate the viability of smaller models in mitigating the environmental impact of resource-heavy LLMs, thus advancing towards sustainable, green AI.


[5] Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning cs.CL | cs.AI | cs.LG

Cagatay Tekin, Charbel Barakat, Luis Joseph Luna Limgenco

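TL;DR: This paper proposes InftyThink with Cross-Chain Memory, an extension that adds an embedding-based semantic cache of previously successful reasoning patterns to an iterative reasoning framework, guiding the LLM's reasoning without indiscriminately expanding the context window.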

Details

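Motivation: Existing iterative summarization-based reasoning frameworks such as InftyThink repeatedly regenerate similar reasoning strategies across tasks. The paper aims to improve the efficiency and accuracy of long-horizon reasoning by reusing previously successful patterns.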

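Result: On MATH500, AIME2024, and GPQA-Diamond, semantic lemma retrieval improves accuracy in structured domains but exposes failure modes in tests that include heterogeneous domains. Geometric analysis shows that cache retrieval induces directional biases in embedding space, forming stable attractors that either improve or degrade baseline accuracy.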

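Insight: The innovation is integrating a semantic cache into iterative reasoning so that past reasoning patterns steer inference in a targeted way. Viewed objectively, the work shows that similarity-based memory brings both benefits (directional improvement) and limits (it can entrench biases or degrade performance) for self-improving LLM reasoning.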

Abstract: Iterative summarization-based reasoning frameworks such as InftyThink enable long-horizon reasoning in large language models (LLMs) by controlling context growth, but they repeatedly regenerate similar reasoning strategies across tasks. We introduce InftyThink with Cross-Chain Memory, an extension that augments iterative reasoning with an embedding-based semantic cache of previously successful reasoning patterns. At each reasoning step, the model retrieves and conditions on the most semantically similar stored lemmas, guiding inference without expanding the context window indiscriminately. Experiments on MATH500, AIME2024, and GPQA-Diamond demonstrate that semantic lemma retrieval improves accuracy in structured domains while exposing failure modes in tests that include heterogeneous domains. Geometric analyses of reasoning trajectories reveal that cache retrieval induces directional biases in embedding space, leading to consistent “fix” attractors (which improve baseline accuracy) and “break” attractors (which degrade it). Our results highlight both the benefits and limits of similarity-based memory for self-improving LLM reasoning.
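
Illustrative sketch (not from the paper): the retrieval step described above, "retrieve the most semantically similar stored lemmas", can be pictured as a top-k cosine-similarity lookup over cached embeddings. The cache layout, the function names, and the 2-dimensional toy vectors below are hypothetical; the paper's embedding model and cache policy are not given in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_lemmas(query_vec, cache, k=2):
    """Return the texts of the k cached lemmas most similar to the query.

    `cache` is a list of (embedding, lemma_text) pairs.
    """
    ranked = sorted(cache, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


# Toy cache of previously successful reasoning patterns.
cache = [
    ([1.0, 0.1], "factor the quadratic first"),
    ([0.0, 1.0], "apply the ideal gas law"),
    ([0.9, 0.5], "complete the square"),
]
hits = retrieve_lemmas([1.0, 0.0], cache, k=2)
```

Because every query is pulled toward its nearest cached neighbors, a lookup like this naturally biases reasoning in a consistent direction, which is one way to read the "attractor" behavior the abstract reports.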


[6] PediaMind-R1: A Temperament-Aware Language Model for Personalized Early Childhood Care Reasoning via Cognitive Modeling and Preference Alignment cs.CL | cs.AI

Zihe Zhang, Can Zhang, Yanheng Xu, Xin Hu, Jichao Leng

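TL;DR: This paper presents PediaMind-R1, a domain-specialized large language model for intelligent parenting scenarios that aims at active personalization. Drawing on developmental psychology, it introduces temperament theory from the Thomas-Chess framework and builds a temperament knowledge graph for infants and toddlers (0-3 years). A two-stage training pipeline (supervised fine-tuning followed by GRPO-based alignment) reinforces logical consistency, domain expertise, and empathetic caregiving strategies. Evaluation shows that the model accurately interprets children's temperament profiles and performs individualized reasoning.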

Details

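Motivation: Conventional parenting systems give generic advice and lack personalization. The paper combines developmental psychology theory with vertical-domain modeling to develop a user-centered LLM capable of active personalization in sensitive caregiving contexts.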

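Result: Under an evaluation framework comprising temperament-sensitive multiple-choice tests and human assessments, PediaMind-R1 accurately interprets early childhood temperament profiles and proactively engages in individualized reasoning.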

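Insight: The innovation is combining psychological theory (Thomas-Chess temperament theory) with vertical-domain knowledge graph construction, and using two-stage training (SFT plus GRPO alignment) to specifically optimize logical consistency, domain expertise, and empathetic strategies. This offers a new approach to developing highly personalized domain-specific LLMs.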

Abstract: This paper presents PediaMind-R1, a domain-specialized large language model designed to achieve active personalization in intelligent parenting scenarios. Unlike conventional systems that provide generic suggestions, PediaMind-R1 draws on insights from developmental psychology. It introduces temperament theory from the Thomas-Chess framework and builds a temperament knowledge graph for infants and toddlers (0-3 years). Our two-stage training pipeline first uses supervised fine-tuning to teach structured chain-of-thought reasoning, and then applies a GRPO-based alignment stage to reinforce logical consistency, domain expertise, and empathetic caregiving strategies. We further design an evaluation framework comprising temperament-sensitive multiple-choice tests and human assessments. The results demonstrate that PediaMind-R1 can accurately interpret early childhood temperament profiles and proactively engage in individualized reasoning. This work highlights the value of integrating vertical-domain modeling with psychological theory. It offers a novel approach to developing user-centered LLMs that advance the practice of active personalization in sensitive caregiving contexts.


[7] Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models cs.CL | cs.AI | cs.LG

Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li

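TL;DR: This paper proposes Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination. The policy model interacts with a learned world model to generate multi-step “imagined” trajectories, and an adaptive lookahead mechanism trades off the ultimate goal against task progress, providing rich future signals for policy learning.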

Details

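Motivation: Current world-model-based methods mainly perform single-step or fixed-horizon rollouts and do not fully exploit world models for complex task planning. The paper addresses how to use world models more effectively for multi-step lookahead in order to strengthen agents' reasoning and planning on complex tasks.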

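Result: Extensive experiments across representative agent benchmarks show that ITP significantly outperforms existing baselines. Further analysis confirms that the adaptive lookahead mechanism substantially improves agents' reasoning capability.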

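Insight: The core innovation is a unified framework in which a world model and a policy model interact through multi-step imagination, together with an adaptive mechanism that adjusts the lookahead horizon by task and stage. This suggests a way to tackle broader, more complex tasks: fuse the rich signals from imagined future trajectories (such as progress and potential conflicts) with current observations, forming a partially observable and imaginable Markov decision process that guides learning.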

Abstract: Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent’s policy model interacts with the learned world model, yielding multi-step “imagined” trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents’ reasoning capability, providing valuable insights into addressing broader, complex tasks.


[8] Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM cs.CL

Pedro Memoli Buffa, Luciano Del Corro

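TL;DR: This paper proposes Entropy Sentinel, a method that continuously monitors LLM accuracy in STEM domains by analyzing entropy traces from decoding. Summary statistics of the entropy of final-layer next-token probabilities feed a lightweight classifier that predicts the correctness of each response; aggregating these predictions gives a domain-level accuracy estimate, supporting both monitoring and data-acquisition prioritization in deployment.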

Details

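Motivation: To address two coupled challenges in LLM deployment: monitoring how a model performs as traffic and domains drift, and prioritizing data acquisition to close the largest performance gaps.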

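Result: Evaluated on ten STEM reasoning benchmarks (covering all “10 choose k” train/test combinations for k from 1 to 4) and nine LLMs of different sizes (3B-20B), the estimated accuracy often tracks held-out benchmark accuracy, and several models show near-monotonic ordering of domains.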

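Insight: The innovation is using output-entropy profiles as an inference-time signal to estimate slice-level accuracy under domain shift, which gives a scalable approach to monitoring and targeted data acquisition. Viewed objectively, lightweight statistical feature extraction plus classification enables low-overhead continuous performance evaluation, which has practical value for real deployments.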

Abstract: Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift, and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all “10 choose k” combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
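
Illustrative sketch (not from the paper): the pipeline in the abstract has three steps, namely per-token entropy from top-k log-probabilities, summary statistics over the resulting trace, and averaging per-response correctness probabilities into a domain estimate. The snippet below sketches each step in plain Python. The five statistics shown are stand-ins chosen for illustration (the abstract mentions eleven but does not list them), renormalizing the top-k mass is an assumption, and the classifier itself is omitted.

```python
import math
from statistics import mean, pstdev


def token_entropy(top_logprobs):
    """Shannon entropy (nats) of one decoding step from its top-k logprobs.

    The top-k mass is renormalized to sum to 1, so this approximates the
    entropy of the full next-token distribution.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_profile_stats(trace):
    """Summarize a response's entropy trace (one list of logprobs per token)."""
    entropies = [token_entropy(step) for step in trace]
    return {
        "mean": mean(entropies),
        "max": max(entropies),
        "min": min(entropies),
        "std": pstdev(entropies),
        "last": entropies[-1],
    }


def domain_accuracy_estimate(predicted_correct_probs):
    """Average per-response P(correct) into a domain-level accuracy estimate."""
    return mean(predicted_correct_probs)


# Toy trace: an uncertain first token, then a fully confident one.
features = entropy_profile_stats([[math.log(0.5), math.log(0.5)], [0.0]])
```

In a full system the `features` dictionary would be the input to the lightweight classifier, and `domain_accuracy_estimate` would be applied to that classifier's outputs over all responses in a traffic slice.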


[9] TranslateGemma Technical Report cs.CL | cs.AI

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska

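TL;DR: TranslateGemma is a suite of open machine translation models based on the Gemma 3 foundation models. It enhances Gemma 3's inherent multilingual capabilities through two-stage fine-tuning (supervised fine-tuning and reinforcement learning) to improve translation quality. Evaluated on WMT25 and WMT24++, the models show substantial gains over the baselines, and smaller models often match the performance of larger baseline models.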

Details

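Motivation: To enhance the inherent multilingual capabilities of the Gemma 3 foundation models and specialize them for machine translation, providing powerful open translation tools.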

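Result: Human evaluation was conducted on the WMT25 test set across 10 language pairs, and automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all model sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, improving efficiency. The models also retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark.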

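Insight: The innovation claimed in the abstract is a two-stage fine-tuning strategy: first, supervised fine-tuning on a mixture of large-scale, high-quality synthetic parallel data generated by SOTA models and human-translated parallel data; then reinforcement learning with an ensemble of reward models, including MetricX-QE and AutoMQM, to optimize translation quality. Viewed objectively, the novelty lies in combining advanced synthetic data generation with a reward-model ensemble for reinforcement learning, which effectively adapts a general foundation model to translation and yields an efficient trade-off between model size and performance.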

Abstract: We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.


[10] SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science cs.CL | cs.IR

Sreya Vangara, Jagjit Nanda, Yan-Kai Tzeng, Eric Darve

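TL;DR: SpectraQuery is a hybrid retrieval-augmented conversational assistant designed for battery science. It combines a structured Raman spectroscopy database with unstructured scientific literature, using a SUQL-inspired design to reason jointly across the two modalities.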

Details

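Motivation: In scientific reasoning, structured experimental data and unstructured literature are hard to process jointly. The paper aims to let LLM assistants coordinate retrieval and generation across these modalities.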

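Result: The system performs well on SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation: about 80% of generated SQL queries are fully correct, synthesized answers reach 93-97% groundedness with 10-15 retrieved passages, and battery scientists give high ratings (4.1-4.6/5) for accuracy, relevance, grounding, and clarity.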

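Insight: The innovation is a hybrid retrieval architecture (semantic parsing combined with retrieval-augmented generation) that coordinates SQL and literature retrieval operations, unifying data with explanation. Viewed objectively, the cross-modal joint retrieval design provides effective support for workflows built on high-volume experimental datasets.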

Abstract: Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.


[11] Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers cs.CL | cs.AI

Kaiyu He, Zhang Mian, Peilin Wu, Xinya Du, Zhiyu Chen

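TL;DR: This paper presents a mechanistic study of the “Generalization Circuit” that Transformers form during the “grokking” phase on compositional tasks. Grokked and non-grokked models follow identical inference paths for in-distribution compositional queries, which indicates that the Generalization Circuit is not the sudden acquisition of a new reasoning paradigm but the integration of memorized atomic facts into an already established reasoning path. The study also shows that high accuracy after prolonged training is not necessarily tied to the formation of a particular reasoning path, and that a mature circuit transfers only to a limited degree when integrating new knowledge.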

Details

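Motivation: Large language models face the “curse of two-hop reasoning” on compositional tasks. The paper asks whether the Generalization Circuit formed through grokking actually improves downstream performance, and whether the large computational cost of waiting for it is worthwhile.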

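Result: The mechanistic analysis yields qualitative conclusions: grokked and non-grokked models use the same inference paths; high accuracy and a particular reasoning path can arise independently; and a mature circuit has limited ability to transfer to new knowledge. No specific benchmarks or quantitative SOTA comparisons are mentioned.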

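Insight: The innovation is challenging the view that grokking represents the sudden acquisition of a new reasoning ability, arguing instead that it is a process of memory integration, and revealing the limits of generalization circuits in assimilating knowledge. This offers a new perspective on Transformer generalization and on judging the value of training strategies.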

Abstract: While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the “curse of two-hop reasoning” in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a “Generalization Circuit” during a prolonged “grokking” phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit’s role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the “Generalization Circuit” does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that “grokked” Transformers do not achieve a full mastery of compositional logic.


[12] Mi:dm 2.0 Korea-centric Bilingual Language Models cs.CL | cs.AI

Donghoon Shin, Sejung Lee, Soonmin Bae, Hwijung Ryu, Changwon Ok

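TL;DR: Mi:dm 2.0 is a bilingual large language model engineered for Korea-centric AI. By integrating the values, reasoning patterns, and commonsense knowledge of Korean society, it achieves a nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios and generates reliable, culturally appropriate responses. Two configurations are offered: Base (11.5B parameters) for general-purpose use and Mini (2.3B parameters) for resource-constrained environments. The model achieves state-of-the-art performance on Korean-specific benchmarks.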

Details

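Motivation: To address the limitations of existing LLMs caused by insufficient or low-quality Korean data and a lack of cultural alignment, and to advance Korea-centric AI.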

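Result: The model achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks.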

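Insight: The innovation is the emphasis on data quality through proprietary data cleansing, high-quality synthetic data generation, data mixing under a curriculum learning strategy, and a custom Korean-optimized tokenizer, which together improve the model's understanding and generation in Korean cultural contexts. A depth-up scaling strategy and configurations at different parameter scales serve different application scenarios.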

Abstract: We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact midm-llm@kt.com.


[13] SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding cs.CL | cs.AI

Shuyang Hou, Yi Hu, Muhan Zhang

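TL;DR: This paper introduces SubTokenTest, a benchmark for evaluating the sub-token understanding of large language models in real-world applications. It isolates tokenization-related errors through ten practical tasks across four domains and provides a comprehensive evaluation of nine advanced models.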

Details

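Motivation: Large language models perform poorly on character-level tasks such as counting letters in words, a weakness rooted in tokenization. The failures exposed by existing benchmarks are often dismissed as lacking practical relevance, yet many real-world applications, such as navigating text-based maps or parsing tables, depend heavily on precise sub-token understanding.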

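Result: The paper comprehensively evaluates nine advanced LLMs, studies the effect of test-time scaling on sub-token reasoning, and examines how character-level information is encoded in hidden states.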

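Insight: The innovation is building a comprehensive benchmark from practical, utility-driven tasks that isolates tokenization-related failures, and probing how models internally handle character information. This offers a new perspective on improving fine-grained text understanding.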

Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.


[14] ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection cs.CL | cs.AI

Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang

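TL;DR: This paper proposes ProFit, which improves supervised fine-tuning (SFT) through probability-guided token selection. Traditional SFT forces the model to align with a single reference answer, ignoring the one-to-many nature of language and causing overfitting to non-core expressions. ProFit exploits the intrinsic connection between token probability and semantic importance: it selectively masks low-probability tokens to prevent surface-level overfitting, improving performance without depending on diverse reference answers.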

Details

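Motivation: When aligning large language models (LLMs) with human intent, traditional supervised fine-tuning (SFT) usually forces the model to learn a single reference answer. This ignores the one-to-many nature of language and leads to overfitting on non-core surface expressions. Introducing multiple reference answers can mitigate the problem, but the data and computational costs are prohibitive, so a more efficient strategy for reducing single-reference overfitting is needed.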

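Result: Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.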

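Insight: The paper's claimed innovation is revealing the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Building on this, ProFit selectively masks low-probability tokens under probability guidance to prevent surface-level overfitting. Viewed objectively, this is a novel and efficient way to mitigate SFT overfitting that needs no expensive multi-reference data and has practical value.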

Abstract: Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
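
Illustrative sketch (not from the paper): one simple reading of "selectively masks low-probability tokens" is to drop those tokens from the negative log-likelihood loss. The snippet below does this for a single sequence, given the probability the model assigns to each reference token. The hard threshold, its default value of 0.5, and the function name are hypothetical; the abstract does not say how ProFit chooses which tokens to mask or which model supplies the probabilities.

```python
import math


def masked_sft_loss(ref_token_probs, threshold=0.5):
    """Mean negative log-likelihood over reference tokens with p >= threshold.

    Tokens below the threshold are masked out, so they contribute no gradient
    signal in a real training loop. Returns 0.0 if every token is masked.
    """
    kept = [p for p in ref_token_probs if p >= threshold]
    if not kept:
        return 0.0
    return -sum(math.log(p) for p in kept) / len(kept)


# Toy sequence: the middle token is a low-probability, replaceable expression.
probs = [0.9, 0.1, 0.8]
loss_masked = masked_sft_loss(probs, threshold=0.5)
loss_plain = masked_sft_loss(probs, threshold=0.0)  # standard SFT: keep all
```

Comparing `loss_masked` with `loss_plain` shows how much of the standard SFT loss is driven by the single low-probability token, which is the portion a masking scheme of this kind would ignore.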


[15] A.X K1 Technical Report cs.CL | cs.AI

Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo

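TL;DR: This paper introduces A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Pre-trained on a corpus of about 10T tokens, it is designed to bridge the gap between reasoning capability and inference efficiency and supports explicitly controllable reasoning. The paper proposes a simple yet effective Think-Fusion training recipe that lets users switch between “thinking” and “non-thinking” modes within a single unified model.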

Details

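Motivation: To address the trade-off between reasoning capability and inference efficiency in large language models, and to provide controllable reasoning suited to diverse real-world deployment scenarios.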

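Result: Extensive evaluations show that A.X K1 performs competitively with leading open-source models and establishes a distinctive advantage on Korean-language benchmarks.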

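Insight: The main innovations are: (1) using scaling laws to optimize training configurations and vocabulary size under a fixed compute budget; (2) the Think-Fusion training recipe, which enables user-controlled switching of reasoning modes within a single model; (3) a multi-stage data processing pipeline for building a high-quality training corpus; and (4) an advantage on language-specific (Korean) tasks while remaining competitive in general.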

Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.


[16] UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning cs.CL

Feng Zhang, Shijia Li, Chunmao Zhang, Zhanyu Ma, Jun Xu

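TL;DR: This paper proposes UserLM-R1, a user language model with reasoning capability, to address the reliance of existing user simulators on static profiles and their lack of human strategic thinking. It builds comprehensive user profiles that combine static roles with dynamic scenario-specific goals, uses a goal-driven decision-making policy to generate high-quality rationales, and then refines reasoning and strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning.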

Details

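Motivation: Existing user simulators have two main problems. First, they rely on static, context-unaware profiles, which generalize poorly and require extensive manual adjustment. Second, they neglect human strategic thinking, which leaves them vulnerable to manipulation by the agent. The paper aims to build an ideal user simulator that generalizes across domains and proactively negotiates, for example by challenging or bargaining.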

Result: Extensive experiments show that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.

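Insight: The innovations are: comprehensive user profiles that combine static roles with dynamic scenario-specific goals to improve generalization; a goal-driven decision-making policy that produces high-quality rationales before generating a response; and a combination of supervised fine-tuning and multi-reward reinforcement learning that refines reasoning and strategic capabilities to simulate rational human decision-making.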

Abstract: User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.


[17] When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation cs.CL

Jing Ren, Bowen Li, Ziqi Xu, Xinkun Zhang, Haytham Fayek

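TL;DR: This paper proposes Ca2KG, a causality-aware calibration framework for Knowledge Graph Retrieval-Augmented Generation (KG-RAG). It targets the overconfidence of existing KG-RAG models when retrieved sub-graphs are incomplete or unreliable, improving calibration by integrating counterfactual prompting with a panel-based re-scoring mechanism.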

Details

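Motivation: Although existing KG-RAG models improve factual accuracy on complex tasks, they are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable. This raises concerns for deployment in high-stakes domains.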

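Result: Extensive experiments on two complex QA datasets show that Ca2KG consistently improves calibration while maintaining or even improving predictive accuracy.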

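Insight: The innovation is bringing causal reasoning into KG-RAG calibration: counterfactual prompting exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, and a panel-based re-scoring mechanism stabilizes predictions across interventions, giving RAG systems more reliable confidence estimates.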

Abstract: Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.


[18] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus cs.CLPDF

Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang

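TL;DR: This paper presents the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA), which aims to fill the gap in the audio modality for multimodal large language models (MLLMs) in Chinese Classical Studies (CCS). The corpus covers six tasks: automatic speech recognition, speech-to-text translation, speech emotion captioning, spoken question answering, speech understanding, and speech reasoning. An evaluation of ten MLLMs shows that current models still face substantial challenges on the MCGA test set. The authors also introduce an evaluation metric for speech emotion captioning and a metric for the consistency between the speech and text capabilities of MLLMs, and they release the corpus and code to support the development of multidimensional audio capabilities in CCS.

Details

Motivation: With the rapid development of MLLMs, their potential in CCS has drawn attention, but existing research focuses mainly on text and visual modalities, and audio corpora in this domain remain underexplored. The paper therefore builds MCGA to close this gap and advance the use of MLLMs in the audio modality.

Result: Ten MLLMs were evaluated on the MCGA test set. The results show that current models still face substantial challenges on this corpus, with insufficient performance on automatic speech recognition, speech-to-text translation, speech emotion captioning, spoken question answering, speech understanding, and speech reasoning.

Insight: The novelty lies in building, for the first time, a multi-task audio corpus focused on classical Chinese literary genres. It covers six audio tasks and introduces new evaluation metrics (a speech emotion captioning metric and a speech-text capability consistency metric), providing a benchmark and tools for evaluating and improving the audio capabilities of MLLMs in CCS.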

Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA


[19] ReGraM: Region-First Knowledge Graph Reasoning for Medical Question Answering cs.CL | cs.AIPDF

Chaerin Lee, Sohee Park, Hyunsik Na, Daseon Choi

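TL;DR: ReGraM is a "region-first" knowledge graph reasoning framework for medical question answering. It builds a query-aligned subgraph for each query and performs stepwise reasoning under multiple evidence-aware modes, focusing on the most relevant part of the knowledge graph to improve factual accuracy.

Details

Motivation: Existing methods typically traverse the entire knowledge graph or perform large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. The core challenge is to identify and reason over the appropriate subset of evidence for each query, not to expand access to knowledge.

Result: On seven medical QA benchmarks, ReGraM consistently outperforms the strong baseline KGARevion, with an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate.

Insight: The novelty is the "region-first" reasoning paradigm: first build a query-aligned subgraph region, then reason step by step within it. This drops the assumption that all relations are equally important, and the alignment of region construction with hop-wise reasoning drives the performance gains.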

Abstract: Recent studies in medical question answering (Medical QA) have actively explored the integration of large language models (LLMs) with biomedical knowledge graphs (KGs) to improve factual accuracy. However, most existing approaches still rely on traversing the entire KG or performing large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. We argue that the core challenge lies not in expanding access to knowledge, but in identifying and reasoning over the appropriate subset of evidence for each query. ReGraM is a region-first knowledge graph reasoning framework that addresses this challenge by constructing a query-aligned subgraph and performing stepwise reasoning constrained to this localized region under multiple evidence-aware modes. By focusing inference on only the most relevant portion of the KG, ReGraM departs from the assumption that all relations are equally useful, an assumption that rarely holds in domain-specific medical settings. Experiments on seven medical QA benchmarks demonstrate that ReGraM consistently outperforms a strong baseline (KGARevion), achieving an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate. Ablation and qualitative analyses further show that aligning region construction with hop-wise reasoning is the primary driver of these improvements. Overall, our results highlight region-first KG reasoning as an effective paradigm for improving factual accuracy and consistency in medical QA.


[20] Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish cs.CLPDF

Aidana Aidynkyzy, Oğuz Dikenelli, Oylum Alatlı, Şebnem Bora

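TL;DR: This study presents the first comprehensive bilingual evaluation of large language models on clinical relation extraction in English and Turkish, and builds the first English-Turkish parallel clinical relation extraction dataset. A systematic evaluation of multiple prompting strategies finds that prompting-based LLM approaches consistently outperform traditional fine-tuned models. The authors also propose a relation-aware retrieval method based on contrastive learning that significantly improves model performance.

Details

Motivation: Address the scarcity of annotated data for clinical information extraction in non-English languages, and evaluate LLMs on clinical relation extraction in English and Turkish to help close the resource gap.

Result: On the constructed English-Turkish parallel dataset, prompting-based LLM approaches (e.g., Gemini 1.5 Flash with RAR) reach micro-F1 scores of 0.906 in English and 0.888 in Turkish. DeepSeek-V3 combined with a structured reasoning prompt further improves the English score to 0.918 F1. All of these outperform traditional fine-tuned baselines such as PURE.

Insight: The contributions include the first English-Turkish parallel clinical RE dataset and the Relation-Aware Retrieval method, which captures both sentence-level and relation-level semantics to improve in-context example selection. The analysis suggests that high-quality demonstration retrieval and advanced prompting techniques can effectively improve LLM performance in low-resource languages.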

Abstract: The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
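For readers unfamiliar with the reported metric, a minimal sketch of micro-averaged F1 for relation classification is given below. Excluding a "no relation" label from the counts is a common convention and is assumed here; the abstract does not state the exact scoring protocol, and the label names in the usage note are placeholders.

```python
def micro_f1(gold, pred, negative_label="None"):
    """Micro-F1 over all relation types, pooling TP/FP/FN across classes.
    Assumption: instances labeled `negative_label` count as 'no relation'."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p == g and g != negative_label:
            tp += 1  # correctly predicted relation
        else:
            if p != negative_label and p != g:
                fp += 1  # predicted a relation that is not the gold one
            if g != negative_label and p != g:
                fn += 1  # missed (or mislabeled) a gold relation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With gold labels `["A", "None", "B", "C"]` and predictions `["A", "B", "None", "C"]`, there are two true positives, one false positive, and one false negative, so precision, recall, and micro-F1 are all 2/3.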


[21] Ability Transfer and Recovery via Modularized Parameters Localization cs.CL | cs.AI | cs.LGPDF

Songyao Jin, Kun Zhou, Wenqi Li, Peng Wang, Biwei Huang

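TL;DR: This paper studies how abilities are distributed across the parameters of large language models. It finds that ability-related activations are highly concentrated in a small set of channels (typically <5%), and that these channels are largely disentangled, sufficient, and stable. Building on this, the authors propose ACT, which localizes ability-relevant channels via activation differences, selectively transfers the corresponding parameters, and applies lightweight fine-tuning, in order to recover forgotten abilities or merge the abilities of several specialized models.

Details

Motivation: Continual pre-training or fine-tuning an LLM to improve a specific domain, language, or skill often degrades other abilities or causes catastrophic forgetting.

Result: Experiments on multilingual mathematical and scientific reasoning show that ACT recovers forgotten abilities while preserving retained skills, and can merge the abilities of multiple specialized models into a single model with minimal interference.

Insight: The novelty lies in showing that ability-related activations in LLMs are highly concentrated and disentangled, and in using this to build an activation-guided, channel-level method for ability localization and selective parameter transfer (ACT), which offers a new approach to model ability editing and merging.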

Abstract: Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically <5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.
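The localize-then-transfer idea can be sketched in a few lines. The sketch below assumes that "activation difference" means the absolute difference of per-channel mean activations between two closely related models, and that one weight row corresponds to one output channel. Both are illustrative assumptions, and the function names are hypothetical; the lightweight fine-tuning step is not shown.

```python
def localize_channels(acts_base, acts_specialized, top_fraction=0.05):
    """Return the channel indices with the largest mean activation gap.
    acts_*: lists of per-input activation vectors of equal length."""
    n_channels = len(acts_base[0])

    def channel_means(acts):
        return [sum(row[c] for row in acts) / len(acts) for c in range(n_channels)]

    base_means = channel_means(acts_base)
    spec_means = channel_means(acts_specialized)
    gaps = [abs(s - b) for s, b in zip(spec_means, base_means)]
    k = max(1, int(n_channels * top_fraction))
    ranked = sorted(range(n_channels), key=lambda c: gaps[c], reverse=True)
    return sorted(ranked[:k])


def transfer_channels(target_weights, donor_weights, channels):
    """Copy only the selected rows (channels) from donor into a copy of target."""
    merged = [row[:] for row in target_weights]
    for c in channels:
        merged[c] = donor_weights[c][:]
    return merged
```

Only the selected rows change, which is what keeps interference with the target model's other abilities small in this simplified picture.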


[22] Improving Symbolic Translation of Language Models for Logical Reasoning cs.CL | cs.AIPDF

Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, Ehsan Shareghi

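TL;DR: This paper proposes a method for improving the symbolic translation performance of language models in logical reasoning. It splits inference into two stages, predicate generation and first-order logic translation, and adds a verification module to reduce translation errors, improving the reliability and reasoning ability of smaller language models.

Details

Motivation: Smaller language models often produce inaccurate symbolic output when translating natural language into first-order logic because of formatting and translation errors. Existing approaches rely on the model's own self-iteration and have limited effect.

Result: Three model families were evaluated on four logical reasoning datasets. The combination of fine-tuning, incremental inference, and the verification module reduces error rates and improves predicate coverage and reasoning performance.

Insight: The novelty lies in decomposing inference into a two-stage incremental framework and adding a verification module that targets predicate-arity errors. This gives more control over model behavior and improves generation quality, pointing toward reliable and accessible symbolic reasoning systems.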

Abstract: The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.


[23] Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats cs.CL | cs.AIPDF

Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen

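TL;DR: This paper systematically studies post-training quantization (PTQ) of large language models under Microscaling Floating-Point (MXFP) formats, evaluating over 7 PTQ algorithms, 15 benchmarks, and 3 LLM families. It finds that MXFP8 achieves near-lossless performance while MXFP4 remains challenging, and it offers practical guidance on adapting existing PTQ methods to MXFP.

Details

Motivation: MXFP is a promising low-precision format for LLMs, but existing PTQ algorithms focus mainly on integer quantization, and their applicability and behavior under MXFP formats remain largely unexplored. This paper aims to fill that gap.

Result: Across multiple benchmarks, MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation. The quantization scaling factor is identified as a critical error source in MXFP4, and a simple pre-scale optimization strategy significantly mitigates its impact.

Insight: The novelty lies in the first systematic evaluation of PTQ under MXFP formats. It shows that format compatibility strongly affects PTQ effectiveness and that trends are consistent across model families and modalities; in particular, quantization sensitivity is dominated by the language model rather than the vision encoder. These findings offer new guidance for low-precision quantization in practice.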

Abstract: Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
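To make the role of the shared scaling factor concrete, here is a rough sketch of block-wise MXFP4-style quantization, assuming the usual microscaling layout: elements stored in FP4 (E2M1) and one shared power-of-two scale per block. The block size, rounding rule, and function name are simplifications for illustration and are not taken from the paper.

```python
import math

# Non-negative magnitudes representable in FP4 (E2M1).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # largest element exponent: 6.0 = 1.5 * 2**2


def quantize_block_mxfp4(block):
    """Quantize one block with a shared power-of-two scale.
    Returns (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale chosen so the largest magnitude lands near the top of the grid.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for x in block:
        mag = min(abs(x) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        dequantized.append(math.copysign(nearest, x) * scale)
    return scale, dequantized
```

For the block `[12.0, 0.3]` the shared scale is 2.0, so 12.0 is preserved but 0.3 is rounded to zero. One outlier sets the scale for the whole block, which gives some intuition for why the scaling factor can be a significant error source at 4 bits.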


[24] Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering cs.CLPDF

Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo

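TL;DR: This paper proposes Dialogue Telemetry (DT), a measurement framework for monitoring the efficiency of autonomous information-gathering dialogues. After each question-answer exchange, DT produces two model-agnostic signals: a Progress Estimator (PE) that quantifies the residual information potential of each category, and a Stalling Index (SI) that detects stalled dialogue patterns caused by repeated category probing and low-marginal-gain responses. The framework is validated in simulated search-and-rescue (SAR) interviews driven by large language models (LLMs), where it distinguishes efficient from stalled dialogue traces, and its downstream utility is shown by integrating DT signals into a reinforcement learning (RL) policy.

Details

Motivation: Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap: they lack turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes inefficient or unproductive.

Result: In controlled SAR-inspired interviews using LLM-based simulations, DT effectively distinguishes efficient from stalled dialogue traces. When DT signals are integrated into an RL policy, the interpretable turn-level instrumentation improves policy performance in settings where stalling carries operational costs.

Insight: The novelty is a model-agnostic, turn-level framework for measuring dialogue efficiency. Its two core signals, PE and SI, detect stalling patterns without requiring causal diagnosis and give autonomous dialogue systems an interpretable tool for monitoring and policy optimization. Combining information-theoretic concepts (such as the bits-based variant) with dialogue state analysis, and integrating them into an RL policy to account for operational costs, is a system design with practical value.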

Abstract: Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.


[25] DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing cs.CL | cs.AIPDF

Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song

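TL;DR: This paper proposes DPWriter, a reinforcement learning framework that addresses the loss of output diversity when large language models are enhanced with reinforcement learning, especially in open-ended tasks such as creative writing. The method decomposes generation through a semi-structured long Chain-of-Thought and introduces a Diverse Planning Branching strategy and a group-aware diversity reward to guide diverse exploration at the planning stage.

Details

Motivation: Existing RL-based methods for enhancing LLMs tend to prioritize optimization efficiency and performance and lack explicit mechanisms for guiding diverse exploration. This reduces output diversity in open-ended tasks and limits practical usefulness, which is the problem this paper targets.

Result: Experiments on creative writing benchmarks show that the method significantly improves output diversity without compromising generation quality, and consistently outperforms existing baselines.

Insight: The main novelty is an RL framework built around a semi-structured long Chain-of-Thought, together with a Diverse Planning Branching method driven by diversity variation and a group-aware diversity reward that encourages distinct trajectories at the planning stage. Together they explicitly guide and preserve output diversity during RL optimization.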

Abstract: Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.


[26] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation cs.CLPDF

Stergios Chatzikyriakidis

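TL;DR: This paper proposes a hybrid system that combines large language models (LLMs) with deterministic phonological algorithms to address the weaknesses of LLMs in rhyme detection and generation for Greek poetry. The system defines a comprehensive taxonomy of Greek rhyme types and uses an agentic generation pipeline with phonological verification. Experiments show that pure LLM generation performs very poorly, while the hybrid verification approach substantially improves performance.

Details

Motivation: Although LLMs perform well across many NLP tasks, they struggle with phonologically grounded phenomena such as rhyme detection and generation, and this is especially evident in lower-resource languages such as Modern Greek.

Result: On rhyme identification, the reasoning-heavy model (Claude 4.5) reaches 54% accuracy with Chain-of-Thought prompting, the best result reported. On generation, pure LLM output yields under 4% valid poems, while the hybrid verification loop restores performance to 73.1%.

Insight: The novelty lies in combining the generative ability of LLMs with deterministic phonological rules, using a verification loop to compensate for the inherent weaknesses of LLMs on phonological tasks, and in building a large, rigorously cleaned Greek rhyme corpus to support research.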

Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.


[27] TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion cs.CLPDF

Sahil Mishra, Srinitish Srinivasan, Srikanta Bedathur, Tanmoy Chakraborty

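TL;DR: TaxoBell is a self-supervised taxonomy expansion framework based on Gaussian box embeddings. It translates between box geometries and multivariate Gaussian distributions, with means encoding semantic location and covariances encoding uncertainty. This addresses the difficulty that traditional point-vector embeddings have in modeling asymmetric "is-a" relations, and the method significantly outperforms existing approaches on five benchmark datasets.

Details

Motivation: Manual taxonomy expansion is labor-intensive and cannot keep pace with new concepts. Existing point-vector methods struggle to model the asymmetric "is-a" relations in taxonomies. Box embeddings can represent containment and disjointness, but they suffer from unstable gradients at boundaries, no modeling of semantic uncertainty, and limited capacity to represent polysemy.

Result: Experiments on five benchmark datasets show that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines, by 19% in MRR and around 25% in Recall@k.

Insight: The novelty lies in combining box embeddings with Gaussian distributions, with means encoding semantic location and covariances encoding uncertainty. This yields stable energy-based optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning, giving taxonomy expansion a more expressive representation.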

Abstract: Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric “is-a” relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
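The abstract does not spell out the energy function. To illustrate why Gaussian representations suit asymmetric "is-a" relations, the sketch below uses the KL divergence between diagonal Gaussians, a standard asymmetric measure. It is chosen here purely for illustration and should not be read as TaxoBell's actual objective.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians with diagonal covariance.
    Per dimension: 0.5 * (var_p/var_q + (mu_q - mu_p)^2/var_q - 1 + ln(var_q/var_p))."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total
```

A narrow "child" concept sitting inside a broad "parent" gives a small KL(child || parent) but a large KL(parent || child). For example, with a shared mean, child variance 0.1, and parent variance 1.0, the two directions are roughly 0.70 and 3.35. A symmetric point-vector similarity cannot express this directionality.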


[28] Value-Aware Numerical Representations for Transformer Language Models cs.CL | cs.AI | cs.LGPDF

Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu

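TL;DR: This paper addresses the weak numerical understanding of Transformer language models in mathematical reasoning. It proposes a value-aware numerical representation that augments the model input by adding, for each number, a prefix token whose embedding explicitly encodes the numerical magnitude, improving performance on arithmetic tasks.

Details

Motivation: Transformer language models perform well on mathematical reasoning benchmarks but remain fragile on basic numerical understanding and arithmetic. The core limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, which leads to systematic errors.

Result: In evaluations on arithmetic tasks, the method outperforms baselines across numerical formats, tasks, and operand lengths, indicating that explicitly encoding numerical value is an effective and efficient way to improve basic numerical robustness in language models.

Insight: The novelty is a value-aware prefix token mechanism that is compatible with existing tokenizers and decoder-only Transformer architectures. It injects magnitude information directly into the input space, strengthening the model's underlying grasp of numerical values.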

Abstract: Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model’s input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
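A rough sketch of the input-side mechanism might look like the following. The `<num>` marker, the whitespace tokenization, and the two-component magnitude feature (sign and log-magnitude) are all hypothetical choices made for illustration. In a real model the feature vector would presumably be mapped to the embedding dimension by a learned projection, which is not shown.

```python
import math
import re

NUM_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def value_features(value):
    """Hypothetical magnitude features: sign and log10(1 + |value|)."""
    sign = 0.0 if value == 0 else math.copysign(1.0, value)
    return [sign, math.log10(1.0 + abs(value))]


def add_value_prefixes(text):
    """Insert a '<num>' prefix token before each number.
    Returns (tokens, features); features[i] is None for ordinary tokens
    and a magnitude vector for each '<num>' prefix."""
    tokens, features = [], []
    for piece in text.split():
        if NUM_PATTERN.fullmatch(piece):
            tokens.append("<num>")
            features.append(value_features(float(piece)))
        tokens.append(piece)
        features.append(None)
    return tokens, features
```

Calling `add_value_prefixes("add 12 and 345")` yields the tokens `["add", "<num>", "12", "and", "<num>", "345"]`. The number tokens themselves are left untouched, which is what keeps the scheme compatible with an existing tokenizer.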


cs.CV [Back]

[29] Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR cs.CV | cs.AI | cs.CLPDF

Yufeng Zhong, Lei Chen, Zhixiong Zeng, Xuanle Zhao, Deyang Jiang

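TL;DR: This paper proposes Format Decoupled Reinforcement Learning (FD-RL) to improve document OCR models on formatted text such as formulas and tables. The method uses an entropy-based data filtering strategy to identify format-intensive instances and applies decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization.

Details

Motivation: Even advanced OCR models show much higher output uncertainty (entropy) on formatted text than on plain text. This indicates that models struggle with format-sensitive documents and that reasoning over diverse reading pathways may be needed to improve performance.

Result: FD-RL achieves an average score of 90.41 on the popular OmniDocBench benchmark, setting a new record for end-to-end models on this benchmark.

Insight: The novelty lies in using high-entropy patterns for targeted optimization. Format-decoupled rewards and data filtering shift the model from simple token memorization toward format-level reasoning, which offers a new approach to handling complex document structure.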

Abstract: Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formulas and tables) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs an entropy-based data filtration strategy to identify format-intensive instances, and adopts format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.
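The entropy-based filtering step can be illustrated with a small sketch. It assumes access to the model's per-token output distributions and uses mean token entropy with a fixed threshold as the selection rule. The threshold, the averaging, and the function names are assumptions; the abstract does not give the exact criterion.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average token entropy over all decoding steps of one sample."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


def select_high_entropy(samples, threshold):
    """Keep sample names whose mean entropy exceeds the threshold.
    samples: dict mapping a name to its list of per-step distributions."""
    return [name for name, dists in samples.items() if mean_entropy(dists) > threshold]
```

A sample decoded with near-certain predictions (plain text, in the paper's observation) falls below the threshold, while a sample with flat distributions (formulas or tables) is retained for targeted training.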


[30] Bias Detection and Rotation-Robustness Mitigation in Vision-Language Models and Generative Image Models cs.CV | cs.AIPDF

Tarannum Mithila

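TL;DR: This paper studies bias propagation and robustness degradation in vision-language models and generative image models under image rotation and distribution shift. It proposes rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model regularization to improve fairness and reliability.

Details

Motivation: Vision-language models and generative image models perform well on multimodal tasks, but their robustness and fairness under input transformations have not been fully explored. The paper targets the bias and performance degradation these models show under transformations such as image rotation.

Result: Experiments on multiple datasets show that the proposed methods significantly improve robustness and reduce bias amplification without sacrificing overall performance.

Insight: The novelty lies in systematically analyzing how rotation perturbations affect model predictions, confidence calibration, and demographic bias patterns, and in proposing a combined mitigation strategy that offers practical techniques for building more reliable and fair AI models.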

Abstract: Vision-Language Models (VLMs) and generative image models have achieved remarkable performance across multimodal tasks, yet their robustness and fairness under input transformations remain insufficiently explored. This work investigates bias propagation and robustness degradation in state-of-the-art vision-language and generative models, with a particular focus on image rotation and distributional shifts. We analyze how rotation-induced perturbations affect model predictions, confidence calibration, and demographic bias patterns. To address these issues, we propose rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model-level regularization. Experimental results across multiple datasets demonstrate that the proposed methods significantly improve robustness while reducing bias amplification without sacrificing overall performance. This study highlights critical limitations of current multimodal systems and provides practical mitigation techniques for building more reliable and fair AI models.


[31] Residual Cross-Modal Fusion Networks for Audio-Visual Navigation cs.CV | cs.AI | cs.ROPDF

Yi Wang, Yinfeng Yu, Bin Ren

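TL;DR: This paper proposes a Cross-Modal Residual Fusion Network (CRFN) for audio-visual navigation, targeting single-modality dominance and information degradation when fusing heterogeneous features. The method uses bidirectional residual interactions for complementary modeling and fine-grained alignment. It significantly outperforms existing fusion methods on the Replica and Matterport3D datasets and shows stronger cross-domain generalization.

Details

Motivation: In audio-visual embodied navigation, insufficient modeling of interactions between heterogeneous features during multimodal fusion leads to single-modality dominance or information degradation, particularly in cross-domain scenarios.

Result: Experiments on the Replica and Matterport3D datasets show that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization.

Insight: The novelty lies in using bidirectional residual interactions for complementary cross-modal modeling instead of simple concatenation or attention gating. The paper also finds that agents exhibit different modality dependence across datasets, which offers a new perspective on the cross-modal collaboration mechanisms of embodied agents.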

Abstract: Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across different datasets. The discovery of this phenomenon provides a new perspective for understanding the cross-modal collaboration mechanism of embodied agents.


[32] ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection cs.CV | cs.AI | cs.LG | cs.MMPDF

Hema Hariharan Samson

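TL;DR: This paper proposes ForensicFormer, a hierarchical multi-scale reasoning framework for cross-domain image forgery detection. It unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning through cross-attention transformers, reaching 86.8% average accuracy across seven different test sets and significantly outperforming existing universal detectors.

Details

Motivation: AI-generated imagery and sophisticated editing tools have made traditional forensic methods ineffective for cross-domain forgery detection, so robust detection that adapts to unknown manipulation techniques is needed.

Result: The method reaches 86.8% average accuracy (SOTA) across seven diverse test sets covering traditional manipulations, GAN-generated images, and diffusion model outputs. Robustness to JPEG compression improves markedly (83% at Q=70 vs. 66% for baselines), and pixel-level forgery localization reaches an F1-score of 0.76.

Insight: The novelty lies in the hierarchical multi-scale reasoning architecture, which fuses forensic cues from different levels through cross-attention. Ablation studies confirm the contribution of each level (a 4-10% accuracy improvement), and the resulting features are interpretable and aligned with human expert reasoning, connecting classical image forensics with modern deep learning.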

Abstract: The proliferation of AI-generated imagery and sophisticated editing tools has rendered traditional forensic methods ineffective for cross-domain forgery detection. We present ForensicFormer, a hierarchical multi-scale framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning via cross-attention transformers. Unlike prior single-paradigm approaches, which achieve <75% accuracy on out-of-distribution datasets, our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs - a significant improvement over state-of-the-art universal detectors. We demonstrate superior robustness to JPEG compression (83% accuracy at Q=70 vs. 66% for baselines) and provide pixel-level forgery localization with a 0.76 F1-score. Extensive ablation studies validate that each hierarchical component contributes 4-10% accuracy improvement, and qualitative analysis reveals interpretable forensic features aligned with human expert reasoning. Our work bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown a priori.


[33] Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement cs.CV | cs.AI | cs.LGPDF

Jiahao Qin, Yiwen Wang

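TL;DR: This paper proposes SAR-Net, a cross-domain image registration framework that learns domain-invariant representations through scene-appearance disentanglement. It addresses registration when source and target images differ systematically in intensity. The method decomposes observed images into domain-invariant scene representations and domain-specific appearance codes, and registers via re-rendering rather than direct intensity matching.

Details

Motivation: In cross-domain image registration, domain shift violates the brightness constancy assumption and makes correspondence estimation ill-posed. The paper addresses this fundamental challenge.

Result: On a real-world bidirectional scanning microscopy testbed, SAR-Net achieves 0.885 SSIM and 0.979 NCC, a 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies show that both the scene consistency loss and the domain alignment loss are necessary.

Insight: The core novelty is theory-guided scene-appearance disentanglement for cross-domain registration. It recasts registration as aligning scene representations in a shared latent space rather than matching intensities directly, which provides new theoretical guarantees and a practical framework for handling domain shift.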

Abstract: Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill-posed. We propose SAR-Net, a unified framework that addresses this challenge through principled scene-appearance disentanglement. Our key insight is that observed images can be decomposed into domain-invariant scene representations and domain-specific appearance codes, enabling registration via re-rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross-domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR-Net on bidirectional scanning microscopy, where coupled domain shift and geometric distortion create a challenging real-world testbed. Our method achieves 0.885 SSIM and 0.979 NCC, representing 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies confirm that both scene consistency and domain alignment losses are necessary: removing either degrades performance by 90% SSIM or causes 223x increase in latent alignment error, respectively. Code and data are available at https://github.com/D-ST-Sword/SAR-NET.
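As background for the NCC figure above, a minimal sketch of zero-normalized cross-correlation over flattened intensity values is shown below. Whether the paper uses exactly this variant is an assumption.

```python
import math


def ncc(a, b):
    """Zero-normalized cross-correlation of two equal-length intensity lists.
    Returns a value in [-1, 1]; 0.0 if either input is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator if denominator else 0.0
```

Because the means are subtracted and the result is normalized, NCC is unchanged by a positive linear intensity change such as `b = 2 * a + 5`. That makes it a reasonable similarity score when two domains differ in brightness and contrast, although it does not cover the nonlinear appearance shifts the paper targets.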


[34] Compressing Vision Transformers in Geospatial Transfer Learning with Manifold-Constrained Optimization cs.CV | cs.AIPDF

Thomas Snyder, H. Lexie Yang, Stefan Schnake, Steffen Schotthöfer

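TL;DR: This paper applies the manifold-constrained optimization framework DLRT to compress vision-transformer-based geospatial foundation models during transfer learning, aiming for high compression ratios while preserving downstream task performance.

Details

Motivation: Deploying geospatial foundation models on resource-constrained edge devices is limited by their large parameter counts and by the accuracy loss that compression often causes.

Result: On multiple geospatial benchmarks, the method substantially reduces parameter counts with minimal accuracy loss and outperforms off-the-shelf low-rank methods such as LoRA.

Insight: The novelty lies in manifold-constrained optimization with structured low-dimensional parameterizations aligned with downstream objectives, which better preserves task-specific accuracy during compression.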

Abstract: Deploying geospatial foundation models on resource-constrained edge devices demands compact architectures that maintain high downstream performance. However, their large parameter counts and the accuracy loss often induced by compression limit practical adoption. In this work, we leverage the manifold-constrained optimization framework DLRT to compress large vision transformer-based geospatial foundation models during transfer learning. By enforcing structured low-dimensional parameterizations aligned with downstream objectives, this approach achieves strong compression while preserving task-specific accuracy. We show that the method outperforms off-the-shelf low-rank methods such as LoRA. Experiments on diverse geospatial benchmarks confirm substantial parameter reduction with minimal accuracy loss, enabling high-performing, on-device geospatial models.


[35] Adaptive few-shot learning for robust part quality classification in two-photon lithography cs.CV | cs.LGPDF

Sixian Jia, Ruo-Syuan Mei, Chenhui Shao

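TL;DR: This paper proposes an adaptive computer vision framework for maintaining part quality classification models in two-photon lithography (TPL) across their entire life cycle. Built on a scale-robust backbone, the framework integrates three methods: statistical hypothesis testing based on Linear Discriminant Analysis (LDA) for novelty detection, a two-stage rehearsal-based strategy for few-shot incremental learning, and a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation.

Details

Motivation: Existing static computer vision models cannot effectively detect new defect classes in dynamic manufacturing environments, be updated efficiently from scarce data, or adapt to new part geometries.

Result: Evaluated on a TPL dataset (hemispheres as the source domain and cubes as the target domain, each with good, minor damaged, and damaged classes). The hypothesis test identified new-class batches with 99-100% accuracy; incremental learning integrated a new class at 92% accuracy using only K=20 samples; the domain adaptation model bridged the severe domain gap with only K=5 samples, reaching 96.19% accuracy on the target domain.

Insight: The novelty is the integration of novelty detection, few-shot incremental learning, and domain adaptation into one unified, life-cycle-oriented adaptive framework, giving dynamic production environments a data-efficient and robust solution. Viewed objectively, combining statistical hypothesis testing with deep learning for new-class discovery and adapting DANN to the few-shot setting are technical directions worth borrowing.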

Abstract: Two-photon lithography (TPL) is an advanced additive manufacturing (AM) technique for fabricating high-precision micro-structures. While computer vision (CV) is proven for automated quality control, existing models are often static, rendering them ineffective in dynamic manufacturing environments. These models typically cannot detect new, unseen defect classes, be efficiently updated from scarce data, or adapt to new part geometries. To address this gap, this paper presents an adaptive CV framework for the entire life-cycle of quality model maintenance. The proposed framework is built upon a single, scale-robust backbone model and integrates three key methodologies: (1) a statistical hypothesis testing framework based on Linear Discriminant Analysis (LDA) for novelty detection, (2) a two-stage, rehearsal-based strategy for few-shot incremental learning, and (3) a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation. The framework was evaluated on a TPL dataset featuring hemisphere structures as the source domain and cube structures as the target domain, with each domain categorized into good, minor damaged, and damaged quality classes. The hypothesis testing method successfully identified new class batches with 99-100% accuracy. The incremental learning method integrated a new class at 92% accuracy using only K=20 samples. The domain adaptation model bridged the severe domain gap, achieving 96.19% accuracy on the target domain using only K=5 shots. These results demonstrate a robust and data-efficient solution for deploying and maintaining CV models in evolving production scenarios.


[36] Thermo-LIO: A Novel Multi-Sensor Integrated System for Structural Health Monitoring cs.CVPDF

Chao Yang, Haoyuan Zheng, Yue Ma

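TL;DR: This paper presents Thermo-LIO, a novel multi-sensor integrated system for structural health monitoring. By fusing thermal imaging with high-resolution LiDAR data and combining the result with LiDAR-Inertial Odometry, the system enables accurate, full-coverage monitoring of temperature distributions and defects in large structures. Experimental validation on a bridge and a hall building shows that it detects thermal anomalies and structural defects more accurately than traditional methods.

Details

Motivation: Traditional two-dimensional thermography is non-invasive but limited in assessing complex geometries, inaccessible areas, and subsurface defects. The paper aims to improve the precision and coverage of structural health monitoring by fusing thermal imaging with LiDAR.

Result: In experimental validation on a bridge and a hall building, Thermo-LIO detected thermal anomalies and structural defects more accurately than traditional methods, offering higher diagnostic precision, real-time processing, and expanded inspection coverage.

Insight: The novelty is a multimodal fusion method for thermal imaging and LiDAR, integrated with LiDAR-Inertial Odometry, that achieves precise calibration and synchronization of the data streams and thereby provides a more comprehensive solution for structural health monitoring of large-scale civil infrastructure.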

Abstract: Traditional two-dimensional thermography, despite being non-invasive and useful for defect detection in the construction field, is limited in effectively assessing complex geometries, inaccessible areas, and subsurface defects. This paper introduces Thermo-LIO, a novel multi-sensor system that can enhance Structural Health Monitoring (SHM) by fusing thermal imaging with high-resolution LiDAR. To achieve this, the study first develops a multimodal fusion method combining thermal imaging and LiDAR, enabling precise calibration and synchronization of multimodal data streams to create accurate representations of temperature distributions in buildings. Second, it integrates this fusion approach with LiDAR-Inertial Odometry (LIO), enabling full coverage of large-scale structures and allowing for detailed monitoring of temperature variations and defect detection across inspection cycles. Experimental validations, including case studies on a bridge and a hall building, demonstrate that Thermo-LIO can detect detailed thermal anomalies and structural defects more accurately than traditional methods. The system enhances diagnostic precision, enables real-time processing, and expands inspection coverage, highlighting the crucial role of multimodal sensor integration in advancing SHM methodologies for large-scale civil infrastructure.


[37] Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers cs.CVPDF

Jonas Römer, Timo Dickscheid

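TL;DR: This paper studies whether blockwise self-supervised learning (BWSSL) can be applied to masked video modeling in video vision transformers. The encoder is partitioned into blocks, each optimized with a local masked reconstruction loss, so training requires no end-to-end backpropagation. The method converges and yields representations comparable to end-to-end baselines under linear-probe and retrieval evaluation, while revealing distinctive dynamics in depth-wise representation development, such as earlier exposure of higher-level structure, late-block saturation, and geometry-preserving behavior.

Details

Motivation: End-to-end backpropagation couples all layers through a global error signal, which enables coordinated learning but requires long-range credit assignment. The paper asks whether masked video transformers can be trained without end-to-end backpropagation via blockwise self-supervised learning, and compares blockwise and end-to-end training in terms of learning dynamics and depth-wise representation development.

Result: Across model sizes and partition granularities, blockwise training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies.

Insight: Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a geometry-preserving regime. It can also induce token-level shifts that pooled metrics may miss. Late-block saturation and interface formation contribute to the remaining gap to end-to-end baselines.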

Abstract: End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.


[38] Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking cs.CVPDF

Junze Shi, Yang Yu, Jian Shi, Haibo Luo

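TL;DR: This paper proposes STDTrack, a framework that integrates reliable spatiotemporal dependencies into lightweight trackers. It uses dense video sampling, a temporally propagating spatiotemporal token, and a multi-frame information fusion module to exploit the spatiotemporal information in videos and narrow the performance gap between lightweight and high-performance trackers.

Details

Motivation: Existing transformer-based lightweight trackers universally use sparse sampling during training (only one template and one search image per sequence), which fails to exploit the spatiotemporal information in videos, constrains performance, and leaves a gap between lightweight and high-performance trackers.

Result: State-of-the-art results on six benchmarks. On GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while running at 192 FPS on GPU and 41 FPS on CPU.

Insight: The contributions include: dense video sampling to maximize the use of spatiotemporal information; a temporally propagating spatiotemporal token that guides per-frame feature extraction; a Multi-frame Information Fusion Module (MFIFM) that augments current dependencies with historical context; a Spatiotemporal Token Maintainer (STM) with a quality-based update mechanism that ensures information reliability; and a multi-scale prediction head that adapts to targets of different sizes.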

Abstract: Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training, utilizing only one template and one search image per sequence, which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).


[39] Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams cs.CV | cs.ROPDF

Lachlan Holden, Feras Dayoub, Alberto Candela, David Harvey, Tat-Jun Chin

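TL;DR: This paper proposes using cross-view-localising dual-encoder deep neural networks so that planetary rovers can localise themselves in a local aerial map from limited field-of-view monocular ground-view RGB images. The method uses vision foundation models for semantic segmentation together with high-volume synthetic data to bridge the domain gap to real images, and uses particle filters for state estimation, achieving accurate position estimation over simple and complex trajectories.

Details

Motivation: Real space data suitable for training is scarce, which makes it hard to train machine learning models for accurate localisation in planetary robotics. The work aims to support the advanced autonomy needed by future ground-aerial robotic team missions.

Result: On a dataset of real-world rover trajectories captured in a planetary analogue facility, complemented by synthetic image pairs, combining particle filters with the cross-view networks yields accurate position estimation from sequences of ground-view images. The abstract does not report specific benchmarks or comparisons with the state of the art.

Insight: The contributions include using vision foundation models for semantic segmentation to bridge the domain gap and using high-volume synthetic data to improve generalisation. Viewed objectively, the integration of cross-view networks with particle filters offers a scalable localisation solution for resource-constrained planetary environments.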

Abstract: Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.
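The particle-filter state estimation mentioned above can be sketched in one dimension as a predict, weight, resample loop. This is only a toy illustration under stated assumptions: a Gaussian measurement likelihood stands in for the similarity score a cross-view network would provide, the rover moves a known step per frame, and every name and parameter here is hypothetical.

```python
import math
import random


def particle_filter_1d(measurements, step=1.0, n=2000, lo=0.0, hi=100.0,
                       meas_sigma=2.0, motion_sigma=0.5, seed=0):
    """Return one position estimate per measurement."""
    rng = random.Random(seed)
    # Start with no prior knowledge: particles spread uniformly over the map.
    particles = [rng.uniform(lo, hi) for _ in range(n)]
    estimates = []
    for z in measurements:
        # Predict: apply the known motion plus some process noise.
        particles = [p + step + rng.gauss(0.0, motion_sigma) for p in particles]
        # Weight: Gaussian likelihood (placeholder for a learned similarity).
        weights = [math.exp(-0.5 * ((z - p) / meas_sigma) ** 2) for p in particles]
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n)
        estimates.append(sum(particles) / n)
    return estimates


# Rover starts at 30 and advances one unit per frame for 20 frames.
observed = [30 + k for k in range(1, 21)]
print(round(particle_filter_1d(observed)[-1], 1))
```

In a cross-view setting the state would be a 2D pose on the aerial map and the weights would come from comparing ground-view and aerial embeddings, but the loop structure stays the same.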


[40] SAM-Aug: Leveraging SAM Priors for Few-Shot Parcel Segmentation in Satellite Time Series cs.CVPDF

Kai Hu, Yaozu Feng, Vladimir Lysenko, Ya Guo, Huayi Wu

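TL;DR: This paper proposes SAM-Aug, a framework for few-shot semantic segmentation of satellite image time series when labeled data is scarce. It uses the geometry-aware segmentation capability of the Segment Anything Model (SAM) to generate geometry-aware mask priors from temporal imagery in an unsupervised manner, and integrates these priors into training through the proposed RegionSmoothLoss to improve prediction consistency across temporal frames.

Details

Motivation: In remote sensing, labeled data is scarce or costly to obtain, so existing fully supervised models degrade significantly in few-shot settings, which limits their practical use. An annotation-efficient framework is therefore needed to improve few-shot land cover mapping.

Result: On the PASTIS-R benchmark with 5% labeled data, SAM-Aug achieves a mean test mIoU of 36.21% over three random seeds (42, 2025, 4090), 2.33 percentage points above the state-of-the-art baseline (a 6.89% relative improvement). On the most favorable split (seed=42), test mIoU reaches 40.28%, an 11.2% relative gain, with no additional labeled data.

Insight: The novelty is using foundation models such as SAM as unsupervised generators of geometric priors and incorporating those priors into training as a regularizer through RegionSmoothLoss, which forces the model to keep predictions semantically consistent across the time series. This offers a scalable, plug-and-play way to improve generalization in few-shot remote sensing learning without manual annotation or model fine-tuning.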

Abstract: Few-shot semantic segmentation of time-series remote sensing images remains a critical challenge, particularly in regions where labeled data is scarce or costly to obtain. While state-of-the-art models perform well under full supervision, their performance degrades significantly under limited labeling, limiting their real-world applicability. In this work, we propose SAM-Aug, a new annotation-efficient framework that leverages the geometry-aware segmentation capability of the Segment Anything Model (SAM) to improve few-shot land cover mapping. Our approach constructs cloud-free composite images from temporal sequences and applies SAM in a fully unsupervised manner to generate geometry-aware mask priors. These priors are then integrated into training through a proposed loss function called RegionSmoothLoss, which enforces prediction consistency within each SAM-derived region across temporal frames, effectively regularizing the model to respect semantically coherent structures. Extensive experiments on the PASTIS-R benchmark under a 5 percent labeled setting demonstrate the effectiveness and robustness of SAM-Aug. Averaged over three random seeds (42, 2025, 4090), our method achieves a mean test mIoU of 36.21 percent, outperforming the state-of-the-art baseline by +2.33 percentage points, a relative improvement of 6.89 percent. Notably, on the most favorable split (seed=42), SAM-Aug reaches a test mIoU of 40.28 percent, representing an 11.2 percent relative gain with no additional labeled data. The consistent improvement across all seeds confirms the generalization power of leveraging foundation model priors under annotation scarcity. Our results highlight that vision models like SAM can serve as useful regularizers in few-shot remote sensing learning, offering a scalable and plug-and-play solution for land cover monitoring without requiring manual annotations or model fine-tuning.
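The idea of enforcing prediction consistency within each SAM-derived region can be illustrated with a simple variance-style penalty. The abstract does not give the actual form of RegionSmoothLoss, so the sketch below is an assumption: it flattens pixels into a list, treats `probs` as one class probability per pixel, and penalizes deviation from the region mean. All names are hypothetical.

```python
from collections import defaultdict


def region_smooth_penalty(probs, regions):
    """Mean squared deviation of each pixel's prediction from its region mean.

    probs:   per-pixel predicted probability for one class (flat list)
    regions: per-pixel region id, e.g. from a mask prior (flat list)
    """
    groups = defaultdict(list)
    for p, r in zip(probs, regions):
        groups[r].append(p)
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total / len(probs)


regions = [0, 0, 1, 1]
print(region_smooth_penalty([0.9, 0.9, 0.1, 0.1], regions))  # consistent within regions
print(region_smooth_penalty([0.9, 0.1, 0.9, 0.1], regions))  # inconsistent within regions
```

A penalty of this kind is zero when every region is predicted uniformly and grows as predictions disagree inside a region, which is what makes it usable as a regularizer alongside the supervised loss.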


[41] Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning cs.CVPDF

Yang Li, Aming Wu, Zihao Zhang, Yahong Han

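TL;DR: This paper addresses the generalization challenges that vision-language navigation (VLN) faces in open environments and proposes slow4fast-VLN, a dynamic interactive fast-slow reasoning framework. A fast-reasoning module makes navigation decisions in real time and accumulates memory, while a slow-reasoning module reflects deeply on that memory to extract generalizable experience. The two modules optimize each other through interaction, with the goal of improving navigation in unseen environments and under unseen instructions.

Details

Motivation: Traditional VLN methods usually rest on a closed-set assumption, namely that training and test data share similar image and instruction styles, which limits their applicability in a real open world full of unknown variation. The paper therefore focuses on the General Scene Adaptation (GSA-VLN) task, which aims to learn generalized navigation ability by introducing diverse environments and inconsistent instructions.

Result: The abstract reports no specific quantitative results or benchmark comparisons.

Insight: The main novelty, inspired by the fast and slow cognitive systems of humans, is a dynamic interactive fast-slow reasoning framework that couples a fast module for real-time decisions with a slow module for deep experience distillation and strengthens adaptation to open environments through their interaction. This differs from traditional methods that treat fast and slow reasoning as independent mechanisms.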

Abstract: Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. Towards this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that by means of fast and slow cognition systems, human beings can generate stable policies, which strengthen their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end strategy network, outputs actions via real-time input. It accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction. By leveraging the experiences from slow reasoning, this interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios.


[42] LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models cs.CV | cs.AIPDF

Haoyan Gong, Hongbin Liu

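TL;DR: This paper proposes LP-LLM, an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL for recognizing real-world license plates under severe degradations such as motion blur, low resolution, and complex illumination. Its core innovation is the Character-Aware Multimodal Reasoning Module, which introduces learnable Character Slot Queries that use cross-attention to retrieve fine-grained evidence for each character position from the visual features and injects it into the visual tokens via residual modulation, so that the language model can generate autoregressively from explicit structural priors. Combined with LoRA parameter-efficient fine-tuning, the model adapts to the domain while retaining the generalization capability of the large model.

Details

Motivation: The work targets severe degradation in real-world license plate recognition, together with the artifact interference and error accumulation that arise in the traditional "restoration-then-recognition" two-stage paradigm because pixel-level restoration objectives are misaligned with the semantic goal of character recognition. It also addresses the lack of explicit structural modeling of license plate character sequences (e.g., fixed length, specific order) in existing vision-language models.

Result: Extensive experiments on synthetic and real-world severely degraded datasets show that the method significantly outperforms existing restoration-recognition combinations and general vision-language models, supporting the value of incorporating structured reasoning into large models for low-quality text recognition.

Insight: The novelty is the Character-Aware Multimodal Reasoning Module, which uses learnable Character Slot Queries and residual modulation to inject explicit character-structure priors into the reasoning process of a vision-language model, achieving end-to-end structure-aware recognition. This offers an effective new paradigm, combining structural constraints with the multimodal capability of large models, for low-quality text recognition tasks with fixed formats or sequential structure.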

Abstract: Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing “restoration-then-recognition” two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model. Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.
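The retrieval step performed by the Character Slot Queries is, at its core, scaled dot-product cross-attention: each slot query scores every visual token and returns a weighted average of them. The sketch below shows only that generic mechanism in plain Python. It is not the CMRM implementation; it omits learned projections and the residual modulation step, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attend(queries, tokens):
    """Each query attends over the tokens (keys and values are the tokens)."""
    d = len(tokens[0])
    out = []
    for q in queries:
        # Similarity of this slot query to every visual token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        # Weighted average of the tokens: the evidence retrieved for this slot.
        out.append([sum(wj * t[k] for wj, t in zip(w, tokens)) for k in range(d)])
    return out


# Two toy "visual tokens" and two slot queries, each aligned with one token.
tokens = [[1.0, 0.0], [0.0, 1.0]]
slots = [[10.0, 0.0], [0.0, 10.0]]
print(cross_attend(slots, tokens))
```

With one query per character position, the output has a fixed number of rows regardless of image size, which is how a fixed-length structural prior can be imposed on the visual features.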


[43] LPCAN: Lightweight Pyramid Cross-Attention Network for Rail Surface Defect Detection Using RGB-D Data cs.CVPDF

Jackie Alex, Guoqiang Huan

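TL;DR: This paper proposes a Lightweight Pyramid Cross-Attention Network (LPCANet) for efficient and accurate rail surface defect detection from RGB-D data. The network combines MobileNetv2 as the backbone for RGB feature extraction, a lightweight pyramid module (LPM) for depth processing, a cross-attention mechanism (CAM) for multimodal fusion, and a spatial feature extractor (SFE) for enhanced structural analysis.

Details

Motivation: Current vision-based rail defect detection methods suffer from high computational complexity, excessive parameter counts, and suboptimal accuracy.

Result: Comprehensive evaluation on three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2) shows that LPCANet achieves state-of-the-art performance with only 9.90 million parameters, 2.50 G FLOPs, and an inference speed of 162.60 fps, with significant improvements on several metrics over 18 existing methods.

Insight: The novelty is a lightweight multimodal fusion architecture that combines RGB and depth information through a cross-attention mechanism and a spatial feature extractor, substantially reducing model complexity while keeping high accuracy and real-time speed, and showing good generalization.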

Abstract: This paper addresses the limitations of current vision-based rail defect detection methods, including high computational complexity, excessive parameter counts, and suboptimal accuracy. We propose a Lightweight Pyramid Cross-Attention Network (LPCANet) that leverages RGB-D data for efficient and accurate defect identification. The architecture integrates MobileNetv2 as a backbone for RGB feature extraction with a lightweight pyramid module (LPM) for depth processing, coupled with a cross-attention mechanism (CAM) for multimodal fusion and a spatial feature extractor (SFE) for enhanced structural analysis. Comprehensive evaluations on three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2) demonstrate that LPCANet achieves state-of-the-art performance with only 9.90 million parameters, 2.50 G FLOPs, and 162.60 fps inference speed. Compared to 18 existing methods, LPCANet shows significant improvements, including +1.48% in $S_α$, +0.86% in IOU, and +1.77% in MAE over the best-performing baseline. Ablation studies confirm the critical roles of CAM and SFE, while experiments on non-rail datasets (DAGM2007, MT, Kolektor-SDD2) validate its generalization capability. The proposed framework effectively bridges traditional and deep learning approaches, offering substantial practical value for industrial defect inspection. Future work will focus on further model compression for real-time deployment.


[44] SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL cs.CV | cs.AIPDF

Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui

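TL;DR: This paper proposes SkinFlow, a framework that optimizes the efficiency of visual information transmission through dynamic visual encoding and two-stage reinforcement learning, addressing the "diffuse attention" that prevents general-purpose large vision-language models from separating subtle lesions from background noise in dermatological diagnosis. The method achieves state-of-the-art results on the Fitzpatrick17k benchmark, indicating that optimizing geometric capacity and information flow is more effective than parameter scaling alone.

Details

Motivation: General-purpose large vision-language models perform poorly in dermatology because of "diffuse attention". The paper challenges the assumption that parameter scaling is the only path to medical precision and aims to improve diagnosis by optimizing the efficiency of visual information transmission.

Result: On the Fitzpatrick17k benchmark, the 7B model gains +12.06% in Top-1 accuracy and +28.57% in Top-6 accuracy over large general-purpose models such as Qwen3VL-235B and GPT-5.2, setting a new state of the art.

Insight: The contributions include a Virtual-Width Dynamic Vision Encoder that "unfolds" complex pathological manifolds without adding physical parameters, and a two-stage reinforcement learning strategy that, within a constrained semantic space, first aligns explicit medical descriptions and then reconstructs implicit diagnostic textures. Viewed objectively, the emphasis on geometric capacity and information flow offers an efficient alternative to parameter scaling for medical vision tasks.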

Abstract: General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to “diffuse attention” - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to “unfold” complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.


[45] SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection cs.CV | cs.AIPDF

Chenhao Fu, Han Fang, Xiuzheng Zheng, Wenbo Wei, Yonghua Li

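TL;DR: This paper proposes Synergistic Semantic-Visual Prompting (SSVP), a new method for industrial zero-shot anomaly detection. It fuses diverse visual encodings to improve fine-grained perception: a hierarchical semantic-visual synergy mechanism integrates DINOv3's multi-scale structural priors into the CLIP semantic space, a vision-conditioned prompt generator guides dynamic prompt generation, and a visual-text anomaly mapper establishes a dual-gated calibration paradigm to resolve the discrepancy between global scoring and local evidence.

Details

Motivation: Existing zero-shot anomaly detection methods are constrained by a single visual backbone and struggle to balance global semantic generalization with fine-grained structural discriminability, so a method is needed that effectively fuses diverse visual encodings to improve fine-grained perception.

Result: Extensive evaluation on seven industrial benchmarks validates the robustness of the method. On MVTec-AD, SSVP reaches 93.0% Image-AUROC and 92.2% Pixel-AUROC, significantly outperforming existing zero-shot approaches and achieving state-of-the-art performance.

Insight: The novelty is the synergistic semantic-visual prompting framework: a hierarchical semantic-visual synergy mechanism deeply fuses multi-scale structural priors with the semantic space, while the vision-conditioned prompt generator and the dual-gated calibration paradigm sharpen the anchoring to specific anomaly patterns and improve scoring consistency, which together improve fine-grained discrimination in zero-shot anomaly detection.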

Abstract: Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.


[46] Point Tracking as a Temporal Cue for Robust Myocardial Segmentation in Echocardiography Videos cs.CVPDF

Bahar Khodabakhshian, Nima Hashemi, Armin Saadat, Zahra Gholami, In-Chang Hwang

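TL;DR: This paper proposes Point-Seg, a transformer-based segmentation framework for myocardial segmentation in echocardiography videos. It integrates point tracking as a temporal cue: a point-tracking module trained on a synthetic dataset follows the trajectories of key anatomical landmarks, and a temporal smoothing loss is added, to achieve stable and consistent segmentation across frames.

Details

Motivation: Myocardial segmentation in echocardiography videos is challenging because of low contrast, heavy noise, and anatomical variability. Traditional deep learning models either process frames independently and ignore temporal information, or rely on memory-based feature propagation, which accumulates error over time. The paper aims to overcome both problems and achieve more robust segmentation.

Result: Experiments on public and private echocardiography datasets show that Point-Seg matches state-of-the-art segmentation models in Dice accuracy on high-quality data and achieves better segmentation accuracy and temporal stability on lower-quality data. The method also provides pixel-level myocardial motion information, which is essential for downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection.

Insight: The main novelty is using point tracking as an explicit motion-aware signal to guide segmentation, which avoids memory-based feature accumulation and thereby reduces drift and improves temporal consistency. This offers a reliable and generalizable approach to video segmentation; integrating motion information directly into the segmentation framework is particularly valuable in medical image analysis.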

Abstract: Purpose: Myocardium segmentation in echocardiography videos is a challenging task due to low contrast, noise, and anatomical variability. Traditional deep learning models either process frames independently, ignoring temporal information, or rely on memory-based feature propagation, which accumulates error over time. Methods: We propose Point-Seg, a transformer-based segmentation framework that integrates point tracking as a temporal cue to ensure stable and consistent segmentation of myocardium across frames. Our method leverages a point-tracking module trained on a synthetic echocardiography dataset to track key anatomical landmarks across video sequences. These tracked trajectories provide an explicit motion-aware signal that guides segmentation, reducing drift and eliminating the need for memory-based feature accumulation. Additionally, we incorporate a temporal smoothing loss to further enhance temporal consistency across frames. Results: We evaluate our approach on both public and private echocardiography datasets. Experimental results demonstrate that Point-Seg has statistically similar accuracy in terms of Dice to state-of-the-art segmentation models in high quality echo data, while it achieves better segmentation accuracy in lower quality echo with improved temporal stability. Furthermore, Point-Seg has the key advantage of pixel-level myocardium motion information as opposed to other segmentation methods. Such information is essential in the computation of other downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection. Conclusion: Point-Seg demonstrates that point tracking can serve as an effective temporal cue for consistent video segmentation, offering a reliable and generalizable approach for myocardium segmentation in echocardiography videos. The code is available at https://github.com/DeepRCL/PointSeg.
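A temporal smoothing loss of the kind mentioned above typically penalizes frame-to-frame changes in the predicted masks. The abstract does not specify the exact formulation, so the following is a minimal sketch under that assumption, using flat lists of per-pixel probabilities; the function name is hypothetical.

```python
def temporal_smoothing_loss(frames):
    """Mean squared change between consecutive per-frame predictions.

    frames: list of frames, each a flat list of per-pixel probabilities.
    Assumption: a simple first-order difference penalty, not necessarily
    the formulation used in the paper.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count


stable = [[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]]
flicker = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(temporal_smoothing_loss(stable), temporal_smoothing_loss(flicker))
```

Because true myocardial motion also changes the mask between frames, a penalty like this is usually given a small weight so that it suppresses flicker without freezing real motion.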


[47] SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion cs.CV | cs.AIPDF

Jialu Li, Taiyan Zhou

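TL;DR: This paper proposes SpikeVAEDiff, a two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) with the Versatile Diffusion model to reconstruct high-resolution, semantically rich natural scene images from neural spike signals. In the first stage, VDVAE maps neural signals to latent representations to produce preliminary low-resolution reconstructions. In the second stage, regression models map the neural signals to CLIP vision and text features, which drive Versatile Diffusion to refine the images via image-to-image generation.

Details

Motivation: Reconstructing natural visual scenes from neural activity, in particular from neural spike data with high temporal and spatial resolution, is a key challenge in neuroscience and computer vision. The paper aims to produce finer and semantically more accurate reconstructions than fMRI-based methods.

Result: Evaluated on the Allen Visual Coding-Neuropixels dataset, the study finds that the VISI region shows the most prominent activation and plays a key role in reconstruction quality. Ablation studies show that data from specific brain regions significantly improves reconstruction performance, and both successful and unsuccessful reconstructions are presented, highlighting the challenges of neural decoding.

Insight: The novelty is combining neural spike signals with advanced generative models (VDVAE and Versatile Diffusion) and multimodal features (CLIP) in a two-stage reconstruction framework dedicated to spike data. Viewed objectively, using CLIP features to bridge neural signals and a diffusion model for semantically guided refinement is a novel attempt at cross-modal decoding.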

Abstract: Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.


[48] Disentangle Object and Non-object Infrared Features via Language Guidance cs.CVPDF

Fan Liu, Ting Wu, Chuanyi Zhang, Liang Yao, Xing Ma

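TL;DR: This paper proposes a novel vision-language representation learning paradigm to improve infrared object detection. Textual supervision with rich semantic information guides the model to disentangle object and non-object features in infrared images. A semantic feature alignment module and an object feature disentanglement module are designed to make object features more discriminative and less noisy, yielding strong results on infrared object detection benchmarks.

Details

Motivation: Infrared object detection is essential in complex environments with poor illumination (e.g., darkness, rain, and snow), but infrared images have low contrast and weak edge information, which makes it hard to extract discriminative object features for robust detection.

Result: Superior performance on two benchmarks: 83.7% mAP on M³FD and 86.1% mAP on FLIR.

Insight: The novelty is using the rich semantic information of language (text) as a supervisory signal that guides the vision model to disentangle object and non-object features. Through the alignment and disentanglement modules, cross-modal information is used to make infrared features more discriminative. This vision-language fusion paradigm could carry over to other analysis tasks on low-quality or weakly textured images.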

Abstract: Infrared object detection focuses on identifying and locating objects in complex environments (e.g., dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are entered into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M³FD (83.7% mAP), FLIR (86.1% mAP). Our code will be publicly available once the paper is accepted.


[49] DeTracker: Motion-decoupled Vehicle Detection and Tracking in Unstabilized Satellite Videos cs.CV | eess.IVPDF

Jiajun Chen, Jing Xiao, Shaohan Cao, Yuming Zhu, Liang Liao

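TL;DR: This paper proposes DeTracker, a joint detection-and-tracking framework designed for unstabilized satellite videos. A global-local motion decoupling module separates satellite platform jitter from true object motion, and a temporal dependency feature pyramid strengthens the feature representation of tiny objects. The authors also build SDM-Car-SU, a benchmark dataset that simulates multi-directional, multi-speed platform motion. Experiments show that DeTracker significantly outperforms existing methods on both simulated and real satellite videos.

Details

Motivation: In unstabilized satellite videos, platform jitter and the weak appearance of tiny objects jointly degrade multi-object tracking performance.

Result: DeTracker reaches 61.1% MOTA on the simulated SDM-Car-SU dataset and 47.3% MOTA on real satellite video data, significantly outperforming existing methods.

Insight: The contributions include a global-local motion decoupling module that explicitly separates platform motion from object motion, and a temporal dependency feature pyramid for cross-frame feature fusion. Viewed objectively, the dataset with simulated motion perturbations provides a new benchmark for systematically evaluating tracking robustness.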

Abstract: Satellite videos provide continuous observations of surface dynamics but pose significant challenges for multi-object tracking (MOT), especially under unstabilized conditions where platform jitter and the weak appearance of tiny objects jointly degrade tracking performance. To address this problem, we propose DeTracker, a joint detection-and-tracking framework tailored for unstabilized satellite videos. DeTracker introduces a Global–Local Motion Decoupling (GLMD) module that explicitly separates satellite platform motion from true object motion through global alignment and local refinement, leading to improved trajectory stability and motion estimation accuracy. In addition, a Temporal Dependency Feature Pyramid (TDFP) module is developed to perform cross-frame temporal feature fusion, enhancing the continuity and discriminability of tiny-object representations. We further construct a new benchmark dataset, SDM-Car-SU, which simulates multi-directional and multi-speed platform motions to enable systematic evaluation of tracking robustness under varying motion perturbations. Extensive experiments on both simulated and real unstabilized satellite videos demonstrate that DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU and 47.3% MOTA on real satellite video data.
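The intuition behind separating platform motion from object motion can be shown with a very simple estimator: if most tracked points belong to static background, their median displacement approximates the platform shift, and what remains after subtracting it is each point's own motion. This is an illustrative assumption, not the GLMD module, which the abstract describes only as global alignment followed by local refinement; the function name and data are hypothetical.

```python
from statistics import median


def decouple_motion(displacements):
    """Split per-point frame-to-frame displacements into global and local parts.

    displacements: list of (dx, dy) tuples, one per tracked point.
    Assumption: most points are static background, so the per-axis median
    is a robust estimate of the platform-induced shift.
    """
    gx = median(dx for dx, _ in displacements)
    gy = median(dy for _, dy in displacements)
    local = [(dx - gx, dy - gy) for dx, dy in displacements]
    return (gx, gy), local


# Four background points shifted by jitter (5, -2); one vehicle moves 3 px more in x.
observed = [(5, -2), (5, -2), (5, -2), (8, -2), (5, -2)]
platform, residual = decouple_motion(observed)
print(platform, residual[3])
```

The median is used rather than the mean so that a few genuinely moving objects do not bias the platform estimate.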


[50] Integrating Diverse Assignment Strategies into DETRs cs.CVPDF

Yiwei Zhang, Jin Gao, Hanshi Wang, Fudong Ge, Guan Luo

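TL;DR: This paper proposes LoRA-DETR, a lightweight framework that strengthens the training of DETR-style object detectors by integrating multiple one-to-many label assignment strategies. During training, the method adds several Low-Rank Adaptation (LoRA) branches to the primary network, each implementing a different assignment rule to supply diverse supervisory gradients; at inference the branches are discarded, so no extra computational cost is incurred.

Details

Motivation: The one-to-one matching strategy of DETR-style detectors is simple, but its sparse supervision leads to slow convergence. Existing one-to-many methods usually introduce complex, architecture-specific modifications and focus on a single auxiliary strategy, lacking a unified and scalable design.

Result: Extensive experiments on different baselines validate the method, showing that integrating diverse one-to-many supervision can achieve state-of-the-art performance without compromising the simplicity of the model.

Insight: The core contribution is the finding that performance gains are driven by the diversity of assignment strategies rather than the sheer quantity of supervision. Building on this, the paper proposes a parameter-efficient framework of LoRA branches that injects diverse supervisory gradients during training and adds no inference overhead, balancing robust joint optimization with architectural simplicity.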

Abstract: Label assignment is a critical component in object detectors, particularly within DETR-style frameworks where the one-to-one matching strategy, despite its end-to-end elegance, suffers from slow convergence due to sparse supervision. While recent works have explored one-to-many assignments to enrich supervisory signals, they often introduce complex, architecture-specific modifications and typically focus on a single auxiliary strategy, lacking a unified and scalable design. In this paper, we first systematically investigate the effects of “one-to-many” supervision and reveal a surprising insight that performance gains are driven not by the sheer quantity of supervision, but by the diversity of the assignment strategies employed. This finding suggests that a more elegant, parameter-efficient approach is attainable. Building on this insight, we propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector. Our method augments the primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each instantiating a different one-to-many assignment rule. These branches act as auxiliary modules that inject rich, varied supervisory gradients into the main model and are discarded during inference, thus incurring no additional computational cost. This design promotes robust joint optimization while maintaining the architectural simplicity of the original detector. Extensive experiments on different baselines validate the effectiveness of our approach. Our work presents a new paradigm for enhancing detectors, demonstrating that diverse “one-to-many” supervision can be integrated to achieve state-of-the-art results without compromising model elegance.
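The idea of training-only low-rank branches can be illustrated with a toy linear layer. This is a sketch under stated assumptions, not the LoRA-DETR implementation: the class `LinearWithLoRABranches`, its method names, and the dimensions are hypothetical, and the assignment rules that would supervise each branch are omitted. Each branch adds a low-rank update B·A on top of the shared weight W, and dropping the branches leaves the main path unchanged.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LinearWithLoRABranches:
    """Shared weight W plus several training-only low-rank branches (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank, num_branches, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)
        # Each branch holds A (rank x d_in) and B (d_out x rank).
        # Standard LoRA initialises B to zero; B is random here only so the
        # branch output visibly differs from the main path in this demo.
        self.branches = [(rand(rank, d_in), rand(d_out, rank)) for _ in range(num_branches)]

    def forward(self, x, branch=None):
        y = matvec(self.W, x)
        if branch is None:
            return y  # main path, the only one used at inference
        A, B = self.branches[branch]
        delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
        return [yi + di for yi, di in zip(y, delta)]

    def drop_branches(self):
        """Discard the auxiliary branches; the main path is untouched."""
        self.branches = []


layer = LinearWithLoRABranches(d_in=4, d_out=3, rank=2, num_branches=2)
x = [1.0, -2.0, 0.5, 3.0]
main_out = layer.forward(x)
aux_out = layer.forward(x, branch=0)
layer.drop_branches()
after_drop = layer.forward(x)
```

In a real detector each branch would receive its own one-to-many assignment loss, and gradients from those losses would flow into the shared weights during training.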


[51] Hybrid guided variational autoencoder for visual place recognition cs.CV | cs.AIPDF

Ni Wang, Zihan You, Emre Neftci, Thorben Schoepe

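TL;DR: This paper proposes a hybrid guided variational autoencoder (guided VAE) for visual place recognition (VPR). It targets the problem that existing SOTA models have a large memory footprint and are hard to deploy on mobile platforms, while compact models lack robustness and generalization. The model combines event-based vision sensors with a spiking-neural-network encoder. On a self-collected indoor VPR dataset it achieves classification performance comparable to SOTA, remains robust under different illumination conditions, and generalizes to unknown scenes.

Details

Motivation: To enable precise localization of autonomous agents (such as cars, robots, and drones) in GPS-denied indoor environments, while overcoming the high memory footprint and poor mobile deployability of existing VPR models and the limited robustness and generalization of compact models.

Result: On the newly built indoor VPR dataset, the model successfully disentangles the visual features of 16 distinct places, with classification performance comparable to other SOTA methods and robust behavior under different illumination conditions. When tested with novel visual inputs from unknown scenes, the model can distinguish between these places, demonstrating high generalization capability obtained by learning the essential features of location.

Insight: The novelty lies in combining event-based vision sensors with a novel guided VAE whose encoder is a spiking neural network (SNN) compatible with low-power, low-latency neuromorphic hardware. The result is a compact, robust, and generalizable VPR model that could significantly enhance mobile robot navigation in known and unknown indoor environments.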

Abstract: Autonomous agents such as cars, robots and drones need to precisely localize themselves in diverse environments, including in GPS-denied indoor environments. One approach for precise localization is visual place recognition (VPR), which estimates the place of an image based on previously seen places. State-of-the-art VPR models require high amounts of memory, making them unwieldy for mobile deployment, while more compact models lack robustness and generalization capabilities. This work overcomes these limitations for robotics using a combination of event-based vision sensors and an event-based novel guided variational autoencoder (VAE). The encoder part of our model is based on a spiking neural network model which is compatible with power-efficient low latency neuromorphic hardware. The VAE successfully disentangles the visual features of 16 distinct places in our new indoor VPR dataset with a classification performance comparable to other state-of-the-art approaches, while also showing robust performance under various illumination conditions. When tested with novel visual inputs from unknown scenes, our model can distinguish between these places, which demonstrates a high generalization capability by learning the essential features of location. Our compact and robust guided VAE with generalization capabilities is a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.


[52] PhyRPR: Training-Free Physics-Constrained Video Generation cs.CVPDF

Yibo Zhao, Hengjia Li, Xiaofei He, Boxi Wu

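TL;DR: This paper proposes PhyRPR, a training-free three-stage video generation framework that addresses the difficulty existing diffusion models have in satisfying physical constraints. By decoupling physical understanding from visual synthesis, the framework performs physical state reasoning, deterministic motion planning, and appearance refinement in sequence, enabling explicit physical control during generation.

Details

Motivation: Existing diffusion-based video generation models can synthesize visually plausible videos but often fail to satisfy physical constraints. A key reason is that they are usually single-stage, entangling high-level physical understanding with low-level visual synthesis, which makes content requiring explicit physical reasoning hard to generate.

Result: Extensive experiments under physics constraints show that the method consistently improves the physical plausibility and motion controllability of generated videos.

Insight: The core contribution is a training-free three-stage pipeline (PhyReason–PhyPlan–PhyRefine) that decouples physical reasoning from visual synthesis. A deterministic motion scaffold and a latent fusion strategy inject the planned dynamics into diffusion sampling to refine appearance, giving explicit physical control over generation.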

Abstract: Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR: PhyReason–PhyPlan–PhyRefine, which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.


[53] Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain cs.CVPDF

Lianying Chao, Haoran Cai, Xubin Li, Kai Zhang, Sijie Wu

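TL;DR: This paper proposes a Domain-specific Image Captioning Model (DICModel) for the information and communications technology (ICT) domain, using a multi-stage progressive training strategy to address the lack of domain knowledge in general-purpose multi-modal LLMs. The model is trained on synthetic data, expert-annotated data, and instruction-tuning data; experiments show that the 7B-parameter model surpasses larger SOTA models on several metrics.

Details

Motivation: In ICT, domain knowledge is hidden in both the text and image modalities, yet traditional methods cannot parse images and general-purpose multi-modal LLMs lack sufficient domain knowledge. The goal is to extract logical text from domain images efficiently and accurately.

Result: On the standard evaluation system built for the ICT domain, DICModel with only 7B parameters outperforms other SOTA models with 32B parameters. Compared with SOTA models of 7B and 32B parameters, BLEU improves by approximately 56.8% and 20.8%, respectively; on objective questions constructed by experts, accuracy is 1% higher than Qwen2.5-VL 32B.

Insight: The novelty is a multi-stage progressive training strategy that combines synthetic data, expert annotation, and instruction tuning to reach SOTA domain-specific image captioning with a relatively small model. Reusable ideas include synthesizing high-quality domain data with tools and LLMs, and annotation and evaluation carried out jointly by experts and models.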

Abstract: In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.


[54] Beyond the final layer: Attentive multilayer fusion for vision transformers cs.CVPDF

Laure Ciernik, Marco Morik, Lukas Thede, Luca Eyring, Shinichi Nakajima

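TL;DR: This paper proposes Attentive Multilayer Fusion, a method for improving downstream-task adaptation of Vision Transformer (ViT) foundation models. It uses an attention mechanism to dynamically fuse representations from all ViT layers rather than relying only on the last layer, making better use of task-relevant information distributed across the network.

Details

Motivation: To address the limitation that linear probing uses only last-layer representations when adapting to downstream tasks, even though task-relevant information is distributed across multiple levels of the network rather than encoded solely in the last layer.

Result: Experiments on 20 diverse datasets and multiple pretrained foundation models show consistent, substantial gains over standard linear probing. Attention heatmaps further reveal that tasks differing from the pre-training domain benefit most from intermediate-layer representations.

Insight: The novelty is an attention-based multilayer fusion mechanism that learns to identify the layers most relevant to the target task and combines low-level structural cues with high-level semantic abstractions. The key insight is the value of intermediate-layer information in probing-based adaptation, together with a principled, task-aware way to unlock it.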

Abstract: With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate layer information and demonstrate a principled, task-aware approach for unlocking its potential in probing-based adaptation.
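One simple way to picture attention over layers is a single learned query vector that scores each layer's feature and mixes the features with softmax weights. The sketch below assumes exactly that; the paper's actual probe may differ (for example in how tokens are pooled or how many heads are used), and the function names and toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fusion(layer_feats, query):
    """Fuse one feature vector per layer using scaled dot-product attention.

    layer_feats: list of per-layer feature vectors, all of dimension d.
    query: a (learned) vector of dimension d; fixed here for illustration.
    Returns the fused vector and the per-layer attention weights.
    """
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d) for feat in layer_feats]
    weights = softmax(scores)
    fused = [sum(w * feat[i] for w, feat in zip(weights, layer_feats)) for i in range(d)]
    return fused, weights


# Toy features for three layers (made up); the query favours the second dimension.
toy_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_vec, layer_weights = attentive_fusion(toy_layers, query=[0.0, 2.0])
print(layer_weights)
```

The fused vector would then feed a lightweight classification head, with the backbone frozen as in linear probing; inspecting the weights per layer gives the kind of heatmap the abstract mentions.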


[55] See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval cs.CVPDF

Mingyu Jeon, Sungjin Han, Jinkwon Hwang, Minchol Kwon, Jonghee Kim

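TL;DR: This paper proposes SMORE, a framework that addresses the memory inefficiency caused by dense frame processing in video moment retrieval. Through query-guided semantic encoding, query-aware importance modulation, and adaptive frame compression, it substantially reduces memory usage while maintaining high information resolution.

Details

Motivation: Existing video moment retrieval methods rely on sparse frame sampling, which can lose information, especially in long videos, while multimodal large language models face memory constraints when processing video.

Result: SMORE achieves state-of-the-art performance on three benchmarks: QVHighlights, Charades-STA, and ActivityNet-Captions.

Insight: The novelty lies in integrating query information deeply into video encoding, achieving memory-efficient, high-resolution video understanding through semantic alignment and adaptive compression rather than simply lowering the frame sampling rate.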

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.


[56] Radiomics-Integrated Deep Learning with Hierarchical Loss for Osteosarcoma Histology Classification cs.CV | cs.AIPDF

Yaxi Chen, Zi Ye, Shaheer U. Saeed, Oliver Yu, Simin Ni

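TL;DR: This paper proposes a deep learning model that combines radiomic features with a hierarchical loss function for osteosarcoma histology classification, aiming to automate the assessment of tumor-region viability after neoadjuvant chemotherapy.

Details

Motivation: To address the labor-intensive, subjective, and inter-observer-variable nature of manual histopathological assessment of osteosarcoma, and to tackle the significant performance drop of deep learning models on test data split at the patient level.

Result: On the TCIA Osteosarcoma Tumor Assessment dataset, the proposed approach (radiomic feature inputs plus hierarchical loss) significantly improves classification performance, reaching a new SOTA for this application on this public dataset.

Insight: The contributions are using radiomic features as a multimodal input to improve performance and interpretability, and optimizing two binary classification tasks (tumor vs. non-tumor, viable vs. non-viable) through a hierarchical loss to improve per-class performance, instead of the conventional “flat” three-class classification.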

Abstract: Osteosarcoma (OS) is an aggressive primary bone malignancy. Accurate histopathological assessment of viable versus non-viable tumor regions after neoadjuvant chemotherapy is critical for prognosis and treatment planning, yet manual evaluation remains labor-intensive, subjective, and prone to inter-observer variability. Recent advances in digital pathology have enabled automated necrosis quantification. Evaluating on test data, independently sampled on patient-level, revealed that the deep learning model performance dropped significantly from the tile-level generalization ability reported in previous studies. First, this work proposes the use of radiomic features as additional input in model training. We show that, although they are derived from the images, such a multimodal input effectively improved the classification performance, in addition to its added benefits in interpretability. Second, this work proposes to optimize two binary classification tasks with hierarchical classes (i.e. tumor-vs-non-tumor and viable-vs-non-viable), as opposed to the alternative “flat” three-class classification task (i.e. non-tumor, non-viable tumor, viable tumor), thereby enabling a hierarchical loss. We show that with such a hierarchical loss, with trainable weightings between the two tasks, the per-class performance can be improved significantly. Using the TCIA OS Tumor Assessment dataset, we experimentally demonstrate the benefits from each of the proposed new approaches and their combination, setting what we consider a new state-of-the-art performance on this open dataset for this application. Code and trained models: https://github.com/YaxiiC/RadiomicsOS.git.
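The two-task formulation can be sketched as a weighted sum of two binary cross-entropy terms. The following is an illustration under assumptions, not the authors' loss: the label encoding (0 = non-tumor, 1 = non-viable tumor, 2 = viable tumor), the choice to skip the viability term for non-tumor tiles, and the fixed weights `w_tumor` and `w_viable` (trainable in the paper) are all choices made here for the example.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def hierarchical_loss(p_tumor, p_viable, label, w_tumor=1.0, w_viable=1.0):
    """Weighted sum of tumor-vs-non-tumor and viable-vs-non-viable losses.

    label: 0 = non-tumor, 1 = non-viable tumor, 2 = viable tumor (assumed encoding).
    The viability term is applied only to tumor tiles (an assumption of this sketch).
    """
    is_tumor = 1 if label in (1, 2) else 0
    loss = w_tumor * bce(p_tumor, is_tumor)
    if is_tumor:
        is_viable = 1 if label == 2 else 0
        loss += w_viable * bce(p_viable, is_viable)
    return loss


# A viable-tumor tile: a confident, correct viability prediction costs less.
good = hierarchical_loss(p_tumor=0.9, p_viable=0.8, label=2)
bad = hierarchical_loss(p_tumor=0.9, p_viable=0.2, label=2)
print(good, bad)
```

The three flat classes can be recovered from the two heads, for instance as P(viable tumor) = p_tumor · p_viable, which is one reason the hierarchical view is a drop-in alternative to a three-way softmax.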


[57] Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs cs.CVPDF

Rui Zhu, Xin Shen, Shuchen Wu, Chenxi Miao, Xin Yu

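TL;DR: This paper introduces Video-MSR, the first benchmark specifically designed to evaluate the multi-hop spatial reasoning capabilities of multimodal large language models in dynamic video scenarios. It comprises four tasks, with 3,052 high-quality video instances and 4,993 question-answer pairs. An evaluation of 20 advanced MLLMs reveals significant weaknesses in multi-step spatial reasoning. To improve this capability, the authors also build the MSR-9K instruction-tuning dataset and fine-tune Qwen-VL on it, obtaining a 7.82% absolute improvement on Video-MSR.

Details

Motivation: Existing benchmarks mainly focus on single-step perception-to-judgment tasks, while scenarios requiring complex visual-spatial logical chains (multi-hop spatial reasoning) remain underexplored. This paper aims to fill that gap by evaluating and improving the multi-hop spatial reasoning of MLLMs in dynamic videos.

Result: The evaluation of 20 SOTA MLLMs shows that models are proficient at surface-level perception but drop significantly on MSR tasks, often suffering from spatial disorientation and hallucination. Fine-tuning Qwen-VL on MSR-9K yields a +7.82% absolute improvement on the Video-MSR benchmark.

Insight: The novelty is Video-MSR, the first benchmark for multi-hop spatial reasoning in dynamic video, with four systematically designed tasks. Viewed objectively, the scalable, visually grounded construction pipeline (model generation combined with human verification) and the dedicated MSR-9K instruction-tuning dataset are effective routes to stronger complex spatial reasoning and provide an important foundation for future research.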

Abstract: Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.


[58] Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs? cs.CV | cs.AI | cs.LGPDF

David Reid, Ognjen Arandjelovic

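TL;DR: This paper is the first to apply the Vision Transformer (ViT) to the automatic identification of semantic elements on ancient Roman coins, using fully automatic learning from multi-modal data (images and unstructured text), and compares its performance with convolutional neural network (CNN) models.

Details

Motivation: To tackle the challenge of identifying semantic elements in the automated analysis of ancient coins, helping researchers and collectors extract historical information from large coin collections, and to explore the potential of the ViT architecture in this area.

Result: On the task of identifying semantic elements on ancient coins, the ViT models outperform the newly trained CNN models in accuracy; the abstract does not name specific benchmarks or state whether SOTA is reached.

Insight: The novelty is the first use of ViT in ancient coin analysis, with end-to-end learning from multi-modal data. Viewed objectively, this shows the potential of Transformers in fine-grained visual tasks and may offer new directions for cultural heritage analysis.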

Abstract: Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coins analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.


Darya Baranouskaya, Andrea Cavallaro

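TL;DR: PrivLEX is a novel image privacy classifier that bases its decisions on legally defined personal data concepts. It is the first interpretable privacy classifier that leverages the recognition capabilities of vision-language models and is aligned with legal concepts. Using zero-shot VLM concept detection, it provides interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training.

Details

Motivation: To address the lack of image privacy classification methods that are both aligned with legal concepts and interpretable, by detecting private content in images on the basis of legally defined personal data concepts.

Result: The paper demonstrates that PrivLEX can identify personal data concepts present in images, and further analyses how sensitive human annotators of image privacy datasets perceive these concepts to be.

Insight: The novelty lies in combining the zero-shot recognition capability of vision-language models with legal concepts to build an interpretable concept bottleneck model that needs no explicit concept labels for training, giving privacy classification legally aligned explanations.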

Abstract: We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX’s ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.


[60] MAD: Motion Appearance Decoupling for efficient Driving World Models cs.CVPDF

Ahmad Rahimi, Valentin Gerard, Eloi Zablocki, Matthieu Cord, Alexandre Alahi

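TL;DR: This paper proposes MAD (Motion Appearance Decoupling), an efficient adaptation framework that turns generalist video diffusion models into controllable world models for autonomous driving. The core idea is to decouple motion learning from appearance synthesis: the model first predicts structured motion (videos of skeletonized agents and scene elements) and then synthesizes realistic RGB video conditioned on that motion sequence, following a reasoning-rendering paradigm. The approach significantly reduces computational cost.

Details

Motivation: Current video diffusion models can generate photorealistic, temporally coherent videos, but as world models for autonomous driving they fall short on structured motion and physically consistent interactions. Adapting these generalist models directly to driving typically requires large amounts of domain-specific data and costly fine-tuning, so a more efficient adaptation method is needed.

Result: Experiments show the method is highly efficient: when adapting SVD, it matches prior SOTA models with less than 6% of their compute. Scaled to LTX, the resulting MAD-LTX model outperforms all open-source competitors and supports a comprehensive suite of text, ego, and object controls.

Insight: The main contribution is a two-stage learning paradigm that decouples motion from appearance. Mirroring a reasoning-rendering process, it lets the model first focus on physical and social plausibility and then render appearance, yielding an efficient and controllable driving world model. Viewed objectively, the decoupling cleanly separates dynamics reasoning from appearance generation, reducing the complexity and compute needed for adaptation.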

Abstract: Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: https://vita-epfl.github.io/MAD-World-Model/


[61] V-DPM: 4D Video Reconstruction with Dynamic Point Maps cs.CVPDF

Edgar Sucar, Eldar Insafutdinov, Zihang Lai, Andrea Vedaldi

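TL;DR: This paper proposes V-DPM, a method for 4D video reconstruction of dynamic scenes. Building on Dynamic Point Maps (DPMs), which extend DUSt3R's static point maps to dynamic content, it adapts VGGT, a powerful 3D reconstructor, to predict directly from video input a representation containing 3D shape and full 3D point motion, with no post-processing optimization.

Details

Motivation: Existing Dynamic Point Maps (DPMs) are limited to image pairs and, like DUSt3R, require post-processing optimization when more than two views are involved. The authors argue that DPMs are more useful when applied to video, and therefore aim to reconstruct dynamic 3D scenes from video directly and efficiently.

Result: The method achieves state-of-the-art performance in 3D and 4D reconstruction of dynamic scenes. In particular, it recovers not only dynamic depth but also the full 3D motion of every point in the scene.

Insight: The main contributions are: (1) a formulation of DPMs for video input that maximizes representational power, facilitates neural prediction, and allows reuse of pretrained models; and (2) showing that VGGT, although trained on static scenes, can be adapted into an effective V-DPM predictor with only a modest amount of synthetic data, giving end-to-end prediction from video to a dynamic 3D representation.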

Abstract: Powerful 3D representations such as DUSt3R invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend this concept to dynamic 3D content by additionally representing scene motion. However, existing DPMs are limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to formulate DPMs for video input in a way that maximizes representational power, facilitates neural prediction, and enables reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent and powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data is sufficient to adapt it into an effective V-DPM predictor. Our approach achieves state-of-the-art performance in 3D and 4D reconstruction for dynamic scenes. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the full 3D motion of every point in the scene.


[62] Video Joint-Embedding Predictive Architectures for Facial Expression Recognition cs.CV | cs.HCPDF

Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, Elisabeth André

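TL;DR: This paper presents a novel application of Video Joint-Embedding Predictive Architectures (V-JEPA) to facial expression recognition (FER). Pre-training predicts the embeddings of masked regions from the embeddings of visible regions, avoiding the tendency of conventional pixel-level reconstruction to capture irrelevant background information. Using a pre-trained V-JEPA video encoder, shallow classifiers trained on the RAVDESS and CREMA-D datasets achieve SOTA performance and show strong cross-dataset generalization.

Details

Motivation: To address the reliance of conventional video-understanding pre-training on pixel-level reconstruction, which may capture irrelevant background information (such as pixel color), and to explore the potential of purely embedding-based pre-training for facial expression recognition.

Result: The method achieves SOTA performance on RAVDESS, outperforms all other vision-based methods on CREMA-D (+1.48 WAR), and shows strong generalization in cross-dataset evaluation.

Insight: The novelty is the first application of V-JEPA, a self-supervised pre-training paradigm based on embedding prediction, to FER. Its key advantage is that predicting in embedding space lets the encoder learn more robust, task-relevant video representations without interference from irrelevant details, improving performance and generalization.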

Abstract: This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This lets the trained encoder avoid capturing irrelevant information about a given video, such as the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers using the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.


[63] Bipartite Mode Matching for Vision Training Set Search from a Hierarchical Data Server cs.CVPDF

Yue Yao, Ruining Yang, Tom Gedeon

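TL;DR: This paper proposes a training set search method for settings where annotating target-domain data is infeasible. It builds a hierarchical data server and designs a bipartite mode matching algorithm (BMM) that searches a large-scale data server for source data aligned with the modes of the target domain, in order to construct a high-quality training set.

Details

Motivation: To determine how to search a large-scale data server for structurally matched source data to build a training set when the target domain cannot be annotated in real time, compensating for the fact that prior work optimizes only the algorithm and overlooks the structure of the data server.

Result: On object re-identification (re-ID) and detection tasks, training sets found by BMM have a smaller domain gap to the target domain and yield higher model accuracy; combining BMM with existing UDA methods such as pseudo-labeling improves performance further.

Insight: The novelty is the hierarchical data server structure together with the bipartite mode matching algorithm, which enables data-centric domain adaptation (orthogonal to model-centric methods) and improves model performance by optimizing the structure of the data server.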

Abstract: We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. For this problem, because the target domain usually exhibits distinct modes (i.e., semantic clusters representing data distribution), if the training set does not contain these target modes, the model performance would be compromised. While prior works improve algorithms iteratively, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching algorithm (BMM) to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which might be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-to-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets have higher accuracy than those trained otherwise. BMM allows data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods. By combining the BMM with existing UDA methods like pseudo-labeling, further improvement is observed.
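The one-to-one matching step can be illustrated with a generic minimum-cost bipartite assignment. The sketch below is not the BMM algorithm itself: it ignores the hierarchical server tree, uses a brute-force search that is only practical for a handful of modes, and the cost matrix is made up. In practice a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would replace the exhaustive loop.

```python
from itertools import permutations


def match_modes(cost):
    """Minimum-cost one-to-one assignment of target modes to source modes.

    cost[t][s] is a distance between target mode t and source mode s
    (for example a distance between cluster statistics; hypothetical here).
    Requires len(cost) <= len(cost[0]). Brute force, for illustration only.
    """
    n_targets, n_sources = len(cost), len(cost[0])
    best_assignment, best_total = None, float("inf")
    for candidate in permutations(range(n_sources), n_targets):
        total = sum(cost[t][s] for t, s in enumerate(candidate))
        if total < best_total:
            best_assignment, best_total = candidate, total
    return list(best_assignment), best_total


# Made-up distances between 3 target modes (rows) and 3 source modes (columns).
toy_cost = [
    [0.90, 0.10, 0.50],
    [0.20, 0.80, 0.70],
    [0.30, 0.15, 0.60],
]
assignment, total_cost = match_modes(toy_cost)
print(assignment, total_cost)
```

In this toy matrix, matching each target greedily to its nearest free source gives a total of 0.9, while the joint optimum is lower, which is the reason to solve the assignment globally rather than one mode at a time.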


[64] OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding cs.CVPDF

Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun

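TL;DR: This paper proposes OpenVoxel, a training-free method for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding. Starting from a sparse voxel rasterization model obtained from multi-view images, it produces meaningful groups that describe the different objects in the scene, and uses vision-language models and multi-modal large language models to caption each group. The resulting informative scene map supports tasks such as open-vocabulary segmentation and referring expression segmentation.

Details

Motivation: Existing methods usually require training or depend on embeddings from a CLIP/BERT text encoder. The aim is a training-free approach to open-vocabulary 3D scene understanding that performs text-to-text search directly with multi-modal large language models.

Result: In extensive experiments, the method outperforms recent work, particularly on complex referring expression segmentation tasks.

Insight: The main contribution is a fully training-free pipeline that avoids embeddings from a specific text encoder and relies directly on multi-modal large language models for grouping and captioning, which simplifies the pipeline and may improve robustness to complex natural-language queries.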

Abstract: We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.


[65] Iterative Differential Entropy Minimization (IDEM) method for fine rigid pairwise 3D Point Cloud Registration: A Focus on the Metric cs.CVPDF

Emmanuele Barberi, Felice Sfravara, Filippo Cucinotta

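TL;DR: This paper proposes a differential-entropy-based metric, used in a method called Iterative Differential Entropy Minimization (IDEM), for fine rigid pairwise 3D point cloud registration. The method aligns point clouds by optimizing the differential entropy metric, does not depend on the choice of a fixed point cloud, and shows a clear minimum at the best alignment as the transformation varies.

Details

Motivation: Traditional point cloud registration methods such as ICP use Euclidean distance metrics such as RMSE, which perform poorly under density differences, noise, holes, and limited overlap; they also require choosing a fixed point cloud and lack commutativity. The authors aim to address these problems with a more robust metric.

Result: In multiple case studies, IDEM is compared with RMSE, Chamfer distance, and Hausdorff distance. The results show that IDEM still aligns point clouds effectively under density differences, noise, holes, and partial overlap, where RMSE does not always reach the optimal alignment.

Insight: The novelty is using differential entropy as the objective function for point cloud registration. The metric is commutative, does not depend on the choice of a fixed point cloud, and handles the non-ideal point clouds found in real scenarios. Viewed objectively, this offers a more general and robust registration framework that could carry over to other geometric alignment tasks.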

Abstract: Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well-prealigned and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios, both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.
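The remark that nearest-neighbour Euclidean metrics lack commutativity can be checked on a tiny example. The sketch below implements the baseline metrics only (a one-directional nearest-neighbour RMSE of the kind used in ICP-style objectives, and a symmetric Chamfer distance in its squared-distance form); it does not implement the proposed entropy metric, and the point sets are made up.

```python
import math


def nn_rmse(src, dst):
    """RMSE of distances from each point in src to its nearest neighbour in dst."""
    squared = [min(math.dist(p, q) ** 2 for q in dst) for p in src]
    return math.sqrt(sum(squared) / len(squared))


def chamfer(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both directions.

    Definitions vary in the literature (squared vs. unsquared, sum vs. mean);
    this is one common form.
    """
    a_to_b = sum(min(math.dist(p, q) ** 2 for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(p, q) ** 2 for q in a) for p in b) / len(b)
    return a_to_b + b_to_a


# Cloud B contains cloud A plus one extra point, mimicking partial overlap.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

rmse_a_to_b = nn_rmse(cloud_a, cloud_b)  # every point of A has an exact match in B
rmse_b_to_a = nn_rmse(cloud_b, cloud_a)  # the extra point in B has no match in A
print(rmse_a_to_b, rmse_b_to_a, chamfer(cloud_a, cloud_b))
```

The one-directional RMSE depends on which cloud is treated as fixed, whereas the Chamfer value is the same in either order, which is the property a commutative registration metric needs.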


[66] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets cs.CV | cs.AI | cs.ROPDF

Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira

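TL;DR: This paper proposes MANGO, a sim2real technique based on unpaired image translation. With a novel segmentation-conditioned InfoNCE loss, a highly regularized discriminator design, and a modified PatchNCE loss, it addresses the distribution shift that camera viewpoint changes cause for robot vision policies. Using only a small amount of real-world fixed-camera data, it generates diverse unseen viewpoints from simulated observations and improves the robustness of imitation-learning policies.

Details

Motivation: Robot vision policies are brittle to the distribution shift caused by camera viewpoint changes. Simulation makes it possible to collect demonstrations from many viewpoints at scale, but poses a visual sim2real challenge: simulated images must be translated into realistic ones before they can strengthen policy training.

Result: On the sim2real image translation task, MANGO outperforms all other image translation methods tested; imitation-learning policies trained on MANGO-augmented data reach success rates as high as 60% on viewpoints where the non-augmented policy fails completely.

Insight: The contributions are the segmentation-conditioned InfoNCE loss, the highly regularized discriminator design, and the modified PatchNCE loss, which together maintain viewpoint consistency during sim2real translation. Viewed objectively, generating diverse viewpoints from a small amount of real data effectively improves viewpoint robustness and suggests a new direction for sim2real transfer.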

Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO – an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60% on views that the non-augmented policy fails completely on.
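As background for the InfoNCE-style losses named above, here is a minimal sketch of the plain InfoNCE objective for a single query: a softmax cross-entropy that pulls the query toward its positive and away from the negatives. It does not implement the segmentation-conditioned variant or the modified PatchNCE loss from the paper; the function name, the cosine similarity choice, and the temperature of 0.07 are assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.07):
    """Plain InfoNCE loss: -log( exp(s_pos) / (exp(s_pos) + sum exp(s_neg)) )."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings (made up): a well-aligned positive gives a near-zero loss.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned_loss, misaligned_loss)
```

In patch-based translation losses of this family, the query is typically a feature from the translated image and the positive is the feature at the same location in the input; how segmentation enters the choice of positives and negatives in MANGO is not specified in the abstract.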


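[67] GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis cs.CV

Manning Gao, Leheng Zhang, Shiqin Han, Haifeng Hu, Yuncheng Jiang

TL;DR: This paper proposes a two-stage Groupwise Ranking and Calibration Framework (GRCF) for multimodal sentiment analysis, targeting the sensitivity of conventional point-wise regression to label noise and its neglect of the relative order between samples. The framework builds a fine-grained ordinal structure with an advantage-weighted dynamic margin ranking loss and calibrates prediction magnitudes with an MAE-driven objective. It reaches SOTA performance on regression tasks and generalizes well to classification tasks.

Details

Motivation: Most existing work uses point-wise regression, which is sensitive to label noise and ignores relative sentiment intensity between samples. Later pairwise ordinal learning frameworks capture relative order, but they give all comparisons uniform importance and use static ranking margins that cannot reflect the varying semantic distances between sentiment groups.

Result: GRCF achieves state-of-the-art performance on core regression benchmarks and generalizes strongly to classification tasks such as multimodal humor detection and sarcasm detection.

Insight: The novelty lies in bringing the idea of GRPO into multimodal sentiment analysis through a two-stage framework. The advantage-weighted dynamic margin ranking loss adaptively focuses on hard-to-rank samples and reflects the varying semantic distances between sentiment groups, while the MAE loss calibrates absolute scores, so the framework covers both relative ordinal structure and prediction calibration.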

Abstract: Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
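The contrast between a static and a dynamic ranking margin can be made concrete with a small sketch. The code below is not the paper's Advantage-Weighted Dynamic Margin Ranking Loss (the abstract gives no formula); it is a generic pairwise hinge loss in which the margin is assumed to grow linearly with the label gap, followed by a plain MAE term standing in for the Stage 2 calibration objective. The names and the `alpha` parameter are hypothetical.

```python
def dynamic_margin_ranking_loss(scores, labels, alpha=0.5):
    """Pairwise hinge loss whose margin scales with the label gap (assumed form)."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                margin = alpha * (labels[i] - labels[j])  # larger gap, larger margin
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return total / pairs if pairs else 0.0

def mae(scores, labels):
    """Mean absolute error, used here as a stand-in for magnitude calibration."""
    return sum(abs(s - y) for s, y in zip(scores, labels)) / len(scores)

labels = [-2.0, 0.0, 3.0]            # hypothetical sentiment intensities
well_ordered = [-2.0, 0.0, 3.0]      # predictions that respect the order
reversed_order = [3.0, 0.0, -2.0]    # predictions with the order flipped

print(dynamic_margin_ranking_loss(well_ordered, labels))
print(dynamic_margin_ranking_loss(reversed_order, labels))
print(mae([1.0, 2.0], [0.0, 4.0]))
```

A ranking term of this kind only constrains score differences, which is why a separate absolute-error term is needed to pin down the magnitudes.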


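[68] CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems cs.CV | cs.AI

Yonglin Tian, Qiyao Zhang, Wei Xu, Yutong Wang, Yihao Wu

TL;DR: This paper introduces CogRail, a new benchmark for evaluating the cognitive intrusion perception of vision-language models (VLMs) in intelligent railway transportation systems. The benchmark combines open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. The authors systematically evaluate state-of-the-art VLMs and propose a joint fine-tuning framework that integrates three core tasks (position perception, movement prediction, and threat analysis), which significantly improves domain-specific performance.

Details

Motivation: Existing railway intrusion perception systems are usually limited to object classification within a fixed visual scope and rely on rule-based heuristics to decide intrusion status, often overlooking targets that pose latent intrusion risks. Accurate, early risk perception requires cognition of the target's spatial context and temporal dynamics, which is challenging for conventional vision models.

Result: Extensive experiments show that current large-scale multimodal models struggle with the complex spatio-temporal reasoning that cognitive intrusion perception requires, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, the proposed joint fine-tuning framework significantly improves model performance through targeted adaptation to domain-specific reasoning demands, with advantages in both accuracy and interpretability.

Insight: The novelty is a benchmark dedicated to cognitive intrusion perception (CogRail) together with a joint fine-tuning framework that integrates position perception, movement prediction, and threat analysis. Viewed objectively, the work highlights the value of structured multi-task learning when adapting general-purpose foundation models to a specific safety-critical domain, and offers an effective route to complex spatio-temporal reasoning.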

Abstract: Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.


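[69] AquaFeat+: an Underwater Vision Learning-based Enhancement Method for Object Detection, Classification, and Tracking cs.CV

Emanuel da Costa Silva, Tatiana Taís Schein, José David García Ramos, Eduardo Lawson da Silva, Stephanie Loi Brião

TL;DR: This paper proposes AquaFeat+, a plug-and-play pipeline that enhances features for underwater vision tasks in the face of low lighting, color distortion, and turbidity. Its color correction, hierarchical feature enhancement, and adaptive residual output modules are trained end-to-end and guided directly by the loss function of the final application (such as object detection). On the FishTrack23 dataset it significantly improves object detection, classification, and tracking.

Details

Motivation: Low lighting, color distortion, and turbidity degrade underwater video quality and seriously hurt the perception modules of robotic applications. This calls for a feature enhancement method aimed at automated vision tasks rather than human perceptual quality.

Result: Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for perception tasks in underwater robotic applications.

Insight: The novelty is an end-to-end trainable feature enhancement pipeline guided directly by the loss function of the final application, which targets feature quality for automated vision tasks rather than image quality for human viewers. Its modular design (color correction, hierarchical feature enhancement, adaptive residual output) makes the underwater enhancement plug-and-play.

Abstract: Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.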


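[70] Image2Garment: Simulation-ready Garment Generation from a Single Image cs.CV

Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu

TL;DR: This paper proposes Image2Garment, a framework that generates garment models ready for physical simulation from a single image. It fine-tunes a vision-language model to infer material composition and fabric attributes from the image, then trains a lightweight predictor that maps these attributes to physical parameters. The result is a simulation-ready garment with both geometry and material properties, produced without iterative optimization.

Details

Motivation: Existing methods either require multi-view capture and expensive differentiable simulation or predict only geometry without the material properties that simulation needs. Image-to-physics datasets are also lacking, and estimating a simulation-ready garment from a single image is an ill-posed problem.

Result: The method achieves higher accuracy in material composition estimation and fabric attribute prediction. With its physics parameter estimator, it yields higher-fidelity simulations than existing SOTA image-to-garment methods.

Insight: The novelty is the use of a vision-language model to infer material attributes, combined with a small dataset of material-physics measurements for mapping those attributes to physical parameters. The paper contributes two new datasets (FTAG and T2P) and achieves feed-forward generation of simulation-ready garments without iterative optimization.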

Abstract: Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.


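[71] LiteEmbed: Adapting CLIP to Rare Classes cs.CV

Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

TL;DR: The paper proposes LiteEmbed, a lightweight framework for few-shot personalization of CLIP that adapts the model to classes rarely or never seen during pretraining without retraining its encoders. A PCA-based decomposition guides the optimization of text embeddings, and coarse alignment and fine separation objectives preserve global semantic consistency while sharpening the distinction between visually similar classes. The optimized embeddings are plug-and-play and apply directly to classification, retrieval, segmentation, and detection.

Details

Motivation: Large-scale vision-language models such as CLIP have weak zero-shot recognition on classes that are rare in pretraining, including newly emerging entities and culturally specific categories. The goal is lightweight few-shot personalization.

Result: Extensive experiments show that LiteEmbed achieves substantial gains over prior methods when adapting to underrepresented, rare, or unseen classes.

Insight: The novelty is a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations, followed by subspace-guided optimization with two complementary objectives, coarse alignment and fine separation. This improves discriminability while preserving global consistency, and the optimized embeddings are plug-and-play across tasks.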

Abstract: Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP’s vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP’s original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
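The idea of separating coarse directions from fine-grained variation can be illustrated with a toy PCA split. The sketch below is an assumption-laden illustration, not LiteEmbed's procedure: it estimates only the top principal direction by power iteration, then treats the projection onto it as the "coarse" part of an embedding and the residual as the "fine" part. All names and the 2-D toy embeddings are hypothetical.

```python
import math

def top_principal_direction(vectors, iters=200):
    """Estimate the mean and the first principal direction by power iteration."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    centered = [[v[d] - mean[d] for d in range(dim)] for v in vectors]
    w = [1.0] * dim  # initial guess; assumed not orthogonal to the top direction
    for _ in range(iters):
        # Apply the (unnormalised) covariance implicitly: C w = sum_i (x_i . w) x_i
        proj = [sum(x[d] * w[d] for d in range(dim)) for x in centered]
        w = [sum(p * x[d] for p, x in zip(proj, centered)) for d in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
    return mean, w

def split_coarse_fine(v, mean, direction):
    """Split a centered vector into its projection on direction and the residual."""
    centered = [a - m for a, m in zip(v, mean)]
    coeff = sum(c * d for c, d in zip(centered, direction))
    coarse = [coeff * d for d in direction]
    fine = [c - k for c, k in zip(centered, coarse)]
    return coarse, fine

# Hypothetical embeddings that vary mostly along the first axis.
embeddings = [[3.0, 0.1], [-3.0, -0.1], [1.0, -0.1], [-1.0, 0.1]]
mean, direction = top_principal_direction(embeddings)
coarse, fine = split_coarse_fine(embeddings[0], mean, direction)
print(direction, coarse, fine)
```

In a real setting the coarse subspace would span several principal components and the split would feed separate alignment and separation objectives; this sketch only shows the decomposition itself.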


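[72] Self-Supervised Animal Identification for Long Videos cs.CV

Xuyang Fang, Sion Hannuna, Edwin Simpson, Neill Campbell

TL;DR: This paper proposes an efficient self-supervised method for identifying individual animals in long videos. It reframes animal identification as a global clustering task rather than a sequential tracking problem and requires only bounding box detections and the total number of individuals as input. By sampling pairs of frames, using a frozen pre-trained backbone, and applying a self-bootstrapping mechanism that assigns in-batch pseudo-labels with the Hungarian algorithm, the method learns discriminative features without identity labels.

Details

Motivation: Identifying individual animals in long videos requires extensive manual annotation, and existing self-supervised methods are ill-suited to long sequences because of memory constraints and temporal error propagation.

Result: On challenging real-world datasets (3D-POP pigeon videos and feeding videos of 8 calves), the method exceeds 97% accuracy, matching or surpassing supervised baselines trained on over 1,000 labeled frames. It uses less than 1 GB of GPU memory per batch, an order of magnitude less than standard contrastive methods.

Insight: The novel elements are the reframing of animal identification as a global clustering task, the self-bootstrapping mechanism with Hungarian-algorithm pseudo-label assignment, and a Binary Cross Entropy loss adapted from vision-language models for efficient training. The method is broadly applicable in resource-constrained research settings and removes the manual annotation bottleneck.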

Abstract: Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video – a common scenario in practice – and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch – an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.
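The pseudo-label assignment step is an instance of the linear assignment problem: given a cost for matching each detection in one frame to each detection in another, find the one-to-one matching with the lowest total cost. The sketch below solves it by brute force over permutations, which is a stand-in for the Hungarian algorithm (same optimum, but only practical for small N such as the 8 calves mentioned above). The cost matrix and function name are hypothetical; how the paper derives its costs is not stated in the abstract.

```python
from itertools import permutations

def best_assignment(cost):
    """Return perm where row i is matched to column perm[i] with minimal total cost.

    Brute-force stand-in for the Hungarian algorithm: O(n!) instead of O(n^3).
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)

# Hypothetical feature distances between 3 detections in frame A (rows)
# and 3 detections in frame B (columns).
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
print(best_assignment(cost))  # [1, 0, 2]
```

For larger batches a polynomial-time solver would replace the permutation search; the returned matching is what would serve as pseudo-labels.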


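[73] SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings cs.CV

Yuchen Wu, Jiahe Li, Xiaohan Yu, Lina Yu, Jin Zheng

TL;DR: SCE-SLAM is an end-to-end monocular visual SLAM system that maintains scale consistency through scene coordinate embeddings, addressing scale drift over long sequences. It has two key modules, geometry-guided aggregation and scene coordinate bundle adjustment, and uses learned patch-level representations that encode 3D geometric relationships under a canonical scale reference to constrain scale estimation globally.

Details

Motivation: Monocular visual SLAM is widely used on resource-constrained platforms and for 3D reconstruction from internet video, but it suffers from scale drift: the estimated scale gradually diverges over long sequences. Existing frame-to-frame methods reach real-time performance through local optimization, yet they accumulate scale drift because independent windows lack global constraints.

Result: Experiments on KITTI, Waymo, and vKITTI show substantial improvements. On KITTI, the method reduces absolute trajectory error by 8.36 m relative to the best prior approach while running in real time at 36 FPS and achieving scale consistency across large-scale scenes.

Insight: The novelty is the scene coordinate embedding, a patch-level representation that encodes 3D geometric relationships under a canonical scale, together with geometry-guided aggregation and scene coordinate bundle adjustment for global scale constraints. Viewed objectively, combining learned embeddings with geometry-modulated attention propagates scale information from historical observations effectively and offers a reusable end-to-end approach to scale consistency in monocular SLAM.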

Abstract: Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.


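[74] STEP3-VL-10B Technical Report cs.CV

Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo

TL;DR: STEP3-VL-10B is a lightweight open-source multimodal foundation model that aims for frontier-level multimodal intelligence at a compact 10B-parameter scale. It relies on a unified, fully unfrozen pre-training strategy (integrating a language-aligned Perception Encoder with a Qwen3-8B decoder on 1.2T multimodal tokens) and a scaled post-training pipeline with over 1,000 reinforcement learning iterations. The model also uses Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, with a focus on scalable perceptual reasoning.

Details

Motivation: The paper addresses the trade-off of reaching frontier multimodal performance at a compact scale (10B parameters), rivaling or surpassing much larger models (for example, models with tens to hundreds of billions of parameters), and aims to give the community an efficient, powerful, and reproducible baseline.

Result: The model reaches top-tier results on several benchmarks: MMBench (92.2%), MMMU (80.11%), AIME2025 (94.43%), and MathVision (75.95%). It rivals or surpasses models 10 to 20 times larger (such as GLM-4.6V-106B and Qwen3-VL-235B) as well as top-tier proprietary flagships (such as Gemini 2.5 Pro and Seed-1.5-VL).

Insight: The novel elements are: 1) a unified, fully unfrozen pre-training strategy that integrates a language-aligned Perception Encoder with the decoder to establish intrinsic vision-language synergy; 2) a scaled post-training pipeline with a large number of reinforcement learning iterations; 3) Parallel Coordinated Reasoning (PaCoRe), which dynamically allocates compute at inference time to explore and synthesize diverse visual hypotheses and improves complex reasoning. Together these let a lightweight model compete with far larger ones.

Abstract: We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10× to 20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.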


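[75] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering cs.CV

Jieying Chen, Jeffrey Hu, Joan Lasenby, Ayush Tewari

TL;DR: This paper proposes SRENDER, an efficient method for camera-controlled video generation of static scenes. It combines a sparse diffusion model that generates keyframes with 3D reconstruction and rendering: keyframes are lifted into a 3D representation and intermediate views are rendered from it. This enforces geometric consistency, amortizes the generation cost across hundreds of frames, and yields a speedup of more than 40x over the baseline.

Details

Motivation: Existing diffusion-based video generation models are computationally inefficient and cannot meet the needs of real-time interactive applications such as embodied AI and VR/AR.

Result: When generating 20 seconds of video, SRENDER is more than 40 times faster than the diffusion-based baseline while maintaining high visual fidelity and temporal stability.

Insight: The core idea is to decompose video generation into two stages, sparse keyframe generation and 3D rendering, and to add a model that predicts the number of keyframes adaptively, allocating computation according to the complexity of the camera trajectory. This balances efficiency and quality.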

Abstract: Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.


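[76] SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3 cs.CV

Ruiqi Shen, Chang Liu, Henghui Ding

TL;DR: This paper proposes SAM3-DMS to improve SAM3 on complex multi-target video segmentation. The original SAM3 uses a collective memory selection strategy based on the average performance of all targets, which falls short in multi-target scenes. SAM3-DMS applies fine-grained, decoupled memory selection to each target object independently. It needs no additional training and improves identity preservation and tracking stability, with a larger advantage at high target density.

Details

Motivation: In multi-target video segmentation, SAM3's collective memory selection makes a synchronized decision that depends on average performance and ignores individual reliability, which leads to suboptimal results in complex multi-object scenes.

Result: Experiments show robust identity preservation and tracking stability. The advantage grows with target density, establishing a solid foundation for multi-target video segmentation in the wild.

Insight: The main novelty is a training-free decoupled memory selection strategy that manages memory independently and at fine granularity for each target, overcoming the limits of the original collective decision and improving segmentation and tracking in dense multi-target scenes.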

Abstract: Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
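The difference between a group-level decision and a per-object decision can be shown with a toy example. The sketch below is a conceptual illustration only: the threshold rule, the score dictionary, and the function names are assumptions, and SAM3's actual memory selection criteria are not described in the abstract.

```python
def collective_selection(scores, threshold=0.5):
    """One synchronized decision for all objects, based on their average score."""
    keep = sum(scores.values()) / len(scores) >= threshold
    return {obj: keep for obj in scores}

def decoupled_selection(scores, threshold=0.5):
    """An independent decision per object, based on that object's own score."""
    return {obj: score >= threshold for obj, score in scores.items()}

# Hypothetical per-object confidence for one frame; obj_3 is poorly tracked.
frame_scores = {"obj_1": 0.9, "obj_2": 0.8, "obj_3": 0.1}

print(collective_selection(frame_scores))
# {'obj_1': True, 'obj_2': True, 'obj_3': True}   (mean 0.6 passes for everyone)
print(decoupled_selection(frame_scores))
# {'obj_1': True, 'obj_2': True, 'obj_3': False}  (the unreliable object is skipped)
```

With the averaged rule, two confident objects carry an unreliable one into memory; the per-object rule filters it out, and the gap between the two rules widens as more objects share a frame.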


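[77] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning cs.CV | cs.AI | cs.LG | cs.RO

Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz

TL;DR: This paper proposes Fast-ThinkAct, an efficient vision-language-action reasoning framework that achieves compact yet performant planning through verbalizable latent planning. It learns latent chains of thought by distilling from a teacher model, driven by a preference-guided objective that aligns manipulation trajectories, and thereby transfers linguistic and visual planning capabilities to embodied control for reasoning-enhanced policy learning.

Details

Motivation: Existing vision-language-action reasoning methods rely on explicit chain-of-thought, which produces lengthy reasoning traces and high inference latency. The goal is efficient embodied reasoning and control that generalizes well.

Result: Extensive experiments across diverse embodied manipulation and reasoning benchmarks show that Fast-ThinkAct achieves strong performance with up to 89.3% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

Insight: The novelty is verbalizable latent reasoning for compact planning, combined with a preference-guided distillation objective that aligns trajectories. This transfers linguistic and visual planning capabilities to action execution and balances reasoning efficiency with performance.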

Abstract: Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.


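cs.RO

[78] Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations cs.RO | cs.AI | cs.CV

Wei-Jin Huang, Yue-Yi Zhang, Yi-Lin Wei, Zhi-Wei Xia, Juantao Tan

TL;DR: This paper proposes a complete framework for learning whole-body humanoid interaction from human-human interaction data. It has two core components: PAIR (Physics-Aware Interaction Retargeting), which converts human interaction data into physically consistent humanoid interaction data, and D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that learns collaborative behaviors with interactive understanding, beyond simple imitation, from the generated data.

Details

Motivation: Progress on physical interaction between humanoid robots and humans is limited by the scarcity of high-quality human-humanoid interaction data. The paper uses abundant human-human interaction data as a scalable alternative, but finds that standard retargeting breaks essential contacts and that conventional imitation learning policies merely mimic trajectories without interactive understanding.

Result: Extensive and rigorous simulation experiments show that the complete framework significantly outperforms baseline approaches, demonstrating that it can learn complex whole-body interactions from human-human interaction data.

Insight: The main novel elements are: 1) PAIR, a contact-centric two-stage retargeting pipeline that preserves contact semantics and physical consistency across morphology differences; 2) D-STAR, a hierarchical policy that decouples the reasoning streams for "when to act" and "where to act", using Phase Attention and a Multi-Scale Spatial module with a diffusion head to generate synchronized whole-body behaviors, which yields robust policies capable of responsive, synchronized collaboration.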

Abstract: Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking the essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data. This high-quality data, however, exposes a second failure: conventional imitation learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, Phase Attention (when) and a Multi-Scale Spatial module (where) are fused by the diffusion head to produce synchronized whole-body behaviors beyond mimicry. By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.


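[79] Multimodal Signal Processing For Thermo-Visible-Lidar Fusion In Real-time 3D Semantic Mapping cs.RO | cs.CV

Jiajun Sun, Yangyi Ou, Haoyuan Zheng, Chao Yang, Yue Ma

TL;DR: This paper proposes a novel multimodal signal processing method for real-time 3D semantic mapping that fuses thermal, visible-light, and LiDAR data. The method first fuses visible and infrared images at the pixel level, then projects real-time LiDAR point clouds onto the fused image stream, segments heat source features in the thermal channel to identify high-temperature targets immediately, and finally applies the temperature information as a semantic layer on the 3D map.

Details

Motivation: In complex environments, autonomous robot navigation and environmental perception place higher demands on SLAM. Maps need stronger semantic understanding to support specific applications such as rapid disaster assessment and industrial preventive maintenance.

Result: The resulting 3D maps have accurate geometry and carry key semantic understanding of the environment, which makes them valuable in specific application scenarios. The abstract does not report benchmarks or quantitative comparisons.

Insight: The novelty is integrating thermal information as a semantic layer in real-time 3D point cloud mapping through pixel-level multimodal (thermal, visible, LiDAR) fusion. This strengthens the map's environmental awareness, especially for identifying high-temperature targets, and gives SLAM a new route to semantic enhancement.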

Abstract: In complex environments, autonomous robot navigation and environmental perception pose higher requirements for SLAM technology. This paper presents a novel method for semantically enhancing 3D point cloud maps with thermal information. By first performing pixel-level fusion of visible and infrared images, the system projects real-time LiDAR point clouds onto this fused image stream. It then segments heat source features in the thermal channel to instantly identify high temperature targets and applies this temperature information as a semantic layer on the final 3D map. This approach generates maps that not only have accurate geometry but also possess a critical semantic understanding of the environment, making it highly valuable for specific applications like rapid disaster assessment and industrial preventive maintenance.
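The projection and labelling steps described in the abstract can be sketched with a pinhole camera model. This is a simplified illustration under several assumptions that the abstract does not confirm: points are already expressed in the camera frame (extrinsic calibration applied), lens distortion is ignored, the thermal image is a plain 2-D grid of temperatures aligned with the fused image, and a fixed threshold stands in for heat-source segmentation. All names and values are hypothetical.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to integer pixel (u, v)."""
    x, y, z = point
    if z <= 0:  # behind the camera: not visible
        return None
    return (int(round(fx * x / z + cx)), int(round(fy * y / z + cy)))

def label_hot_points(points, thermal, intrinsics, hot_threshold):
    """Attach a temperature and a hot/not-hot flag to each visible LiDAR point."""
    fx, fy, cx, cy = intrinsics
    height, width = len(thermal), len(thermal[0])
    labelled = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if 0 <= u < width and 0 <= v < height:  # keep only points inside the image
            temp = thermal[v][u]
            labelled.append((p, temp, temp >= hot_threshold))
    return labelled

# Hypothetical 3x3 thermal image (degrees Celsius) and toy intrinsics.
thermal = [[20, 20, 20],
           [20, 80, 20],
           [20, 20, 25]]
points = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
result = label_hot_points(points, thermal, (1.0, 1.0, 1.0, 1.0), hot_threshold=60)
print(result)
```

Each surviving point carries its temperature into the map, so the semantic layer is simply an extra attribute on the point cloud.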


cs.AI

[80] AviationLMM: A Large Multimodal Foundation Model for Civil Aviation cs.AI | cs.CL | cs.CV

Wenbin Li, Jingling Wu, Xiaoyong Lin, Jing Chen, Cong Chen

TL;DR: This paper presents AviationLMM, a large multimodal foundation model for civil aviation. It aims to unify the heterogeneous data streams of civil aviation (such as air-ground voice, surveillance, on-board telemetry, video, and structured text) to enable understanding, reasoning, generation, and agentic applications, and thereby improve safety, efficiency, and customer satisfaction.

Details

Motivation: Existing AI solutions in civil aviation are typically siloed and narrow, focused on a single task or modality. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams, and textual reports, which limits situational awareness, adaptability, and real-time decision support.

Result: The abstract reports no quantitative experimental results or benchmark performance. It mainly describes the design vision, the architecture, and the key research challenges to be addressed.

Insight: The core contribution is a unified multimodal foundation model framework for integrating heterogeneous civil aviation data, along with a systematic list of the research opportunities needed to realize it: data acquisition and alignment, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. The aim is an integrated, trustworthy, and privacy-preserving aviation AI ecosystem.

Abstract: Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. We first identify the gaps between existing AI solutions and requirements. Second, we describe a model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. To fully realize this vision, we identify key research opportunities, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to advance foundation models for civil aviation and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.


[81] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning cs.AI | cs.CL

Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang

TL;DR: This paper proposes Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time to improve reasoning on complex tasks without resource-intensive training.

Details

Motivation: Multi-agent systems gain robustness from diversity and cross-checking, but their reinforcement learning training is resource-intensive and unstable, with non-stationarity and sparse, high-variance rewards. A method is needed that improves reasoning at test time without tuning.

Result: On challenging benchmarks in medicine, math, and education, MATTRL improves average accuracy by 3.67% over a multi-agent baseline and by 8.67% over single-agent baselines. Ablation studies analyze how different credit-assignment schemes affect training outcomes.

Insight: The novelty is bringing reinforcement-learning-style experience construction and injection into multi-agent reasoning: the framework retrieves and integrates test-time experiences, forms a team of experts for multi-turn discussion, and reaches consensus, giving stable and efficient multi-agent reasoning that is robust to distribution shift without training.

Abstract: Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.


[82] PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records cs.AI | cs.CV | cs.HC | cs.LG

Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

TL;DR: This paper proposes the PersonalAlign task, which targets the alignment of GUI agents with users' complex implicit intents, and introduces the AndroidIntent benchmark to evaluate how well agents use long-term user records to resolve vague instructions and offer proactive suggestions. The authors further propose the Hierarchical Intent Memory Agent (HIM-Agent), which personalizes behavior by hierarchically organizing user preferences and routines in a continuously updated personal memory. Experiments show that HIM-Agent significantly improves execution and proactive performance on AndroidIntent.

Details

Motivation: Existing GUI agents perform well under explicit, complete instructions, but real-world deployment requires alignment with more complex implicit intents: resolving preferences omitted from vague instructions and anticipating latent routines from the user's state to offer proactive assistance.

Result: A range of GUI agents, including GPT-5, Qwen3-VL, and UI-TARS, are evaluated on the proposed AndroidIntent benchmark. HIM-Agent improves execution and proactive performance by 15.7% and 7.3%, respectively.

Insight: The novelty is the hierarchical implicit intent alignment task with its benchmark, together with HIM-Agent, which hierarchically organizes user preferences and routines in a continuously updated personal memory and reasons over long-term user records for personalization. This offers a new approach to implicit intent understanding and proactive service in GUI agents.

Abstract: While GUI agents have shown strong performance under explicit and complete instructions, real-world deployment requires aligning with users’ more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents’ ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS. Further results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.


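cs.IR

[83] Fine Grained Evaluation of LLMs-as-Judges cs.IR | cs.CL | cs.LG

Sourav Saha, Mandar Mitra

TL;DR: This paper studies large language models (LLMs) as "judges" for relevance assessment in information retrieval (IR). It extends existing work by using a Wikipedia-based test collection created by INEX to evaluate LLM judgments at both the document level and the passage level. The study finds that LLMs-as-judges work best under human supervision.

Details

Motivation: The motivation is to extend research on LLMs as judges, specifically for the standard ad hoc task in information retrieval: beyond assessing document-level relevance, the study quantifies the accuracy of LLM judgments at the passage level to check whether they are "right for the right reasons".

Result: Using the INEX Wikipedia test collection, the study prompts LLMs to judge document relevance and highlight relevant passages, then compares the output with the human assessors' annotations. The results indicate that LLMs-as-judges work best under human supervision. No specific quantitative metrics (such as accuracy or F1) or comparisons with SOTA are reported.

Insight: The novelty is refining the evaluation of LLM judges from the document level to the passage level, using highlighted passages to assess whether the judgments are well founded. Viewed objectively, this finer-grained view helps clarify the limits and potential of LLMs as assessors and underlines the importance of human supervision.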

Abstract: A good deal of recent research has focused on how Large Language Models (LLMs) may be used as 'judges' in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia based test collection created by the INEX initiative, and prompt LLMs not only to judge whether documents are relevant / non-relevant, but also to highlight relevant passages in documents that they regard as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but to also quantify how often these 'judges' are right for the right reasons. Our findings suggest that LLMs-as-judges work best under human supervision.
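One way to quantify whether a judge is "right for the right reasons" is to compare the passages it highlights with the passages human assessors highlighted. The sketch below computes character-level precision and recall over highlighted spans. It is an assumed formulation for illustration; the abstract does not say which passage-level measure the authors use, and the function names and span offsets are hypothetical.

```python
def span_chars(spans):
    """Expand (start, end) spans, end exclusive, into a set of character offsets."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars

def highlight_precision_recall(llm_spans, human_spans):
    """Character-level precision and recall of LLM highlights against human ones."""
    llm, human = span_chars(llm_spans), span_chars(human_spans)
    overlap = len(llm & human)
    precision = overlap / len(llm) if llm else 0.0
    recall = overlap / len(human) if human else 0.0
    return precision, recall

# Hypothetical document: the LLM highlights characters 0-9, humans 5-24.
print(highlight_precision_recall([(0, 10)], [(5, 25)]))  # (0.5, 0.25)
```

A judge that labels a document relevant but highlights text with low overlap would score well at the document level and poorly here, which is the distinction the study draws.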


Pei-Chi Lo, Thomas Y. Lu

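TL;DR: This study proposes a discourse-based LLM methodology that combines Rhetorical Structure Theory (RST) with an agentic workflow to extract and quantify previously opaque reasoning patterns from U.S. copyright damages rulings. It parses judicial opinions into hierarchical discourse structures and uses a three-stage pipeline (Dataset Construction, Discourse Analysis, and Agentic Feature Extraction) to identify reasoning components and extract feature labels with their corresponding discourse subtrees.

Details

Motivation: To address inconsistency in judicial reasoning about copyright damages: although federal courts follow the 1976 Copyright Act, interpretations of the law and the weighting of factors vary widely across jurisdictions, which makes litigation outcomes unpredictable and obscures the empirical basis of legal decisions.

Result: In analyzing copyright damages rulings, discourse-augmented LLM analysis outperforms traditional methods while uncovering previously unquantified variations in factor weighting across circuits.

Insight: The novelty lies in integrating RST with an agentic workflow in an LLM pipeline, parsing legal text into hierarchical discourse structures in order to quantify judicial reasoning patterns. The approach offers a new methodological framework for computational legal analysis and may extend to studying inconsistency in other areas of law.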

Abstract: Judicial reasoning in copyright damage awards poses a core challenge for computational legal analysis. Although federal courts follow the 1976 Copyright Act, their interpretations and factor weightings vary widely across jurisdictions. This inconsistency creates unpredictability for litigants and obscures the empirical basis of legal decisions. This research introduces a novel discourse-based Large Language Model (LLM) methodology that integrates Rhetorical Structure Theory (RST) with an agentic workflow to extract and quantify previously opaque reasoning patterns from judicial opinions. Our framework addresses a major gap in empirical legal scholarship by parsing opinions into hierarchical discourse structures and using a three-stage pipeline, i.e., Dataset Construction, Discourse Analysis, and Agentic Feature Extraction. This pipeline identifies reasoning components and extracts feature labels with corresponding discourse subtrees. In analyzing copyright damage rulings, we show that discourse-augmented LLM analysis outperforms traditional methods while uncovering unquantified variations in factor weighting across circuits. These findings offer both methodological advances in computational legal analysis and practical insights into judicial reasoning, with implications for legal practitioners seeking predictive tools, scholars studying legal principle application, and policymakers confronting inconsistencies in copyright law.


cs.SD [Back]

[85] SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing cs.SD | cs.CL | cs.MM [PDF]

Ziyang Ma, Guanrou Yang, Wenxi Chen, Zhifu Gao, Yexing Du

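TL;DR: This paper presents SLAM-LLM, an open-source, modular multimodal large language model (MLLM) framework focused on speech, language, audio, and music processing. It addresses the fact that existing MLLM frameworks mainly target the vision modality and offer limited support for audio-related modalities, and it provides modular components, training recipes, and pretrained models to accelerate research and development of audio-language models.

Details

Motivation: Current open-source MLLM frameworks such as LLaVA mainly target vision and provide limited in-depth support for speech, audio, and music. This hinders the development of audio-language models and forces researchers to spend substantial effort on writing code and tuning hyperparameters.

Result: The framework provides detailed training recipes and high-performance checkpoints for mainstream tasks such as LLM-based automatic speech recognition (ASR), automated audio captioning (AAC), and music captioning (MC). Some of these recipes have reached or are nearing state-of-the-art (SOTA) performance, and some of the underlying techniques have been accepted in academic papers.

Insight: The main contribution is a modular, open-source MLLM framework built specifically for audio-related modalities. It integrates different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins, and ships validated recipes, with the aim of lowering the barrier to research and encouraging community collaboration.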

Abstract: The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for speech, audio, and music. This situation hinders the development of audio-language models and forces researchers to spend a lot of effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes, along with high-performance checkpoints, for mainstream tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been accepted by academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.


[86] Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception cs.SD | cs.AI | cs.CL | cs.MA | eess.AS [PDF]

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad

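TL;DR: This paper proposes Speech-Hands, a voice-agentic framework built around learning one omni-understanding skill: deciding when to trust the model's own reasoning and when to consult external audio perception. The framework recasts the problem as an explicit self-reflection decision, which effectively prevents the model from being misled by flawed external candidate hypotheses.

Details

Motivation: The work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, because the model is misled by noisy hypotheses. A mechanism is therefore needed to explicitly manage when to rely on internal judgment and when to seek external help.

Result: On seven benchmarks from the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines, with word error rate (WER) reduced by 12.1% on average. On audio QA decisions, the model reaches 77.37% accuracy and a high F1 score, showing robust generalization and reliability across diverse audio question answering datasets.

Insight: The main novelty is recasting omni audio understanding as a learnable self-reflection decision problem and introducing an "agentic action mechanism". This offers a practical path toward unifying perception and decision-making and building more reliable and resilient audio intelligence; the mechanism generalizes naturally from speech recognition to complex multiple-choice audio reasoning.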

Abstract: We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
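
Since the headline ASR result is stated in terms of WER, a minimal sketch of the standard word error rate may help: it is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; this is not the OpenASR leaderboard's evaluation code, which also applies text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```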


eess.IV [Back]

[87] POWDR: Pathology-preserving Outpainting with Wavelet Diffusion for 3D MRI eess.IV | cs.CV [PDF]

Fei Tan, Ashok Vardhan Addala, Bruno Astuto Arouche Nunes, Xucheng Zhu, Ravi Soni

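TL;DR: This paper proposes POWDR, a pathology-preserving outpainting framework for 3D MRI data augmentation. Built on a conditioned wavelet diffusion model, the method retains real lesion regions while generating anatomically plausible surrounding tissue, increasing data diversity without fabricating lesions.

Details

Motivation: Medical imaging datasets often suffer from class imbalance and a limited supply of pathology-rich cases, which constrains machine learning models for segmentation, classification, and vision-language tasks. POWDR targets this data scarcity and class imbalance.

Result: The method is evaluated on brain MRI (BraTS datasets) and knee MRI. Quantitative metrics (FID, SSIM, LPIPS) confirm image realism. Random-mask training markedly improves diversity (cosine similarity falls from 0.9947 to 0.9580; KL divergence rises from 0.00026 to 0.01494). For tumor segmentation with nnU-Net, adding 50 synthetic cases raises the Dice score from 0.6992 to 0.7137. Tissue volume analysis shows no significant difference from real images for CSF and GM.

Insight: The novelty is a pathology-preserving outpainting framework based on a conditioned wavelet diffusion model: wavelet-domain conditioning enhances high-frequency detail and reduces blurring, and a random connected mask training strategy prevents conditioning-induced collapse and improves diversity outside the lesion. The method is tissue-agnostic and offers a controllable way to generate diverse, pathology-preserving synthetic data.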

Abstract: Medical imaging datasets often suffer from class imbalance and limited availability of pathology-rich cases, which constrains the performance of machine learning models for segmentation, classification, and vision-language tasks. To address this challenge, we propose POWDR, a pathology-preserving outpainting framework for 3D MRI based on a conditioned wavelet diffusion model. Unlike conventional augmentation or unconditional synthesis, POWDR retains real pathological regions while generating anatomically plausible surrounding tissue, enabling diversity without fabricating lesions. Our approach leverages wavelet-domain conditioning to enhance high-frequency detail and mitigate blurring common in latent diffusion models. We introduce a random connected mask training strategy to overcome conditioning-induced collapse and improve diversity outside the lesion. POWDR is evaluated on brain MRI using BraTS datasets and extended to knee MRI to demonstrate tissue-agnostic applicability. Quantitative metrics (FID, SSIM, LPIPS) confirm image realism, while diversity analysis shows significant improvement with random-mask training (cosine similarity reduced from 0.9947 to 0.9580; KL divergence increased from 0.00026 to 0.01494). Clinically relevant assessments reveal gains in tumor segmentation performance using nnU-Net, with Dice scores improving from 0.6992 to 0.7137 when adding 50 synthetic cases. Tissue volume analysis indicates no significant differences for CSF and GM compared to real images. These findings highlight POWDR as a practical solution for addressing data scarcity and class imbalance in medical imaging. The method is extensible to multiple anatomies and offers a controllable framework for generating diverse, pathology-preserving synthetic data to support robust model development.
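
The segmentation gain above is reported as a Dice score. As a reminder of what that number measures, here is a minimal sketch of the standard Dice coefficient for binary masks, 2|A ∩ B| / (|A| + |B|). The function name is hypothetical, and the sketch assumes a 3D volume has already been flattened to a 1D sequence of 0/1 voxels; it is not nnU-Net's evaluation code.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two flattened binary masks of equal length."""
    pred = [bool(v) for v in pred_mask]
    true = [bool(v) for v in true_mask]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One overlapping voxel out of two predicted and two true voxels.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```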


[88] Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis eess.IV | cs.AI | cs.CV [PDF]

Fuyao Chen, Yuexi Du, Elèonore V. Lieffrig, Nicha C. Dvornek, John A. Onofrey

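TL;DR: This paper proposes Equi-ViT, a rotation-equivariant Vision Transformer that improves the robustness of histopathology image analysis. It integrates an equivariant convolution kernel into the ViT patch embedding stage, giving the model built-in equivariance to common transformations such as image rotations and reflections, which improves data efficiency and classification stability.

Details

Motivation: Standard ViTs are not equivariant to transformations such as rotations and reflections, which are ubiquitous in histopathology imaging; this limits the robustness and generalization of models that rely on global context modeling.

Result: Experiments on a public colorectal cancer dataset show that Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations, with improved data efficiency and robustness.

Insight: Integrating an equivariant convolution kernel into the ViT patch embedding stage gives the Transformer architecture built-in rotational equivariance, which suggests a potential backbone for more generalizable digital pathology foundation models.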

Abstract: Vision Transformers (ViTs) have gained rapid adoption in computational pathology for their ability to model long-range dependencies through self-attention, addressing the limitations of convolutional neural networks that excel at local pattern capture but struggle with global contextual reasoning. Recent pathology-specific foundation models have further advanced performance by leveraging large-scale pretraining. However, standard ViTs remain inherently non-equivariant to transformations such as rotations and reflections, which are ubiquitous variations in histopathology imaging. To address this limitation, we propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture, imparting built-in rotational equivariance to learned representations. Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations. Our results on a public colorectal cancer dataset demonstrate that incorporating equivariant patch embedding enhances data efficiency and robustness, suggesting that equivariant transformers could potentially serve as more generalizable backbones for the application of ViT in histopathology, such as digital pathology foundation models.
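
To make "rotational equivariance" of a patch embedding concrete, the toy sketch below applies one kernel in all four 90-degree orientations (the group C4) to a single patch. Rotating the patch only cyclically shifts the four responses, so pooling over them gives an orientation-independent value. This is a simplified stand-in chosen for illustration, not the equivariant kernel used in Equi-ViT, and all names are hypothetical.

```python
def rot90(matrix):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*matrix)][::-1]


def inner(a, b):
    """Element-wise inner product of two equally sized 2D lists."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def c4_responses(kernel, patch):
    """Responses of the patch to the kernel in each of its four orientations."""
    responses = []
    oriented = kernel
    for _ in range(4):
        responses.append(inner(oriented, patch))
        oriented = rot90(oriented)
    return responses


kernel = [[1, 2], [3, 4]]
patch = [[1, 0], [0, 0]]
original = c4_responses(kernel, patch)
rotated = c4_responses(kernel, rot90(patch))
# Equivariance: rotating the input shifts the orientation channels by one.
print(original, rotated)
# Invariance after pooling over the orientation channels.
print(max(original) == max(rotated))
```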


cs.LG [Back]

[89] Spectral Generative Flow Models: A Physics-Inspired Replacement for Vectorized Large Language Models cs.LG | cs.CL [PDF]

Andrew Kiruluta

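TL;DR: This paper proposes Spectral Generative Flow Models (SGFMs), a physics-inspired generative model intended as an alternative to Transformer-based large language models. It models text or video generation as the evolution of a continuous field under constrained stochastic dynamics, represented in a multiscale wavelet basis, and thereby replaces global attention with local operators, spectral projections, and Navier-Stokes-like transport.

Details

Motivation: To overcome the limitations of conventional Transformer models in long-range coherence, computational efficiency, and physically structured inductive bias, and to give next-generation generative models a new generative mechanism grounded in continuity, geometry, and physical structure.

Result: The abstract reports no specific quantitative experimental results or benchmark comparisons.

Insight: The contributions are: 1) a field-theoretic ontology that unifies text and video as trajectories of a stochastic partial differential equation; 2) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and 3) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together these give an architectural path fundamentally different from autoregressive and diffusion models.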

Abstract: We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier–Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure. Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.


[90] MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting cs.LG | cs.AI | cs.CL | cs.IR [PDF]

Kangda Wei, Ruihong Huang

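TL;DR: This paper proposes MMR-GRPO, which accelerates GRPO-style training through reward reweighting based on Maximal Marginal Relevance (MMR). The core idea is to redistribute rewards according to the diversity of completions, reducing the contribution of semantically redundant completions to the training signal, which substantially cuts training steps and wall-clock time while maintaining peak performance.

Details

Motivation: GRPO has become a standard approach for training mathematical reasoning models, but its reliance on generating multiple completions per prompt makes training computationally expensive. Recent work has reduced the number of training steps needed to reach peak performance, yet higher per-step cost means total training time stays flat or even increases.

Result: Extensive evaluation across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks shows that MMR-GRPO reaches comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time; the gains are consistent across models, methods, and benchmarks.

Insight: The novelty is bringing MMR into GRPO training: diversity-aware reward reweighting prioritizes diverse solutions, which provide more informative updates and accelerate convergence. The finding is that semantically redundant completions add limited marginal learning signal, so emphasizing diversity uses compute more efficiently.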

Abstract: Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
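
The abstract does not give the reweighting formula, so the following is only a hedged sketch of how an MMR-style criterion could down-weight redundant completions within one prompt's group. Completions are visited in greedy MMR order (reward traded off against similarity to those already selected), and each reward is scaled by one minus its maximum cosine similarity to the earlier picks. The weighting rule, the `lam` trade-off parameter, and the use of precomputed embeddings are all assumptions, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_reweight(rewards, embeddings, lam=0.5):
    """Scale each reward by its novelty relative to earlier MMR selections."""
    remaining = list(range(len(rewards)))
    selected = []
    weights = [0.0] * len(rewards)

    def redundancy(i):
        # Highest similarity to any completion that was already selected.
        return max((cosine(embeddings[i], embeddings[j]) for j in selected), default=0.0)

    while remaining:
        # Greedy MMR step: high reward, low similarity to what is already chosen.
        best = max(remaining, key=lambda i: lam * rewards[i] - (1 - lam) * redundancy(i))
        weights[best] = 1.0 - max(0.0, min(1.0, redundancy(best)))
        selected.append(best)
        remaining.remove(best)
    return [r * w for r, w in zip(rewards, weights)]


# Completions 0 and 1 are identical in embedding space; completion 2 is different.
print(mmr_reweight([1.0, 1.0, 1.0], [[1, 0], [1, 0], [0, 1]]))
```

In this toy group the duplicate completion receives zero weight, which matches the stated intuition that semantically redundant completions contribute little additional learning signal.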


[91] Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning cs.LG | cs.CL [PDF]

Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan

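TL;DR: This paper presents DASD-4B-Thinking, a lightweight, high-performing, fully open-source long chain-of-thought (long-CoT) reasoning model. It achieves SOTA performance among open-source models of comparable scale on challenging benchmarks in mathematics, scientific reasoning, and code generation, and even outperforms some larger models. Its core contribution is to reexamine and improve the widely adopted sequence-level knowledge distillation paradigm, proposing an enhanced distillation training pipeline that addresses key limitations of existing methods in representing the teacher distribution, aligning teacher and student capacity, and exposure bias.

Details

Motivation: The widely adopted paradigm of SFT on teacher-generated responses (sequence-level distillation) is efficient and empirically strong, but it is framed mainly from the SFT perspective: it concentrates on heuristic rules for data filtering and overlooks the core principle of distillation, which is to let the student learn the teacher's full output distribution and thereby inherit its generalization capability. Specifically, existing methods have three key limitations: inadequate representation of the teacher's sequence-level distribution, a mismatch between the teacher's output distribution and the student's learning capacity, and exposure bias between teacher-forced training and autoregressive inference. Together these reflect a lack of explicit teacher-student interaction during distillation.

Result: DASD-4B-Thinking achieves SOTA performance among open-source models of comparable scale on challenging benchmarks in mathematics, scientific reasoning, and code generation, and even outperforms some larger models. Notably, it obtains competitive results with only 448K training samples, an order of magnitude fewer than most existing open-source efforts use.

Insight: The main contribution is to start from the essence of distillation, systematically identify the three core limitations of the current sequence-level distillation paradigm, and address them with an enhanced distillation training pipeline. The takeaways are the importance of explicitly modeling teacher-student interaction, aligning the output distribution with the student's learning capacity, and mitigating the train-inference mismatch, along with evidence that methodological improvements can deliver high performance from very little training data, which is instructive for data-efficient training.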

Abstract: In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation – even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself – enabling the student model to learn the teacher’s full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher’s sequence-level distribution; ii) Misalignment between the teacher’s output distribution and the student’s learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples – an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.


[92] GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization cs.LG | cs.AI | cs.CL [PDF]

Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng

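TL;DR: This paper proposes GIFT (Gibbs Initialization with Finite Temperature), a method that addresses the optimization mismatch between supervised fine-tuning (SFT) and reinforcement learning (RL) in the post-training of large reasoning models. GIFT models supervision as a finite-temperature energy potential, building a distributional bridge that keeps the objective consistent across the whole post-training pipeline and gives RL a better starting point.

Details

Motivation: The prevailing post-training paradigm for large reasoning models (SFT followed by RL) has an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, which exhausts the exploration space that subsequent RL needs.

Result: Experiments show that GIFT significantly outperforms standard SFT and other competitive baselines when used for RL initialization, offering a mathematically principled path toward global optimality in post-training.

Insight: The core novelty is recharacterizing standard SFT as a degenerate zero-temperature limit and proposing a unified finite-temperature Gibbs initialization framework. In theory this guarantees objective consistency, avoids distributional collapse, and preserves the exploration space RL needs, making it a principled approach to post-training initialization.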

Abstract: The prevailing post-training paradigm for Large Reasoning Models (LRMs)–Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)–suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
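
One way to read the zero-temperature claim, offered as a sketch under assumptions rather than the paper's own formulation: suppose the initialization target is a Gibbs distribution that tilts the base policy by a supervision energy. The symbols \pi_{\mathrm{base}}, E, T, and Z_T below are illustrative and are not taken from the abstract.

```latex
\pi_T(y \mid x)
  = \frac{\pi_{\mathrm{base}}(y \mid x)\,\exp\!\big(-E(x, y)/T\big)}{Z_T(x)},
\qquad
Z_T(x) = \sum_{y'} \pi_{\mathrm{base}}(y' \mid x)\,\exp\!\big(-E(x, y')/T\big)

% T \to 0^{+}:   mass concentrates on \arg\min_y E(x, y), whatever the base prior says
%                (the collapsed, SFT-like limit).
% T \to \infty:  \pi_T \to \pi_{\mathrm{base}} (supervision is ignored).
% 0 < T < \infty: supervision reshapes the base policy without erasing it.
```

Under this reading, a finite temperature keeps probability mass on responses the base model already favors, which is the exploration space the abstract says subsequent RL needs.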