Table of Contents
- cs.CL [Total: 55]
- cs.CV [Total: 158]
- cs.NI [Total: 1]
- cs.AI [Total: 11]
- cs.SD [Total: 2]
- cs.CR [Total: 2]
- cs.CY [Total: 1]
- q-bio.QM [Total: 1]
- eess.IV [Total: 7]
- cs.RO [Total: 6]
- cs.LG [Total: 9]
- stat.ML [Total: 1]
- eess.AS [Total: 1]
- q-bio.NC [Total: 2]
- cs.HC [Total: 2]
cs.CL
[1] Retracing the Past: LLMs Emit Training Data When They Get Lost
Myeongseob Ko,Nikhil Reddy Billa,Adam Nguyen,Charles Fleming,Ming Jin,Ruoxi Jia
Main category: cs.CL
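TL;DR: This paper proposes Confusion-Inducing Attacks (CIA), a systematic framework that extracts memorized training data from LLMs by maximizing model uncertainty, and shows that it outperforms existing methods.
Details
Motivation: Memorization of training data by large language models (LLMs) raises privacy and copyright concerns, while existing data extraction methods have limited success and offer little insight into what drives memorization leakage.
Contribution: Proposes the CIA framework, which extracts memorized data by inducing high-entropy states in the model; for aligned LLMs, proposes Mismatched Supervised Fine-tuning (SFT) to increase the model's susceptibility to the attack.
Method: CIA optimizes input snippets to induce a high-entropy state in the model; mismatched SFT weakens alignment and induces targeted confusion.
Result: Experiments show that CIA outperforms existing baselines on both unaligned and aligned LLMs, extracting more verbatim or near-verbatim training data.
Insight: Memorization leakage persists across a variety of LLMs, and CIA offers a more systematic way to assess these vulnerabilities.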
Abstract: The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
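The entropy signal described in this abstract can be illustrated with a small sketch. It computes the Shannon entropy of each next-token distribution and finds the first run of consecutive high-entropy steps. The threshold, the run length, and the toy distributions are assumptions chosen for illustration; the paper's actual detection and optimization procedure is not given in this digest.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_sustained_spike(entropies, threshold, min_run):
    """Start index of the first run of at least `min_run` consecutive
    entropies above `threshold`, or None if no such run exists."""
    run_start = None
    for i, h in enumerate(entropies):
        if h > threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None
    return None

# Toy distributions over a 4-token vocabulary (hypothetical values).
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
steps = [confident, uniform, confident, uniform, uniform, uniform, confident]

entropies = [token_entropy(p) for p in steps]
spike_start = first_sustained_spike(entropies, threshold=1.0, min_run=3)
```

The isolated high-entropy step at index 1 is ignored because it does not persist; only the run starting at index 3 counts as sustained.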
[2] MCP4IFC: IFC-Based Building Design Using Large Language Models
Bharathi Kannan Nithyanantham,Tobias Sesterhenn,Ashwin Nedungadi,Sergio Peral Garijo,Janis Zenkner,Christian Bartelt,Stefan Lüdtke
Main category: cs.CL
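TL;DR: MCP4IFC is an open-source framework that lets LLMs manipulate IFC data directly, combining the Model Context Protocol (MCP) with BIM tools to turn natural language instructions into building-design actions.
Details
Motivation: Bringing generative AI into architecture, engineering and construction (AEC) requires translating natural language instructions into operations on standardized data models.
Contribution: Presents MCP4IFC, a framework that lets LLMs manipulate IFC data directly and provides a toolset plus a dynamic code-generation system.
Method: Uses the Model Context Protocol (MCP) and retrieval-augmented generation (RAG), together with scene querying tools and predefined functions.
Result: Experiments show that an LLM can complete complex tasks, from building a simple house to querying and editing existing IFC data.
Insight: The open-source framework provides a foundation for LLM-driven BIM design and AI-assisted modeling workflows.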
Abstract: Bringing generative AI into the architecture, engineering and construction (AEC) field requires systems that can translate natural language instructions into actions on standardized data models. We present MCP4IFC, a comprehensive open-source framework that enables Large Language Models (LLMs) to directly manipulate Industry Foundation Classes (IFC) data through the Model Context Protocol (MCP). The framework provides a set of BIM tools, including scene querying tools for information retrieval, predefined functions for creating and modifying common building elements, and a dynamic code-generation system that combines in-context learning with retrieval-augmented generation (RAG) to handle tasks beyond the predefined toolset. Experiments demonstrate that an LLM using our framework can successfully perform complex tasks, from building a simple house to querying and editing existing IFC data. Our framework is released as open-source to encourage research in LLM-driven BIM design and provide a foundation for AI-assisted modeling workflows. Our code is available at https://show2instruct.github.io/mcp4ifc/.
[3] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
Kunxi Li,Yufan Xiong,Zhonghua Jiang,Yiyun Zhou,Zhaode Wang,Chengfei Lv,Shengyu Zhang
Main category: cs.CL
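TL;DR: FlowMM is a KV cache merging framework guided by cross-modal information flow; it addresses the limitations of KV merging in multimodal settings and substantially reduces memory use and latency.
Details
Motivation: Traditional KV cache eviction strategies can degrade generation quality. Recent work has shifted to KV merging, but in multimodal scenarios it remains limited by distributional biases across modalities and attentional biases in cross-modal interactions.
Contribution: FlowMM uses cross-modal information flow to adjust merging strategies dynamically and introduces a sensitivity-adaptive token matching mechanism, improving efficiency while maintaining task performance.
Method: FlowMM applies layer-specific merging strategies guided by cross-modal information flow, and matches tokens adaptively by jointly considering token similarity and task-critical sensitivity.
Result: Experiments show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x while maintaining task performance.
Insight: In multimodal scenarios, adjusting merging strategies dynamically and accounting for task sensitivity is key, and FlowMM offers an effective way to do so.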
Abstract: Traditional KV cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, causing context loss or hallucinations. Recent efforts shift toward KV merging, merging eviction tokens with retention tokens based on similarity. However, in multimodal scenarios, distributional biases across modality tokens and attentional biases in cross-modal interactions limit its effectiveness. This work introduces FlowMM, an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. FlowMM leverages cross-modal information flow to dynamically apply layer-specific merging strategies, capturing modality-specific patterns while preserving contextual integrity. Furthermore, we introduce a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity, merging low-risk tokens while safeguarding high-sensitivity ones. Extensive experiments across diverse leading MLLMs show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x, while maintaining competitive task performance.
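As a rough illustration of similarity-based KV merging with a sensitivity guard, the sketch below folds each evicted token into its most similar retained token and never merges tokens whose sensitivity exceeds a cutoff. Plain averaging, cosine similarity on keys, the `max_sensitivity` cutoff, and all toy values are assumptions made for this example; they are not FlowMM's actual merging rule or its layer-specific strategy.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_kv(keys, values, keep, sensitivity, max_sensitivity):
    """Merge every evicted token into its most similar retained token.

    Tokens whose sensitivity exceeds `max_sensitivity` (a hypothetical
    cutoff) are added to the retained set so they are never merged away.
    Returns {retained_index: (merged_key, merged_value)}.
    """
    retained = set(keep)
    retained |= {i for i, s in enumerate(sensitivity) if s > max_sensitivity}
    groups = {j: [j] for j in retained}
    for i in range(len(keys)):
        if i in retained:
            continue
        best = max(sorted(retained), key=lambda j: cosine(keys[i], keys[j]))
        groups[best].append(i)
    return {
        j: (mean([keys[i] for i in g]), mean([values[i] for i in g]))
        for j, g in groups.items()
    }

# Toy cache with four tokens (hypothetical values).
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [3.0], [10.0], [20.0]]
merged = merge_kv(keys, values, keep=[0, 2],
                  sensitivity=[0.0, 0.0, 0.0, 0.9], max_sensitivity=0.5)
```

Here token 1 is merged into token 0, while token 3 survives unmerged because its sensitivity is above the cutoff, so the cache shrinks from four entries to three.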
[4] Future of AI Models: A Computational perspective on Model collapse
Trivikram Satharasi,S Sitharama Iyengar
Main category: cs.CL
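TL;DR: The paper examines model collapse caused by recursively training AI models (especially large language models) and, by analyzing changes in semantic similarity across Wikipedia, quantifies the threat that synthetic content poses to data diversity and model generalization.
Details
Motivation: As AI-generated content spreads rapidly across the web, recursive training may erode linguistic and semantic diversity (model collapse), threatening model generalization and data richness.
Contribution: 1) Quantifies the effect of synthetic content on data diversity; 2) forecasts the onset of model collapse; 3) provides an empirical analysis based on historical data.
Method: Uses Transformer embeddings and cosine similarity to analyze year-wise semantic similarity in English-language Wikipedia from 2013 to 2025.
Result: Semantic similarity rises exponentially after public LLM adoption, whereas the earlier influence of RNN/LSTM pipelines was modest.
Insight: Over-reliance on synthetic data accelerates model collapse; protecting data diversity is needed to sustain long-term model performance.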
Abstract: Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019). As synthetic content dominates, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though modest due to a smaller scale. Observed fluctuations reflect irreducible linguistic diversity, variable corpus size across years, finite sampling error, and an exponential rise in similarity after the public adoption of LLM models. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.
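A year-wise similarity measurement of the kind described here could be sketched as follows: for each year, average the cosine similarity over all pairs of document embeddings. The two-dimensional vectors and the two sample years are placeholders; the study's actual embedding model, sampling scheme, and aggregation are not specified in this digest, so this is only one plausible reading.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical document embeddings grouped by year.
corpus = {
    2013: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # diverse directions
    2025: [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]],   # nearly identical directions
}
trend = {year: mean_pairwise_similarity(vecs)
         for year, vecs in sorted(corpus.items())}
```

A rising value of `trend` over the years would indicate that documents are becoming more alike, which is the homogenization signal the abstract associates with synthetic content.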
[5] Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla,Alex Oesterling,Claudio Mayrink Verdun,Himabindu Lakkaraju,Flavio P. Calmon
Main category: cs.CL
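TL;DR: The paper proposes Temporal Sparse Autoencoders (T-SAEs), which improve sparse autoencoders (SAEs) by exploiting the sequential nature of language so that they better capture semantic information in language models.
Details
Motivation: Existing SAEs capture high-level semantic features poorly and tend toward shallow or noisy features. The authors attribute this to training methods that ignore the long-range dependencies and smoothness of language, and propose a method that incorporates linguistic structure.
Contribution: Proposes T-SAEs, which add a contrastive loss encouraging consistent activations of high-level features across adjacent tokens, better separating semantic from syntactic features without supervision.
Method: T-SAEs add a contrastive loss to SAEs that encourages consistent feature activations across adjacent tokens, using the sequential smoothness of language to improve semantic feature extraction.
Result: Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality.
Insight: Semantic information has long-range dependencies and is smooth over a sequence, so a contrastive loss can capture higher-level semantic concepts without supervision.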
Abstract: Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as “the phrase ‘The’ at the start of sentences”. In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
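One generic way to encourage consistent feature activations over adjacent tokens is an InfoNCE-style contrastive loss, where the next token's features are the positive and features from unrelated tokens are negatives. The sketch below is that generic form only; the exact loss used by T-SAEs is not given in this digest, and the temperature and toy feature vectors are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-probability of picking `positive`
    among `positive` and `negatives`, scored by dot product with `anchor`."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# Hypothetical high-level feature activations for three tokens.
token_t = [1.0, 0.0]
token_t_plus_1 = [1.0, 0.0]   # adjacent token with the same active feature
unrelated = [0.0, 1.0]        # token from a different context

consistent = info_nce(token_t, token_t_plus_1, [unrelated])
inconsistent = info_nce(token_t, unrelated, [token_t_plus_1])
```

The loss is small when adjacent tokens share active features and large when they do not, so minimizing it pushes those features to vary smoothly along the sequence.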
[6] OckBench: Measuring the Efficiency of LLM Reasoning
Zheng Du,Hao Kang,Song Han,Tushar Krishna,Ligeng Zhu
Main category: cs.CL
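TL;DR: OckBench is a new benchmark that evaluates both the accuracy and the decoding token efficiency of large language models (LLMs) on reasoning and code generation tasks; it shows that many models with comparable accuracy differ greatly in token consumption.
Details
Motivation: Existing benchmarks focus on output quality and accuracy and overlook how token efficiency affects latency, cost, and energy consumption. OckBench aims to fill this gap.
Contribution: Proposes OckBench, a model-agnostic and hardware-agnostic benchmark that treats token efficiency as a key evaluation metric and shows experimentally that models with comparable accuracy differ markedly in token consumption.
Method: Builds a unified benchmark platform that combines accuracy tasks with token counts, compares multiple open- and closed-source models, and plots Pareto frontiers over the accuracy-efficiency plane.
Result: Many models with similar accuracy differ markedly in token consumption, showing that token efficiency is a neglected but important axis of differentiation.
Insight: Tokens should not be treated as free to multiply; evaluation should consider both accuracy and efficiency, and OckBench offers a unified platform for research on token-efficient reasoning.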
Abstract: Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we uncover that many models with comparable accuracy differ wildly in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as “free” to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/ .
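The accuracy-efficiency Pareto frontier mentioned in this abstract can be computed with a few lines. In the sketch below, a model is on the frontier if no other model is at least as accurate with at most as many tokens and strictly better on one of the two. The model names and numbers are invented for illustration and are not OckBench results.

```python
def pareto_frontier(results):
    """Names of models not dominated on (higher accuracy, fewer tokens).

    `results` is a list of (name, accuracy, tokens) tuples.
    """
    frontier = []
    for name, acc, tok in results:
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for _, a, t in results
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical benchmark results: (model, accuracy, decoded tokens).
results = [
    ("model_a", 0.80, 12_000),
    ("model_b", 0.80, 45_000),   # same accuracy as model_a, far more tokens
    ("model_c", 0.85, 30_000),
    ("model_d", 0.70, 20_000),   # less accurate and costlier than model_a
]
frontier = pareto_frontier(results)
```

In this toy data, `model_b` matches `model_a` on accuracy but falls off the frontier purely because of token consumption, which is the kind of difference an accuracy-only leaderboard would hide.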
[7] In-Context Learning Without Copying
Kerem Sahin,Sheridan Feucht,Adam Belfki,Jannik Brinkmann,Aaron Mueller,David Bau,Chris Wendler
Main category: cs.CL
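TL;DR: The paper asks whether suppressing inductive copying harms the in-context learning (ICL) ability of Transformer models. The authors propose Hapax, which omits the loss contribution of tokens that induction heads can predict correctly; experiments show that performance on abstractive ICL tasks stays comparable to, and on some tasks exceeds, the baseline.
Details
Motivation: To test whether inductive copying is a prerequisite for in-context learning, and whether models can still acquire abstractive ICL abilities when this mechanism is suppressed.
Contribution: Proposes Hapax and shows that inductive copying is not necessary for abstractive ICL; models still learn effectively with inductive copying suppressed and perform better on some tasks.
Method: Trains models with the loss contribution of induction-head-predictable tokens omitted (Hapax) to reduce reliance on inductive copying, and analyzes how the induction heads change and how this affects ICL.
Result: The Hapax model outperforms the baseline on 13 of 21 abstractive ICL tasks, maintains performance even though 31.7% of tokens are omitted from the loss, and achieves lower loss at positions that induction heads cannot predict correctly.
Insight: Inductive copying is not the core mechanism behind abstractive ICL; models can learn in-context reasoning in other ways, and suppressing inductive copying may encourage more general learning strategies.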
Abstract: Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they often experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may serve as a prerequisite for more complex in-context learning (ICL) capabilities. In this work, we ask whether transformers can still acquire ICL capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Despite a significant reduction in inductive copying, performance on abstractive ICL tasks (i.e., tasks where the answer is not contained in the input context) remains comparable and surpasses the vanilla model on 13 of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that cannot be predicted correctly by induction heads. Mechanistic analysis further shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities. Taken together, our findings indicate that inductive copying is not essential for learning abstractive ICL mechanisms.
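The idea of dropping induction-predictable tokens from the loss can be sketched with a simple stand-in rule: a position is masked when copying the continuation of the most recent earlier occurrence of the previous token would reproduce the current token. This bigram heuristic is an assumption made for illustration; the paper's actual criterion for "correctly predicted by induction heads" is not described in this digest.

```python
def induction_predictable(tokens):
    """mask[i] is True when the most recent earlier occurrence of
    tokens[i-1] was followed by tokens[i] (a copyable continuation)."""
    last_next = {}  # token -> token that followed its latest occurrence
    mask = [False] * len(tokens)
    for i in range(1, len(tokens)):
        prev = tokens[i - 1]
        if prev in last_next and last_next[prev] == tokens[i]:
            mask[i] = True
        last_next[prev] = tokens[i]
    return mask

def masked_mean_loss(losses, mask):
    """Mean per-token loss over positions that are NOT masked out."""
    kept = [l for l, m in zip(losses, mask) if not m]
    return sum(kept) / len(kept)

tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]
losses = [2.0, 2.0, 2.0, 2.0, 2.0, 0.5, 2.0]  # hypothetical per-token losses

mask = induction_predictable(tokens)
loss = masked_mean_loss(losses, mask)
```

Only the second "cat" is masked, because "the cat" already appeared earlier; the cheap copied token no longer pulls the average loss down, so the training signal comes from positions that copying cannot solve.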
[8] DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Yaxuan Wang,Chris Yuhao Liu,Quan Liu,Jinglong Pang,Wei Wei,Yujia Bao,Yang Liu
Main category: cs.CL
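TL;DR: DRAGON is a systematic, reasoning-based framework that guards deployed large language models (LLMs) with in-context chain-of-thought (CoT) instructions, achieving unlearning without modifying the base model.
Details
Motivation: Existing LLM unlearning methods usually require fine-tuning or rely on retain data, which is often unavailable in practice. DRAGON targets this limitation with a practical approach that needs no retain data.
Contribution: 1) Proposes the DRAGON framework, which achieves unlearning through in-context chain-of-thought and a lightweight detection module; 2) introduces new metrics for evaluating unlearning performance; 3) validates DRAGON's effectiveness and scalability on three representative tasks.
Method: DRAGON uses a lightweight detection module to identify forget-worthy prompts and routes them to a dedicated CoT guard model for safe intervention, without modifying the base model.
Result: Experiments on three representative tasks show strong unlearning capability, scalability, and practical applicability.
Insight: By relying on an LLM's inherent instruction-following ability and in-context reasoning, effective unlearning is possible without retain data.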
Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
[9] Quantifying Edits Decay in Fine-tuned LLMs
Yinjie Cheng,Paul Youssef,Christin Seifert,Jörg Schlötterer,Zhixue Zhao
Main category: cs.CL
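TL;DR: The paper studies how edited knowledge decays when a knowledge-edited LLM is subsequently fine-tuned, characterizes the effect of fine-tuning on edits, and proposes selective-layer fine-tuning to control whether edits survive.
Details
Motivation: Knowledge editing is a lightweight way to correct specific facts in an LLM, while fine-tuning is the default operation for adapting to new domains. Their interaction has not been studied, in particular whether fine-tuning undermines the persistence of edits, which matters for cost and safety in practice.
Contribution: Systematically quantifies the decay of edits after fine-tuning and compares editing methods and fine-tuning strategies. Proposes selective-layer fine-tuning to control whether edited knowledge is retained or removed.
Method: Evaluates two knowledge editing methods (MEMIT, AlphaEdit) and three fine-tuning strategies (full-parameter, LoRA, DoRA) across five LLMs and three datasets, and uses selective-layer fine-tuning to study how edit decay can be controlled.
Result: Fine-tuning causes edits to decay, and the extent varies by configuration (for example, AlphaEdit edits decay more than MEMIT edits). Fine-tuning only the edited layers removes edits effectively, at a slight cost to downstream performance.
Insight: The persistence of edits depends strongly on the fine-tuning strategy and model configuration. Selective-layer fine-tuning is a practical strategy but trades off downstream performance against edit retention. Evaluating model editing requires considering the full LLM application pipeline.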
Abstract: Knowledge editing has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits as shown in Figure 1, current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edits decay after fine-tuning, investigating how fine-tuning affects knowledge editing. We evaluate two state-of-the-art editing methods (MEMIT, AlphaEdit) and three fine-tuning approaches (full-parameter, LoRA, DoRA) across five LLMs and three datasets, yielding 232 experimental configurations. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we propose selective-layer fine-tuning and find that fine-tuning edited layers only can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
[10] Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations
Rui Yang,Matthew Yu Heng Wong,Huitao Li,Xin Li,Wentao Zhu,Jingchi Liao,Kunyu Yu,Jonathan Chong Kai Liew,Weihao Xuan,Yingjian Chen,Yuhe Ke,Jasmine Chiat Ling Ong,Douglas Teodoro,Chuan Hong,Daniel Shi Wei Ting,Nan Liu
Main category: cs.CL
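TL;DR: This paper reviews the state of retrieval-augmented generation (RAG) in medicine, covering technical implementations, clinical applications, and ethical considerations. It finds that medical RAG is still at an early stage and is concentrated on question answering, report generation, text summarization, and information extraction.
Details
Motivation: The rapid growth of medical knowledge and the complexity of clinical practice pose challenges, and the limitations of large language models (LLMs) have prompted researchers to explore RAG to improve their clinical applicability.
Contribution: A comprehensive review of RAG applications in medicine that summarizes current technical implementations, clinical application scenarios, and ethical challenges, and suggests directions for future research.
Method: A literature review of RAG applications in medicine, covering retrieval approaches (mostly English-centric embedding models), generation models (mostly general-purpose LLMs), and evaluation (automated metrics combined with human evaluation).
Result: Research relies mainly on publicly available datasets, with limited use of private data; evaluation centers on generation quality and task performance, with insufficient attention to bias and safety.
Insight: Medical RAG still needs advances in clinical validation, multilingual support, and adaptation to low-resource settings to enable trustworthy global use.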
Abstract: The rapid growth of medical knowledge and increasing complexity of clinical practice pose challenges. In this context, large language models (LLMs) have demonstrated value; however, inherent limitations remain. Retrieval-augmented generation (RAG) technologies show potential to enhance their clinical applicability. This study reviewed RAG applications in medicine. We found that research primarily relied on publicly available data, with limited application in private data. For retrieval, approaches commonly relied on English-centric embedding models, while LLMs were mostly generic, with limited use of medical-specific LLMs. For evaluation, automated metrics evaluated generation quality and task performance, whereas human evaluation focused on accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety. RAG applications were concentrated on question answering, report generation, text summarization, and information extraction. Overall, medical RAG remains at an early stage, requiring advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.
[11] NILC: Discovering New Intents with LLM-assisted Clustering
Hongtao Wang,Renchi Yang,Wenqing Lin
Main category: cs.CL
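TL;DR: This paper proposes NILC, a new intent discovery (NID) framework that incorporates large language models (LLMs) and iteratively refines cluster centroids and text embeddings, yielding significant performance gains.
Details
Motivation: Existing NID methods use a cascaded architecture that fails to exploit feedback between embedding and clustering, and embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance.
Contribution: NILC is an iterative framework that uses LLMs to enrich cluster centroids and augment ambiguous samples, combined with semi-supervised techniques, significantly improving NID performance.
Method: NILC uses LLMs to generate semantic centroids that enrich the Euclidean centroids of embeddings, and rewrites ambiguous samples to correct clusters. In the semi-supervised setting, it adds seeding and soft must-links.
Result: In both unsupervised and semi-supervised settings, NILC significantly outperforms existing baselines on multiple benchmark datasets.
Insight: LLMs can enrich cluster semantics, and combining iterative refinement with semi-supervised techniques effectively improves intent discovery.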
Abstract: New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in practical dialogue systems. Existing works towards NID mainly adopt a cascaded architecture, wherein the first stage focuses on encoding the utterances into informative text embeddings beforehand, while the latter is to group similar embeddings into clusters (i.e., intents), typically by K-Means. However, such a cascaded pipeline fails to leverage the feedback from both steps for mutual refinement, and, meanwhile, the embedding-only clustering overlooks nuanced textual semantics, leading to suboptimal performance. To bridge this gap, this paper proposes NILC, a novel clustering framework specially catered for effective NID. Particularly, NILC follows an iterative workflow, in which clustering assignments are judiciously updated by carefully refining cluster centroids and text embeddings of uncertain utterances with the aid of large language models (LLMs). Specifically, NILC first taps into LLMs to create additional semantic centroids for clusters, thereby enriching the contextual semantics of the Euclidean centroids of embeddings. Moreover, LLMs are then harnessed to augment hard samples (ambiguous or terse utterances) identified from clusters via rewriting for subsequent cluster correction. Further, we inject supervision signals through non-trivial techniques seeding and soft must links for more accurate NID in the semi-supervised setting. Extensive experiments comparing NILC against multiple recent baselines under both unsupervised and semi-supervised settings showcase that NILC can achieve significant performance improvements over six benchmark datasets of diverse domains consistently.
[12] Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs
Renfei Zhang,Manasa Kaniselvan,Niloofar Mireshghallah
Main category: cs.CL
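TL;DR: Reinforcement learning (RL) improves how large language models (LLMs) traverse hierarchical knowledge: RL-enhanced models outperform base and supervised fine-tuned (SFT) models on knowledge recall tasks. The study attributes the gains mainly to better skill at searching and navigating existing knowledge rather than to newly learned knowledge.
Details
Motivation: The conventional view holds that RL improves reasoning and generalization in language models at the expense of memorized knowledge. This paper challenges that view, finding that RL models perform better on knowledge recall tasks, especially those that require traversing hierarchical, structured knowledge.
Contribution: Shows that RL improves performance mainly by strengthening the model's ability to traverse existing knowledge hierarchically, not by adding new knowledge. With structured prompting, SFT models come close to RL models, but RL retains an advantage on deep-retrieval tasks.
Method: The authors compare RL and SFT models and test their hypothesis with structured prompting (explicit hierarchical traversal prompting). They also run a layer-wise internal activation analysis comparing factual representations and query representations across models.
Result: On MedConceptsQA, RL models outperform SFT models by 24pp, and structured prompting narrows the gap to 7pp. RL models remain stronger on deep-retrieval tasks, and the activation analysis indicates that RL mainly changes how knowledge is traversed rather than how it is represented.
Insight: RL's improvements lie in the strategy for traversing and searching knowledge rather than in knowledge storage itself, which highlights the importance of procedural skills for handling structured knowledge.
Abstract: Reinforcement learning (RL) is often credited with improving language model reasoning and generalization at the expense of degrading memorized knowledge. We challenge this narrative by observing that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on pure knowledge recall tasks, particularly those requiring traversal of hierarchical, structured knowledge (e.g., medical codes). We hypothesize these gains stem not from newly acquired data, but from improved procedural skills in navigating and searching existing knowledge hierarchies within the model parameters. To support this hypothesis, we show that structured prompting, which explicitly guides SFTed models through hierarchical traversal, recovers most of the performance gap (reducing 24pp to 7pp on MedConceptsQA for DeepSeek-V3/R1). We further find that while prompting improves final-answer accuracy, RL-enhanced models retain superior ability to recall correct procedural paths on deep-retrieval tasks. Finally, our layer-wise internal activation analysis reveals that while factual representations (e.g., activations for the statement “code 57.95 refers to urinary infection”) maintain high cosine similarity between SFT and RL models, query representations (e.g., “what is code 57.95”) diverge noticeably, indicating that RL primarily transforms how models traverse knowledge rather than the knowledge representation itself.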
[13] Interpretable Recognition of Cognitive Distortions in Natural Language Texts
Anton Kolonin,Anna Arinicheva
Main category: cs.CL
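TL;DR: This paper proposes an interpretable multi-factor classification method based on weighted structured patterns (such as N-grams) for recognizing cognitive distortions in natural language texts.
Details
Motivation: Automated detection of cognitive distortions in psychological care is a socially impactful problem that lacks transparent models. The paper aims to provide an interpretable and robust AI model for it.
Contribution: 1) Proposes a multi-factor classification method based on weighted structured patterns; 2) the model is interpretable and transparent; 3) validates on two publicly available datasets that it outperforms existing methods.
Method: Uses weighted structured patterns such as N-grams, taking into account the heterarchical relationships between them, to build a transparent and robust classifier.
Result: Significantly improves F1 scores on two publicly available datasets; code and models are available to the community.
Insight: Structured patterns and multi-factor analysis can improve performance while keeping the model interpretable, which suits domains that require transparency, such as psychology.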
Abstract: We propose a new approach to multi-factor classification of natural language texts based on weighted structured patterns such as N-grams, taking into account the heterarchical relationships between them, applied to solve such a socially impactful problem as the automation of detection of specific cognitive distortions in psychological care, relying on an interpretable, robust and transparent artificial intelligence model. The proposed recognition and learning algorithms improve the current state of the art in this field. The improvement is tested on two publicly available datasets, with significant improvements over literature-known F1 scores for the task, with optimal hyper-parameters determined, having code and models available for future use by the community.
[14] Revisiting Entropy in Reinforcement Learning for Large Reasoning Models
Renren Jin,Pengzhi Gao,Yuqi Ren,Zhuowen Han,Tongxuan Zhang,Wuwei Huang,Wei Liu,Jian Luan,Deyi Xiong
Main category: cs.CL
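TL;DR: The paper studies the entropy dynamics of large language models (LLMs) under reinforcement learning, identifies key factors behind entropy collapse (such as the number of off-policy updates and training data diversity), and shows that adjusting the loss weights of positive- and negative-advantage tokens regulates entropy effectively.
Details
Motivation: Entropy collapse during RLVR training drives LLMs to suboptimal local minima and limits further gains. Existing work lacks a systematic analysis, so the entropy dynamics and their effects need closer study.
Contribution: 1) A systematic study of LLM entropy dynamics during RLVR training; 2) identification of the key factors that influence entropy; 3) a method for regulating entropy by adjusting the loss weights of positive- and negative-advantage tokens.
Method: Combines theoretical analysis with experiments to study entropy dynamics and their influencing factors in RLVR training, and proposes a loss-weighting strategy for regulating entropy.
Result: Tokens with positive advantages are the primary contributors to entropy collapse, and adjusting the relative loss weights of positive- and negative-advantage tokens effectively regulates model entropy.
Insight: Entropy collapse is closely tied to model performance, and entropy can be regulated through the training strategy, which offers guidance for improving RLVR training.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant approach for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, causing premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To address this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR. Moreover, we theoretically and empirically demonstrate that tokens with positive advantages are the primary contributors to entropy collapse, and that model entropy can be effectively regulated by adjusting the relative loss weights of tokens with positive and negative advantages during training.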
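The reweighting idea at the end of this abstract can be sketched on a plain REINFORCE-style token loss, where each token contributes the negative of advantage times log-probability and the contribution is scaled by a weight that depends on the sign of the advantage. This is a simplified stand-in: the clipped objective the paper refers to is not reproduced, and `pos_weight`, `neg_weight`, and the toy numbers are assumptions.

```python
def weighted_pg_loss(logprobs, advantages, pos_weight=1.0, neg_weight=1.0):
    """Mean token-level policy-gradient loss with separate weights for
    positive-advantage and negative-advantage tokens."""
    total = 0.0
    for lp, adv in zip(logprobs, advantages):
        weight = pos_weight if adv > 0 else neg_weight
        total += -weight * adv * lp
    return total / len(logprobs)

# Hypothetical log-probabilities of sampled tokens and their advantages.
logprobs = [-0.5, -1.0, -2.0]
advantages = [1.0, 1.0, -1.0]

baseline = weighted_pg_loss(logprobs, advantages)
damped = weighted_pg_loss(logprobs, advantages, pos_weight=0.5)
```

Lowering `pos_weight` shrinks the gradient that sharpens the policy around already rewarded tokens, which, by the abstract's finding, is the main driver of entropy collapse.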
[15] ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning
Bingbing Wang,Zhengda Jin,Bin Liang,Jing Li,Ruifeng Xu
Main category: cs.CL
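TL;DR: ReMoD is a framework that rethinks modality contribution in multimodal stance detection through a dual-reasoning paradigm, combining intuitive and reflective reasoning to weight modalities dynamically.
Details
Motivation: Existing multimodal stance detection work simply fuses information across modalities and overlooks how much each modality actually contributes to stance expression, which can introduce stance-misunderstanding noise into learning.
Contribution: Proposes the ReMoD framework, which uses dual reasoning (intuitive and reflective) to adjust modality contributions dynamically and refine stance judgments.
Method: Combines experience-driven intuitive reasoning, which captures initial stance cues, with reflective reasoning (Modality-CoT and Semantic-CoT), which adjusts for modality biases and refines semantic understanding.
Result: Significantly outperforms most baseline models on the MMSD benchmark and shows strong generalization.
Insight: The dual-process theory of human cognition can inform dynamic modality weighting in multimodal tasks.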
Abstract: Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. Therefore, stance misunderstanding noises may be drawn into the stance learning process due to the risk of learning errors by rough modality combination. To address this, we get inspiration from the dual-process theory of human cognition and propose ReMoD, a framework that Rethinks Modality contribution of stance expression through a Dual-reasoning paradigm. ReMoD integrates experience-driven intuitive reasoning to capture initial stance cues with deliberate reflective reasoning to adjust for modality biases, refine stance judgments, and thereby dynamically weight modality contributions based on their actual expressive power for the target stance. Specifically, the intuitive stage queries the Modality Experience Pool (MEP) and Semantic Experience Pool (SEP) to form an initial stance hypothesis, prioritizing historically impactful modalities. This hypothesis is then refined in the reflective stage via two reasoning chains: Modality-CoT updates MEP with adaptive fusion strategies to amplify relevant modalities, while Semantic-CoT refines SEP with deeper contextual insights of stance semantics. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the public MMSD benchmark demonstrate that our ReMoD significantly outperforms most baseline models and exhibits strong generalization capabilities.
[16] Automating Hardware Design and Verification from Architectural Papers via a Neural-Symbolic Graph Framework
Haoyue Yang,Xuanle Zhao,Yujie Liu,Zhuojun Zou,Kailin Lyu,Changchun Zhou,Yao Zhu,Jie Hao
Main category: cs.CL
TL;DR: The ArchCraft framework addresses the challenge of reproducing hardware designs by converting abstract architectural descriptions from academic papers into synthesizable Verilog projects, together with verification and performance evaluation.
Details
Motivation: Reproducing hardware architectures is difficult because source code is rarely public and hardware description languages (HDLs) are complex; automated tools are needed to speed up design and verification.
Contribution: 1) Proposes the ArchCraft framework, which turns papers into verifiable hardware through symbolic graphs and a structured workflow; 2) releases ArchSynthBench, the first benchmark for synthesizing hardware from architectural descriptions.
Method: Uses formal graphs to capture the architectural blueprint and symbols to define the functional specification, generates decoupled RTL and testbench code, and reports PPA metrics.
Result: Experiments show that ArchCraft surpasses VerilogCoder in both paper understanding and code completion, and that the generated code meets timing and performance requirements.
Insight: A neural-symbolic approach can handle the complexity of hardware design, and the benchmark is a useful resource for related problems.
Abstract: The reproduction of hardware architectures from academic papers remains a significant challenge due to the lack of publicly available source code and the complexity of hardware description languages (HDLs). To this end, we propose ArchCraft, a framework that converts abstract architectural descriptions from academic papers into synthesizable Verilog projects with register-transfer level (RTL) verification. ArchCraft introduces a structured workflow, which uses formal graphs to capture the Architectural Blueprint and symbols to define the Functional Specification, translating unstructured academic papers into verifiable, hardware-aware designs. The framework then generates RTL and testbench (TB) code decoupled via these symbols to facilitate verification and debugging, ultimately reporting the circuit’s Power, Area, and Performance (PPA). Moreover, we propose the first benchmark, ArchSynthBench, for synthesizing hardware from architectural descriptions, with a complete set of evaluation indicators, 50 project-level circuits, and around 600 circuit blocks. We systematically assess ArchCraft on ArchSynthBench, where the experiment results demonstrate the superiority of our proposed method, surpassing direct generation methods and the VerilogCoder framework in both paper understanding and code completion. Furthermore, evaluation and physical implementation of the generated executable RTL code show that these implementations meet all timing constraints without violations, and their performance metrics are consistent with those reported in the original papers.
[17] Evaluation of retrieval-based QA on QUEST-LOFT
Nathan Scales,Nathanael Schärli,Olivier Bousquet
Main category: cs.CL
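TL;DR: This paper analyzes the limitations of retrieval-augmented generation (RAG) on QUEST-LOFT questions that require distributed information and complex reasoning, and shows that a structured output format plus answer verification substantially improves RAG performance.
Details
Motivation: Current RAG methods perform poorly on questions that require retrieval across many documents or retrieval combined with complex reasoning, and the LOFT study showed that long-context language models share this limitation. The paper analyzes the problem in depth and proposes improvements.
Contribution: 1. An in-depth analysis of the causes of poor performance on QUEST-LOFT; 2. Updated numbers based on human evaluation; 3. A demonstration that RAG optimized with structured output and answer verification significantly outperforms long-context approaches.
Method: Uses a structured output format containing reasoning and evidence, combined with answer re-verification, to optimize RAG.
Result: The optimized RAG approach significantly outperforms long-context language models on QUEST-LOFT.
Insight: Structured output and verification are key to improving RAG on complex retrieval and reasoning tasks.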
Abstract: Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.
[18] Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
Akshar Tumu,Varad Shinde,Parisa Kordjamshidi
Main category: cs.CL
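TL;DR: This paper proposes the Referring Expression Comprehension task as a platform for evaluating the spatial reasoning of vision-language models (VLMs), and analyzes model behavior under ambiguous object detection, complex spatial expressions, and negation.
Details
Motivation: Current VLMs struggle with spatial reasoning, and existing analyses based on image captioning and visual question answering do not adequately assess spatial understanding. The authors use Referring Expression Comprehension to analyze spatial grounding in more depth.
Contribution: 1) Introduces Referring Expression Comprehension as a new platform for evaluating VLM spatial reasoning; 2) Analyzes model behavior under ambiguous object detection, complex spatial expressions, and negation; 3) Reveals differences between models across spatial semantic categories (topological, directional, proximal, etc.).
Method: Evaluates task-specific architectures and large VLMs on Referring Expression Comprehension, focusing on cases with object detection ambiguity, complex spatial relations, and negation.
Result: All models face challenges on the task; behavior varies by model and by spatial semantic category, exposing limitations in spatial reasoning.
Insight: VLMs still have room to improve in spatial reasoning and language grounding, especially for complex spatial relations and negation; future work should focus on improving spatial understanding.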
Abstract: Spatial Reasoning is an important component of human cognition and is an area in which the latest Vision-language models (VLMs) show signs of difficulty. The current analysis works use image captioning tasks and visual question answering. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation (‘not’). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
[19] BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering
Ryuhei Miyazato,Ting-Ruen Wei,Xuyang Wu,Hsin-Tai Wu,Kei Harada
Main category: cs.CL
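TL;DR: This paper proposes BookAsSumQA, a QA-based framework for evaluating aspect-based book summarization. Experiments show that LLM-based approaches do better on shorter texts, while RAG-based methods are more effective on longer ones.
Details
Motivation: Aspect-based summarization has not been explored for books, mainly because constructing reference summaries for long texts is difficult.
Contribution: Proposes the BookAsSumQA framework, which uses QA pairs to automatically evaluate the quality of aspect-based book summaries.
Method: Automatically generates aspect-specific QA pairs from a narrative knowledge graph and evaluates a summary by its question-answering performance.
Result: LLM-based approaches are more accurate on shorter texts, while RAG-based methods become more effective and practical as documents grow longer.
Insight: RAG-based methods have an advantage for long-text summarization, pointing to a practical direction for book summarization.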
Abstract: Aspect-based summarization aims to generate summaries that highlight specific aspects of a text, enabling more personalized and targeted summaries. However, its application to books remains unexplored due to the difficulty of constructing reference summaries for long text. To address this challenge, we propose BookAsSumQA, a QA-based evaluation framework for aspect-based book summarization. BookAsSumQA automatically generates aspect-specific QA pairs from a narrative knowledge graph to evaluate summary quality based on its question-answering performance. Our experiments using BookAsSumQA revealed that while LLM-based approaches showed higher accuracy on shorter texts, RAG-based methods become more effective as document length increases, making them more efficient and practical for aspect-based book summarization.
[20] Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
Sangmook Lee,Dohyung Kim,Hyukhun Koh,Nakyeong Yang,Kyomin Jung
Main category: cs.CL
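TL;DR: STEER is a confidence-guided, stepwise model routing framework that lowers the compute cost of LLM reasoning while maintaining or improving accuracy.
Details
Motivation: LLM reasoning is powerful but computationally expensive. Existing trained routers lack robustness under domain shift and require expensive labeled data. STEER aims for efficient reasoning without relying on an external router model.
Contribution: Proposes the STEER framework, which uses the small model’s confidence signal to decide dynamically whether the large model is needed, reducing compute overhead. STEER performs well on tasks across multiple domains.
Method: Makes step-level routing decisions from the small model’s logit-based confidence scores, invoking the large model only when necessary. This avoids the training and labeling costs of an external router.
Result: On mathematical reasoning, multi-hop QA, and planning tasks, STEER reduces FLOPs by up to 48% while improving accuracy by up to 20% (both reported on AIME, relative to using only the larger model).
Insight: Model-internal confidence is a robust, domain-agnostic routing signal, offering a scalable path to efficient LLM deployment.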
Abstract: Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model’s logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.
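The step-level routing rule described in this abstract can be illustrated with a minimal sketch. It assumes that confidence is the maximum softmax probability over the small model's logits and that routing uses a fixed threshold; both choices, along with the names `confidence`, `route_step`, and `threshold`, are illustrative assumptions and not STEER's actual scoring function.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(logits):
    """Assumed confidence score: the maximum softmax probability."""
    return max(softmax(logits))

def route_step(small_logits, threshold=0.7):
    """Pick the model for the next reasoning step.

    The small model keeps the step when it is confident enough;
    otherwise the step is deferred to the large model.
    """
    return "small" if confidence(small_logits) >= threshold else "large"
```

With a peaked distribution such as `[5.0, 0.0, 0.0]` the step stays with the small model, while a flat one such as `[1.0, 1.0, 1.0]` is deferred. In practice the threshold would trade accuracy against FLOPs.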
[21] Explicit Knowledge-Guided In-Context Learning for Early Detection of Alzheimer’s Disease
Puzhen Su,Yongzhu Miao,Chunxi Guo,Jintao Tang,Shasha Li,Ting Wang
Main category: cs.CL
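TL;DR: EK-ICL is a novel explicit-knowledge-guided in-context learning framework that integrates structured knowledge to significantly improve early detection of Alzheimer’s disease, addressing failures in task recognition, demonstration selection, and semantic alignment in existing methods.
Details
Motivation: Large language models (LLMs) perform poorly at early detection of Alzheimer’s disease (AD), particularly under out-of-distribution (OOD) and data-scarce conditions. In-context learning (ICL) is effective but suffers from task recognition failure, suboptimal demonstration selection, and mismatched label semantics.
Contribution: Proposes the EK-ICL framework, which introduces explicit knowledge (confidence scores, parsing feature scores, and label word replacement) to improve reasoning stability and task alignment, significantly outperforming existing methods.
Method: Combines confidence scores from small language models (SLMs), parsing feature scores, and label word replacement with a parsing-based retrieval strategy and ensemble prediction.
Result: Experiments on three AD datasets show that EK-ICL significantly outperforms existing fine-tuning and ICL baselines.
Insight: In AD detection, ICL performance is highly sensitive to the alignment of label semantics and task context; explicit knowledge is crucial for low-resource clinical reasoning.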
Abstract: Detecting Alzheimer’s Disease (AD) from narrative transcripts remains a challenging task for large language models (LLMs), particularly under out-of-distribution (OOD) and data-scarce conditions. While in-context learning (ICL) provides a parameter-efficient alternative to fine-tuning, existing ICL approaches often suffer from task recognition failure, suboptimal demonstration selection, and misalignment between label words and task objectives, issues that are amplified in clinical domains like AD detection. We propose Explicit Knowledge In-Context Learners (EK-ICL), a novel framework that integrates structured explicit knowledge to enhance reasoning stability and task alignment in ICL. EK-ICL incorporates three knowledge components: confidence scores derived from small language models (SLMs) to ground predictions in task-relevant patterns, parsing feature scores to capture structural differences and improve demo selection, and label word replacement to resolve semantic misalignment with LLM priors. In addition, EK-ICL employs a parsing-based retrieval strategy and ensemble prediction to mitigate the effects of semantic homogeneity in AD transcripts. Extensive experiments across three AD datasets demonstrate that EK-ICL significantly outperforms state-of-the-art fine-tuning and ICL baselines. Further analysis reveals that ICL performance in AD detection is highly sensitive to the alignment of label semantics and task-specific context, underscoring the importance of explicit knowledge in clinical reasoning under low-resource conditions.
[22] SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization
Yue Huang,Xiangqi Wang,Xiangliang Zhang
Main category: cs.CL
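TL;DR: SPA is an unsupervised LLM alignment framework that uses Self-Priority Alignment to improve helpfulness while preserving safety.
Details
Motivation: In high-stakes scenarios (such as self-harm, legal, or medical queries), LLMs must be both trustworthy and helpful, but these goals often conflict. The authors propose priority alignment, which optimizes helpfulness only after trustworthiness thresholds are met.
Contribution: 1. Proposes the priority alignment paradigm, which forces the LLM to optimize helpfulness subject to meeting trustworthiness thresholds. 2. Designs the unsupervised Self-Priority Alignment (SPA) framework, which generates diverse responses, self-evaluates them, and applies dual-criterion denoising to control variance.
Method: 1. Generate diverse responses and self-evaluate them. 2. Apply dual-criterion denoising to remove inconsistency and control variance. 3. Construct lexicographically ordered preference pairs. 4. Fine-tune with an uncertainty-weighted alignment loss.
Result: Across multiple benchmarks, SPA improves helpfulness without compromising safety, outperforming existing baselines while preserving general capabilities.
Insight: SPA’s priority alignment offers a scalable and interpretable alignment strategy for critical LLM applications, balancing trustworthiness and helpfulness in high-stakes scenarios.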
Abstract: In high-stakes scenarios, such as self-harm, legal, or medical queries, LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict “trustworthy-before-helpful” ordering: optimization of helpfulness is conditioned on first meeting trustworthy thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, self-evaluates and refines them, and applies dual-criterion denoising to remove inconsistency and control variance. From this, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.
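The “trustworthy-before-helpful” ordering can be made concrete with a small sketch of lexicographic preference-pair construction. This is a sketch under stated assumptions, not SPA's implementation: the score fields `trust` and `help`, the threshold value, and the rule that responses failing the threshold are ranked by trust alone are all hypothetical choices made for illustration.

```python
from itertools import combinations

def priority_key(resp, trust_threshold=0.8):
    """Lexicographic sort key: passing the trust threshold dominates.

    Responses that pass are compared by helpfulness; responses that
    fail are compared by trust score (an illustrative assumption).
    """
    passes = resp["trust"] >= trust_threshold
    return (1, resp["help"]) if passes else (0, resp["trust"])

def preference_pairs(responses, trust_threshold=0.8):
    """Return (chosen, rejected) text pairs under the lexicographic order."""
    pairs = []
    for a, b in combinations(responses, 2):
        ka = priority_key(a, trust_threshold)
        kb = priority_key(b, trust_threshold)
        if ka == kb:
            continue  # ties carry no preference signal
        chosen, rejected = (a, b) if ka > kb else (b, a)
        pairs.append((chosen["text"], rejected["text"]))
    return pairs
```

Under this ordering a highly helpful response that fails the trust threshold always loses to one that passes it, which is the property the paradigm is meant to enforce.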
[23] TimeSense: Making Large Language Models Proficient in Time-Series Analysis
Zhirui Zhang,Changhua Pei,Tianyi Gao,Zhe Xie,Yibo Hao,Zhaoyang Yu,Longlong Xu,Tong Xiao,Jing Han,Dan Pei
Main category: cs.CL
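TL;DR: The paper proposes TimeSense, a framework that addresses LLMs’ over-reliance on text supervision at the expense of temporal features in time-series analysis. With a Temporal Sense module and coordinate-based positional embeddings, TimeSense improves the model’s temporal sensitivity and spatial understanding.
Details
Motivation: Existing methods rely heavily on text supervision, biasing models toward textual cues while neglecting temporal features, which can produce outputs that contradict the time-series context.
Contribution: 1. Builds the EvalTS benchmark, comprising 10 tasks across three difficulty levels; 2. Proposes the TimeSense framework, which balances textual reasoning and temporal features through a Temporal Sense module and coordinate-based positional embeddings.
Method: TimeSense introduces a Temporal Sense module that reconstructs the input time series, and uses coordinate-based positional embeddings to enhance spatial understanding.
Result: TimeSense achieves state-of-the-art performance across multiple tasks and stands out on complex multi-dimensional time-series reasoning tasks.
Insight: Balancing textual reasoning with time-series dynamics is key to improving LLM performance in time-series analysis.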
Abstract: In the time-series domain, an increasing number of works combine text with temporal data to leverage the reasoning capabilities of large language models (LLMs) for various downstream time-series understanding tasks. This enables a single model to flexibly perform tasks that previously required specialized models for each domain. However, these methods typically rely on text labels for supervision during training, biasing the model toward textual cues while potentially neglecting the full temporal features. Such a bias can lead to outputs that contradict the underlying time-series context. To address this issue, we construct the EvalTS benchmark, comprising 10 tasks across three difficulty levels, from fundamental temporal pattern recognition to complex real-world reasoning, to evaluate models under more challenging and realistic scenarios. We also propose TimeSense, a multimodal framework that makes LLMs proficient in time-series analysis by balancing textual reasoning with a preserved temporal sense. TimeSense incorporates a Temporal Sense module that reconstructs the input time-series within the model’s context, ensuring that textual reasoning is grounded in the time-series dynamics. Moreover, to enhance spatial understanding of time-series data, we explicitly incorporate coordinate-based positional embeddings, which provide each time point with spatial context and enable the model to capture structural dependencies more effectively. Experimental results demonstrate that TimeSense achieves state-of-the-art performance across multiple tasks, and it particularly outperforms existing methods on complex multi-dimensional time-series reasoning tasks.
[24] HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection
Irina Proskurina,Marc-Antoine Carpentier,Julien Velcin
Main category: cs.CL
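TL;DR: Proposes HatePrototypes, class-level vector representations derived from language models, for efficient detection of explicit and implicit hate speech, and demonstrates cross-task transfer and the effectiveness of a parameter-free early-exit strategy.
Details
Motivation: Existing hate speech benchmarks focus on explicit hate and overlook implicit hate, and repeatedly fine-tuning models is inefficient. A method is needed that captures implicit hate and transfers efficiently.
Contribution: Proposes HatePrototypes, which enable cross-task transfer between explicit and implicit hate speech through class-level vector representations, along with a parameter-free early-exit strategy.
Method: Derives class-level vector representations (HatePrototypes) from a small number of examples (as few as 50 per class) and uses them for cross-task transfer and early-exit detection.
Result: HatePrototypes perform well on both explicit and implicit hate detection, supporting efficient transfer and parameter-free early exiting.
Insight: Class-level vector representations can capture the deeper semantics of implicit hate and transfer efficiently from only a small amount of data.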
Abstract: Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.
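A class-level prototype classifier of the kind described here can be sketched in a few lines. The sketch assumes that a prototype is the mean of a class's example embeddings and that classification picks the prototype with the highest cosine similarity; the abstract does not specify either detail, and the function names and toy two-dimensional vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_prototypes(examples):
    """Map each label to the mean embedding of its examples (assumed rule)."""
    return {label: mean_vector(vecs) for label, vecs in examples.items()}

def classify(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))
```

Because the prototypes are fixed vectors, the same comparison could be applied to the hidden state of an intermediate layer, which is one way a parameter-free early exit might be realized.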
[25] How Well Do LLMs Understand Drug Mechanisms? A Knowledge + Reasoning Evaluation Dataset
Sunil Mohan,Theofanis Karaletsos
Main category: cs.CL
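TL;DR: This paper introduces a dataset that evaluates how well large language models (LLMs) understand drug mechanisms, covering both factual knowledge and reasoning, and shows how models differ under novel situations.
Details
Motivation: In drug development and personalized medicine, LLMs need a deep understanding of drug mechanisms and the ability to reason about them in novel situations. The paper evaluates LLMs on both.
Contribution: 1. Introduces an evaluation dataset covering factual knowledge and reasoning; 2. Shows performance differences between open-world and closed-world settings; 3. Finds that counterfactuals affecting internal links of the reasoning chain are harder.
Method: Uses a dataset of known drug mechanisms and counterfactual reasoning tasks to evaluate and compare LLMs.
Result: o4-mini outperforms the other OpenAI models tested, and Qwen3-4B-thinking comes close to o4-mini, even outperforming it in some cases. The open-world setting is more challenging than the closed-world one, and counterfactuals on internal links are harder.
Insight: LLM reasoning in novel situations still needs improvement, especially for open-world tasks that require the model to recall relevant knowledge on its own.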
Abstract: Two scientific fields showing increasing interest in pre-trained large language models (LLMs) are drug development / repurposing, and personalized medicine. For both, LLMs have to demonstrate factual knowledge as well as a deep understanding of drug mechanisms, so they can recall and reason about relevant knowledge in novel situations. Drug mechanisms of action are described as a series of interactions between biomedical entities, which interlink into one or more chains directed from the drug to the targeted disease. Composing the effects of the interactions in a candidate chain leads to an inference about whether the drug might be useful or not for that disease. We introduce a dataset that evaluates LLMs on both factual knowledge of known mechanisms, and their ability to reason about them under novel situations, presented as counterfactuals that the models are unlikely to have seen during training. Using this dataset, we show that o4-mini outperforms the 4o, o3, and o3-mini models from OpenAI, and the recent small Qwen3-4B-thinking model closely matches o4-mini’s performance, even outperforming it in some cases. We demonstrate that the open world setting for reasoning tasks, which requires the model to recall relevant knowledge, is more challenging than the closed world setting where the needed factual knowledge is provided. We also show that counterfactuals affecting internal links in the reasoning chain present a much harder task than those affecting a link from the drug mentioned in the prompt.
[26] Dutch Metaphor Extraction from Cancer Patients’ Interviews and Forum Data using LLMs and Human in the Loop
Lifeng Han,David Lindevelt,Sander Puts,Erik van Mulligen,Suzan Verberne
Main category: cs.CL
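TL;DR: This paper combines LLMs with human verification to extract metaphors from Dutch cancer patients’ interviews and forum data, with the aim of improving clinician-patient communication and personalized care.
Details
Motivation: Metaphors play an important role in clinician-patient communication, but Dutch-language data and tools are scarce; the paper aims to fill this gap and improve patient care.
Contribution: 1) Builds HealthQuote.NL, a Dutch metaphor corpus; 2) Explores LLM prompting strategies (such as chain-of-thought and few-shot learning); 3) Proposes a human-in-the-loop verification setup.
Method: Uses LLMs (such as GPT) with several prompting strategies (such as chain-of-thought and few-shot learning) to extract metaphors from patient interviews and forum data, with human verification to ensure accuracy.
Result: Produces the high-quality Dutch metaphor corpus HealthQuote.NL, which supports improved clinician-patient communication and the design of personalized care.
Insight: LLMs show strong potential on low-resource language tasks such as Dutch metaphor extraction, but human feedback remains essential for quality verification.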
Abstract: Metaphors and metaphorical language (MLs) play an important role in healthcare communication between clinicians, patients, and patients’ family members. In this work, we focus on Dutch language data from cancer patients. We extract metaphors used by patients using two data sources: (1) cancer patient storytelling interview data and (2) online forum data, including patients’ posts, comments, and questions to professionals. We investigate how current state-of-the-art large language models (LLMs) perform on this task by exploring different prompting strategies such as chain of thought reasoning, few-shot learning, and self-prompting. With a human-in-the-loop setup, we verify the extracted metaphors and compile the outputs into a corpus named HealthQuote.NL. We believe the extracted metaphors can support better patient care, for example shared decision making, improved communication between patients and clinicians, and enhanced patient health literacy. They can also inform the design of personalized care pathways. We share prompts and related resources at https://github.com/aaronlifenghan/HealthQuote.NL
[27] Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models
Mayank Saini,Arit Kumar Bishwas
Main category: cs.CL
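TL;DR: This paper proposes a resource-efficient multimodal intelligence framework in which a learned routing network sends each query to the best-suited expert model, balancing cost and performance and reducing reliance on expensive models.
Details
Motivation: Large language models (LLMs) are costly for multimodal or complex queries, while smaller open-source models are cheap but underperform. A solution is needed that uses resources efficiently while maintaining high performance.
Contribution: Proposes a modular framework that routes queries to the most suitable expert model via a learned routing network, substantially lowering cost while maintaining high accuracy.
Method: Uses a two-stage open-source pipeline combined with efficient classical vision components, and a routing network that balances cost and performance.
Result: On benchmarks such as MMLU and VQA, performance is on par with a single premium model, while reliance on costly models drops by over 67%.
Insight: Modularity and intelligent routing can significantly improve the resource efficiency of multimodal AI while maintaining high performance.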
Abstract: As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real-time, scalable deployment. Conversely, smaller open-source models offer cost advantages but struggle with complex or multimodal queries. We introduce a unified, modular framework that intelligently routes each query - textual, multimodal, or complex - to the most fitting expert model, using a learned routing network that balances cost and quality. For vision tasks, we employ a two-stage open-source pipeline optimized for efficiency and reviving efficient classical vision components where they remain SOTA for sub-tasks. On benchmarks such as Massive Multitask Language Understanding (MMLU) and Visual Question Answering (VQA), we match or exceed the performance of always-premium LLM (monolithic systems with one model serving all query types) performance, yet reduce the reliance on costly models by over 67%. With its extensible, multi-agent orchestration, we deliver high-quality, resource-efficient AI at scale.
[28] SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention
Bohan Yu,Wei Huang,Kang Liu
Main category: cs.CL
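TL;DR: SR-KI is a novel approach for integrating real-time, large-scale structured knowledge bases (KBs) into large language models (LLMs). It uses supervised attention to compress knowledge efficiently and supports dynamic updates, yielding strong retrieval and downstream task performance.
Details
Motivation: Traditional retrieval-augmented generation methods depend on external retrievers and multi-stage pipelines, which makes efficient, dynamic knowledge integration and end-to-end inference difficult. SR-KI aims to solve this.
Contribution: 1. Proposes SR-KI, which encodes KBs into key-value pairs and injects them into the LLM’s KV cache; 2. Adopts a two-stage training paradigm that uses supervised attention to explicitly steer the model toward relevant KB entries; 3. Supports integration of up to 40K KBs on a single GPU, with efficient compression and dynamic updates.
Method: 1. Encode KBs into key-value pairs with a pretrained encoder; 2. Locate a dedicated retrieval layer within the LLM; 3. Apply an attention loss to supervise attention toward relevant KB entries; 4. Perform end-to-end inference with dynamic knowledge updates.
Result: SR-KI integrates 40K KBs into a 7B-parameter LLM, with Recall@10 above 98% on the best-performing task and above 88% on average. Task performance (question answering and KB ID generation) remains strong, with up to 99.75% compression of the injected KBs.
Insight: By compressing knowledge through supervised attention, SR-KI enables end-to-end inference and dynamic updates, offering a new approach to large-scale knowledge integration.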
Abstract: This paper proposes SR-KI, a novel approach for integrating real-time and large-scale structured knowledge bases (KBs) into large language models (LLMs). SR-KI begins by encoding KBs into key-value pairs using a pretrained encoder, and injects them into LLMs’ KV cache. Building on this representation, we employ a two-stage training paradigm: first locating a dedicated retrieval layer within the LLM, and then applying an attention-based loss at this layer to explicitly supervise attention toward relevant KB entries. Unlike traditional retrieval-augmented generation methods that rely heavily on the performance of external retrievers and multi-stage pipelines, SR-KI supports end-to-end inference by performing retrieval entirely within the model’s latent space. This design enables efficient compression of injected knowledge and facilitates dynamic knowledge updates. Comprehensive experiments demonstrate that SR-KI enables the integration of up to 40K KBs into a 7B LLM on a single A100 40GB GPU, and achieves strong retrieval performance, maintaining over 98% Recall@10 on the best-performing task and exceeding 88% on average across all tasks. Task performance on question answering and KB ID generation also demonstrates that SR-KI maintains strong performance while achieving up to 99.75% compression of the injected KBs.
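For readers unfamiliar with the retrieval metric quoted here, Recall@k can be sketched as follows. The sketch assumes one gold KB entry per query and a ranked list of candidate entry IDs per query; the function name and the input layout are illustrative and are not taken from the paper.

```python
def recall_at_k(ranked_ids, gold_ids, k=10):
    """Fraction of queries whose gold entry appears in the top-k ranking.

    ranked_ids: one ranked list of KB entry IDs per query.
    gold_ids:   the single correct KB entry ID per query (assumed).
    """
    if not gold_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)
```

A Recall@10 above 98% therefore means that, for more than 98 of every 100 queries, the correct entry is among the ten highest-ranked candidates.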
[29] Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages
Quang Phuoc Nguyen,David Anugraha,Felix Gaschi,Jun Bin Cheng,En-Shiun Annie Lee
Main category: cs.CL
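TL;DR: The paper studies how word realignment in multilingual models affects low-resource languages (LRLs), and finds that a selective alignment strategy is more efficient and robust than full language coverage.
Details
Motivation: Realignment works well for high-resource languages but is unreliable for low-resource or typologically distant languages, and it depends on high-quality parallel data that is hard to obtain.
Contribution: Shows that realignment on a selected subset of languages can match full multilingual alignment and even improve cross-lingual transfer for unseen LRLs, while reducing data collection overhead.
Method: Runs controlled experiments comparing full multilingual alignment with alignment on selected language subsets, and analyzes the effect of linguistic diversity on LRLs.
Result: Selective realignment is particularly effective for LRLs, and linguistically diverse subsets improve performance on unseen languages while remaining efficient and robust.
Insight: Realignment does not need to cover all languages; an informed language selection strategy can improve results for LRLs while lowering data requirements.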
Abstract: Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.
[30] MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam,Yun-Nung Chen
Main category: cs.CL
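TL;DR: The study finds that audio-based large language models (LLMs) making clinical decisions are swayed by patient voice characteristics (such as age, gender, and emotion), producing significantly different recommendations, an issue that does not arise with text-only input.
Details
Motivation: As LLMs move from text to audio interaction, they may introduce new biases and vulnerabilities in clinical settings through paralinguistic cues, affecting the fairness of medical decisions.
Contribution: Reveals a severe modality bias in audio LLMs for clinical decision-making: significant differences in recommendations between audio and text input, and unequal treatment by patient age and gender.
Method: Synthesizes 170 clinical cases into speech using 36 voice profiles that vary age, gender, and emotion, evaluates audio LLMs on clinical decisions, and compares against text-only input.
Result: Surgical recommendations for audio input differ by up to 35% from text input, and age disparities reach 12%. Explicit reasoning eliminates gender bias, but the effect of emotion was not detected because of poor emotion recognition.
Insight: Bias needs particular attention when audio LLMs are used clinically; bias-aware architectures are urgently needed to avoid worsening healthcare disparities.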
Abstract: As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, the impact of emotion was not detected due to poor recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient’s voice characteristics rather than medical evidence, a flaw that risks perpetuating healthcare disparities. We conclude that bias-aware architectures are essential and urgently needed before the clinical deployment of these models.
[31] Duality-based Mode Operations and Pyramid Multilayer Mapping for Rhetorical Modes
Zi-Niu Wu
Main category: cs.CL
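TL;DR: This paper proposes duality-based mode operations and a pyramid multilayer mapping framework for extending rhetorical modes, and quantifies the resulting expressive diversity and complexity reduction.
Details
Motivation: Rhetorical modes are useful in both academic and non-academic writing, but computational modeling and linguistic research lack a dynamic, measurable system for them. A conceptual bridge between these fields would let each benefit from the others.
Contribution: Proposes four duality-based mode operations (split-unite, forward-backward, expansion-reduction, and orthogonal dualities) that expand the set of rhetorical modes; proposes a pyramid multilayer mapping framework that reduces cognitive complexity; and introduces the Marginal Rhetorical Bit (MRB) to measure the growth rate of expressive diversity.
Method: Quantifies expressive diversity and complexity reduction using binomial combinatorics and Shannon entropy analysis; proposes a pyramid multilayer mapping framework (from the rhetorical model layer to the cognitive layer to the epistemic layer).
Result: Hierarchical selection over smaller subsets of modes markedly reduces choice uncertainty; the MRB makes the system of rhetorical modes more dynamic and measurable.
Insight: The work suggests a path for future AI systems to operate on layered rhetorical reasoning structures in addition to language tokens, connecting linguistic, pedagogical, academic, and computational research.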
Abstract: Rhetorical modes are useful in both academic and non-academic writing, and can be studied within linguistic research and computational modeling. Establishing a conceptual bridge among these domains could enable each to benefit from the others. This paper proposes duality-based mode operations (split-unite, forward-backward, expansion-reduction and orthogonal dualities) to expand the set of rhetorical modes, introducing generated modes like combination and generalization, thereby enhancing epistemic diversity across multiple applications. It further presents a pyramid multilayer mapping framework (e.g., three layers from the rhetorical model layer, to cognitive layer, and to epistemic layers) that reduces the resulting cognitive complexity. The degrees of expressive diversity and complexity reduction are quantified through binomial combinatorics and Shannon entropy analysis. A Marginal Rhetorical Bit (MRB) is identified, permitting the definition of a rhetorical-scalable parameter that measures expressive growth speed in bits per stage. A direct entropy measure shows that hierarchical selection over smaller subsets markedly reduces choice uncertainty compared with flat selection across all modes. These considerations appear to transform static and non-measurable rhetorical taxonomies into more dynamic and more measurable systems for discourse design. From this work, it would be possible to identify a pathway for future AI systems to operate not only on language tokens but on layered rhetorical reasoning structures, bridging linguistic, pedagogical, academic, and computational research.
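The entropy comparison mentioned in this abstract can be illustrated with a toy calculation. It assumes uniform choice probabilities, and the sizes 16 (all modes) and 4 (one subset) are hypothetical numbers chosen for the example; the paper's own mode counts and its definition of the MRB are not reproduced here.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uniform(n):
    """Uniform distribution over n outcomes."""
    return [1.0 / n] * n

# Hypothetical sizes: 16 modes in total, presented in subsets of 4.
flat_bits = shannon_entropy(uniform(16))    # one choice among all modes
subset_bits = shannon_entropy(uniform(4))   # one choice within a subset
```

Under these assumptions a single flat choice carries 4 bits of uncertainty, while each choice inside a four-mode subset carries 2 bits, which is the sense in which selecting over smaller subsets lowers the uncertainty of any individual decision.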
[32] Sentiment Analysis On YouTube Comments Using Machine Learning Techniques Based On Video Games Content
Adi Danish Bin Muhammad Amin,Mohaiminul Islam Bhuiyan,Nur Shazwani Kamarudin,Zulfahmi Toh,Nur Syafiqah Nafis
Main category: cs.CL
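TL;DR: This paper applies machine learning techniques (SVM, Naïve Bayes, and Logistic Regression) to sentiment analysis of YouTube comments on video game content, finds that SVM performs best, and surfaces users’ sentiments and preferences about games.
Details
Motivation: The rapid growth of the gaming industry and the complexity of user sentiment call for analysis of social media such as YouTube to understand how users evaluate and feel about games.
Contribution: Presents a sentiment analysis framework based on YouTube comments and demonstrates the effectiveness of machine learning methods for sentiment analysis in the gaming community, with SVM performing best.
Method: Collects comments on gaming videos via the YouTube API, analyzes them with the TextBlob sentiment analysis tool, then classifies the pre-processed data with Naïve Bayes, Logistic Regression, and SVM.
Result: SVM achieves the highest classification accuracy across datasets; the analysis reveals trends in user sentiment and preferences.
Insight: Advanced sentiment analysis can capture nuanced user emotions and give game developers useful feedback; future work may integrate more sophisticated NLP techniques.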
Abstract: The rapid evolution of the gaming industry, driven by technological advancements and a burgeoning community, necessitates a deeper understanding of user sentiments, especially as expressed on popular social media platforms like YouTube. This study presents a sentiment analysis on video games based on YouTube comments, aiming to understand user sentiments within the gaming community. Utilizing YouTube API, comments related to various video games were collected and analyzed using the TextBlob sentiment analysis tool. The pre-processed data underwent classification using machine learning algorithms, including Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). Among these, SVM demonstrated superior performance, achieving the highest classification accuracy across different datasets. The analysis spanned multiple popular gaming videos, revealing trends and insights into user preferences and critiques. The findings underscore the importance of advanced sentiment analysis in capturing the nuanced emotions expressed in user comments, providing valuable feedback for game developers to enhance game design and user experience. Future research will focus on integrating more sophisticated natural language processing techniques and exploring additional data sources to further refine sentiment analysis in the gaming domain.
[33] Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
Hyunjae Kim,Jiwoong Sohn,Aidan Gilson,Nicholas Cochran-Caggiano,Serina Applebaum,Heeju Jin,Seihee Park,Yujin Park,Jiyeong Park,Seoyoung Choi,Brittany Alexandra Herrera Contreras,Thomas Huang,Jaehoon Yun,Ethan F. Wei,Roy Jiang,Leah Colucci,Eric Lai,Amisha Dave,Tuo Guo,Maxwell B. Singer,Yonghoe Koo,Ron A. Adelman,James Zou,Andrew Taylor,Arman Cohan,Hua Xu,Qingyu Chen
Main category: cs.CL
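TL;DR: Through a large-scale expert evaluation, this paper systematically analyzes retrieval-augmented generation (RAG) in medicine, finding that standard RAG performs poorly in evidence retrieval, evidence selection, and generation, but that simple strategies (such as evidence filtering and query reformulation) substantially improve performance.
Details
Motivation: Addresses two challenges for large language models (LLMs) in medicine: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning.
Contribution: 1. Conducts the most comprehensive expert evaluation of medical RAG to date; 2. Exposes the limitations of standard RAG in medicine; 3. Proposes mitigation strategies (evidence filtering and query reformulation).
Method: Decomposes the RAG pipeline into three stages (evidence retrieval, evidence selection, and response generation) and evaluates 800 model outputs using 80,502 expert annotations.
Result: Standard RAG performs poorly (only 22% of the top-16 retrieved passages were relevant), but the mitigation strategies improve performance by up to 12% on MedMCQA and 8.2% on MedXpertQA.
Insight: RAG for medicine needs stage-aware evaluation and targeted optimization; simple strategies can yield substantial gains.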
Abstract: Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG’s role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
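The evidence-selection figures in this abstract are precision and recall over passages. A minimal sketch of how such numbers could be computed for one query follows; it assumes each passage has an ID and that expert annotations yield a set of relevant passage IDs, and the function name and toy IDs are hypothetical rather than the study's actual protocol.

```python
def precision_recall(selected, relevant):
    """Precision and recall of the passages a model actually used.

    selected: IDs of passages the model drew on in its answer.
    relevant: IDs of passages annotated as relevant (assumed gold set).
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a model uses four passages of which two are relevant, and three relevant passages exist in total, precision is 0.5 and recall is about 0.67.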
[34] Sensitivity of Small Language Models to Fine-tuning Data Contamination
Nicy Scaria,Silvester John Joseph Kennedy,Deepak Subramani
Main category: cs.CL
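TL;DR: This paper systematically studies how sensitive small language models (SLMs) are to data contamination during instruction tuning, revealing asymmetric vulnerability patterns under syntactic versus semantic contamination.
Details
Motivation: SLMs are widely deployed in resource-constrained environments, yet their robustness to data contamination during instruction tuning remains unclear and calls for systematic study.
Contribution: 1. Empirical evidence of SLMs' extreme vulnerability to syntactic pattern contamination; 2. Identification of asymmetric sensitivity patterns between syntactic and semantic contamination; 3. Systematic evaluation protocols for contamination robustness.
Method: Tests 23 SLMs (270M to 4B parameters) by injecting syntactic (character/word reversal) and semantic (irrelevant/counterfactual responses) contamination into instruction tuning, and evaluates model behavior at several contamination ratios.
Result: Syntactic contamination causes catastrophic performance degradation (especially character reversal), whereas semantic contamination shows threshold behavior, with core linguistic capabilities proving more resilient; larger models are more susceptible to semantic contamination (the "capability curse").
Insight: Current robustness assumptions may not hold for small models; contamination-aware training protocols are needed.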
Abstract: Small Language Models (SLMs) are increasingly being deployed in resource-constrained environments, yet their behavioral robustness to data contamination during instruction tuning remains poorly understood. We systematically investigate the contamination sensitivity of 23 SLMs (270M to 4B parameters) across multiple model families by measuring susceptibility to syntactic and semantic transformation types during instruction tuning: syntactic transformations (character and word reversal) and semantic transformations (irrelevant and counterfactual responses), each applied at contamination levels of 25%, 50%, 75%, and 100%. Our results reveal fundamental asymmetries in vulnerability patterns: syntactic transformations cause catastrophic performance degradation, with character reversal producing near-complete failure across all models regardless of size or family, while semantic transformations demonstrate distinct threshold behaviors and greater resilience in core linguistic capabilities. Critically, we discover a “capability curse” where larger, more capable models become more susceptible to learning semantic corruptions, effectively following harmful instructions more readily, while our analysis of base versus instruction-tuned variants reveals that alignment provides inconsistent robustness benefits, sometimes even reducing resilience. Our work establishes three core contributions: (1) empirical evidence of SLMs’ disproportionate vulnerability to syntactic pattern contamination, (2) identification of asymmetric sensitivity patterns between syntactic and semantic transformations, and (3) systematic evaluation protocols for contamination robustness assessment. These findings have immediate deployment implications, suggesting that current robustness assumptions may not hold for smaller models and highlighting the need for contamination-aware training protocols.
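The syntactic transformations named in this abstract are simple to state in code. A minimal sketch, assuming the reversal is applied to the response field of each instruction-tuning example and that a fixed fraction of examples is contaminated; the field names, the `contaminate` helper, and the sampling scheme are illustrative assumptions, not the authors' pipeline:

```python
import random


def reverse_chars(text: str) -> str:
    """Character reversal: reverse the whole string."""
    return text[::-1]


def reverse_words(text: str) -> str:
    """Word reversal: reverse the order of whitespace-separated words."""
    return " ".join(reversed(text.split()))


def contaminate(examples, transform, ratio, seed=0):
    """Apply `transform` to the responses of a `ratio` fraction of examples.

    `examples` is a list of dicts with a "response" key (assumed layout).
    Returns new dicts; the input list is left unchanged.
    """
    rng = random.Random(seed)
    n_bad = round(len(examples) * ratio)
    bad_idx = set(rng.sample(range(len(examples)), n_bad))
    return [
        {**ex, "response": transform(ex["response"])} if i in bad_idx else dict(ex)
        for i, ex in enumerate(examples)
    ]


# Contaminate 25% of a toy dataset with word reversal.
data = [{"instruction": "q", "response": "one two"} for _ in range(8)]
mixed = contaminate(data, reverse_words, ratio=0.25)
```

Sweeping `ratio` over 0.25, 0.5, 0.75, and 1.0 mirrors the contamination levels listed in the abstract.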
[35] SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu,XiaoBing Chen,Jinyu Zhang,Qiongwen Zhang,Yu Zhang,Bailong Yang
Main category: cs.CL
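TL;DR: This paper proposes SAFENLIDB, a privacy-preserving alignment framework that addresses security and privacy problems in LLM-based natural language interfaces to databases.
Details
Motivation: As LLMs are widely adopted in natural language database interfaces, privacy and security problems have become increasingly prominent, and a new approach is needed to prevent data leakage while keeping queries reliable.
Contribution: Proposes the SAFENLIDB framework, which combines implicit security reasoning with SQL generation and improves both security and utility through automatically generated hybrid chain-of-thought data and dedicated optimization methods.
Method: An automated pipeline generates hybrid chain-of-thought data; reasoning warm-up and alternating preference optimization are introduced to overcome the multi-preference oscillations of DPO.
Result: Experiments show that the method outperforms both larger-scale LLMs and ideal-setting baselines, with significant security improvements while preserving high utility.
Insight: Combining implicit security reasoning with automated data generation can substantially improve the security of LLM-based database interfaces without human-annotated preference data.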
Abstract: The rapid advancement of Large Language Models (LLMs) has driven significant progress in Natural Language Interface to Database (NLIDB). However, the widespread adoption of LLMs has raised critical privacy and security concerns. During interactions, LLMs may unintentionally expose confidential database contents or be manipulated by attackers to exfiltrate data through seemingly benign queries. While current efforts typically rely on rule-based heuristics or LLM agents to mitigate this leakage risk, these methods still struggle with complex inference-based attacks, suffer from high false positive rates, and often compromise the reliability of SQL queries. To address these challenges, we propose SafeNLIDB, a novel privacy-security alignment framework for LLM-based NLIDB. The framework features an automated pipeline that generates hybrid chain-of-thought interaction data from scratch, seamlessly combining implicit security reasoning with SQL generation. Additionally, we introduce reasoning warm-up and alternating preference optimization to overcome the multi-preference oscillations of Direct Preference Optimization (DPO), enabling LLMs to produce security-aware SQL through fine-grained reasoning without the need for human-annotated preference data. Extensive experiments demonstrate that our method outperforms both larger-scale LLMs and ideal-setting baselines, achieving significant security improvements while preserving high utility. WARNING: This work may contain content that is offensive and harmful!
[36] EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang,Mingzi Zhang,Xuanyu Yin,Sheng Jin,Suyu Lu,Zuocan Ying,Zengyi Yu,Xiangjie Kong
Main category: cs.CL
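TL;DR: EduGuardBench is a holistic benchmark for evaluating the professional fidelity and adversarial safety of LLMs that simulate professions (SP-LLMs) as teachers; it reveals a stark polarization in model performance and the importance of the Educational Transformation Effect.
Details
Motivation: Existing benchmarks cannot measure role-playing fidelity or address the teaching harms specific to educational scenarios, so a new evaluation framework is needed.
Contribution: Proposes EduGuardBench, comprising a Role-playing Fidelity Score (RFS) and adversarial safety tests, and uses it to reveal polarized model performance and the Educational Transformation Effect.
Method: Assesses professional fidelity with RFS, probes safety with persona-based adversarial prompts, and evaluates with Attack Success Rate (ASR) and a three-tier Refusal Quality assessment.
Result: Model performance is polarized; mid-sized models can be the most vulnerable to attack, and the safest models convert harmful requests into teachable moments.
Insight: The Educational Transformation Effect shows that advanced AI safety lies not only in refusing harmful requests but in turning them into educational opportunities, which is essential for trustworthy AI in education.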
Abstract: Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge, as existing benchmarks fail to measure role-playing fidelity or address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox, where mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. See https://github.com/YL1N/EduGuardBench for Materials.
[37] RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang,Yu Zhang
Main category: cs.CL
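TL;DR: This paper proposes the Reasoning Process Tree Score (RPTS), a tree-structured metric for evaluating the reasoning process of multimodal models. By organizing reasoning steps into a tree and dynamically adjusting weights, RPTS evaluates the overall correctness of the reasoning and pinpoints the step where it fails. The authors build a new benchmark, RPTS-Eval, to validate RPTS and expose the limitations of existing models.
Details
Motivation: Existing multimodal benchmarks focus mainly on answer correctness and overlook the quality of the reasoning process, in particular flawed reasoning behind correct answers and the influence of intermodal relationships.
Contribution: 1. Proposes RPTS, a tree-structured scoring method that dynamically evaluates the reasoning process; 2. Builds the RPTS-Eval benchmark with 374 images and 390 reasoning instances; 3. Defines three types of intermodal relationships and studies their effect on reasoning.
Method: Organizes reasoning steps into a tree and uses its hierarchical information to assign weighted faithfulness scores. Weights are adjusted dynamically to evaluate reasoning correctness and locate failing steps.
Result: Evaluates models such as GPT4o and Llava-Next, uncovering their limitations in multimodal reasoning and comparing open-source with closed-source models.
Insight: RPTS gives a more complete picture of reasoning quality; intermodal relationships significantly affect reasoning, and existing models still have room to improve in multimodal reasoning.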
Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks. However, most of these benchmarks evaluate models primarily through multiple-choice or short-answer formats, which do not take the reasoning process into account. Although some benchmarks assess the reasoning process, their methods are often overly simplistic and only examine reasoning when answers are incorrect. This approach overlooks scenarios where flawed reasoning leads to correct answers. In addition, these benchmarks do not consider the impact of intermodal relationships on reasoning. To address this issue, we propose the Reasoning Process Tree Score (RPTS), a tree structure-based metric to assess reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and leverage its hierarchical information to assign weighted faithfulness scores to each reasoning step. By dynamically adjusting these weights, RPTS not only evaluates the overall correctness of the reasoning, but also pinpoints where the model fails in the reasoning. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances. Each instance includes reliable visual-textual clues that serve as leaf nodes of the reasoning tree. Furthermore, we define three types of intermodal relationships to investigate how intermodal interactions influence the reasoning process. We evaluated representative LVLMs (e.g., GPT4o, Llava-Next), uncovering their limitations in multimodal reasoning and highlighting the differences between open-source and closed-source commercial LVLMs. We believe that this benchmark will contribute to the advancement of research in the field of multimodal reasoning.
[38] HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection
Fangqi Dai,Xingjian Jiang,Zizhuang Deng
Main category: cs.CL
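TL;DR: HLPD introduces an alignment method based on human language preference (HLPO) that shifts the scoring model's token distribution toward human writing style, improving detection of machine-revised text. In adversarial multi-task evaluation, HLPD clearly outperforms baseline methods.
Details
Motivation: To prevent the misinformation and social problems caused by trustworthy-looking LLM-generated content, efficient and reliable methods for identifying text provenance are needed. Existing methods perform poorly on advanced LLM output and adversarially revised text.
Contribution: Proposes HLPD, which uses HLPO to move the scoring model's alignment target toward human writing style and thereby strengthens detection.
Method: HLPD applies a reward-based alignment process (HLPO) that optimizes the scoring model's token distribution to make it more sensitive to human writing. Experiments use an adversarial evaluation framework built on a multi-dimensional prompt generator and several advanced LLMs.
Result: When detecting text revised by GPT-series models, HLPD achieves a 15.11% relative AUROC improvement over ImBD. On text generated by advanced LLMs, its average AUROC exceeds ImBD by 5.53%.
Insight: Human writing has distinctive stylistic patterns; optimizing the alignment target around them markedly improves detection of machine-revised text.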
Abstract: To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model’s token distribution toward human-like writing, making the model more sensitive to human writing, therefore enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.
[39] A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
Siddharth Betala,Kushan Raj,Vipul Betala,Rohan Saswade
Main category: cs.CL
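TL;DR: This paper proposes a two-stage approach that addresses training-data quality problems in multimodal machine translation through automated error detection and correction followed by parameter-efficient fine-tuning.
Details
Motivation: In multimodal machine translation, quality problems in the training data hurt model performance, which calls for systematic error detection and correction.
Contribution: Proposes a vision-augmented judge-corrector pipeline that uses multimodal language models to automatically identify and correct translation errors, then fine-tunes the model efficiently with LoRA.
Method: 1. A judge component classifies translations; 2. Corrector components (GPT-4o-mini and IndicTrans2) handle different error types; 3. The model is fine-tuned with LoRA.
Result: Training on corrected data improves BLEU across language pairs, for example +1.30 for English-Bengali on the evaluation set.
Insight: Visual information plays a key role in multimodal translation, and an automated correction pipeline improves both data quality and model performance.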
Abstract: In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets. Training on corrected data yields consistent improvements, with BLEU score gains of +1.30 for English-Bengali on the evaluation set (42.00 -> 43.30) and +0.70 on the challenge set (44.90 -> 45.60), +0.60 for English-Odia on the evaluation set (41.00 -> 41.60), and +0.10 for English-Hindi on the challenge set (53.90 -> 54.00).
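The judge-corrector routing described in this abstract can be sketched as a small dispatch function. This is an illustration under stated assumptions: the `judge`, `visual_corrector`, and `text_corrector` callables are hypothetical stand-ins for the multimodal judge, GPT-4o-mini, and IndicTrans2, and the label strings and dict keys are invented for the example.

```python
from typing import Callable


def route_example(
    example: dict,
    judge: Callable[[dict], str],
    visual_corrector: Callable[[dict], str],
    text_corrector: Callable[[dict], str],
) -> dict:
    """Route one training example to the corrector matching the judge's label."""
    label = judge(example)
    if label == "correct":
        target = example["target"]           # keep the original translation
    elif label == "visually_ambiguous":
        target = visual_corrector(example)   # regenerate using image context
    elif label == "mistranslated":
        target = text_corrector(example)     # retranslate from the source text
    else:
        raise ValueError(f"unknown judge label: {label!r}")
    return {**example, "target": target, "label": label}


# Usage with stub components standing in for the real models.
example = {"source": "a bat on the table", "target": "old translation"}
fixed = route_example(
    example,
    judge=lambda ex: "visually_ambiguous",
    visual_corrector=lambda ex: "caption regenerated with image",
    text_corrector=lambda ex: "caption retranslated",
)
```

Keeping the judge label on the returned record makes it easy to report the share of corrected captions per language afterwards.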
[40] Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity
Anastasiia Tokareva,Judith Dineley,Zoe Firth,Pauline Conde,Faith Matcham,Sara Siddi,Femke Lamers,Ewan Carr,Carolin Oetzmann,Daniel Leightley,Yuezhou Zhang,Amos A. Folarin,Josep Maria Haro,Brenda W. J. H. Penninx,Raquel Bailon,Srinivasan Vairavan,Til Wykes,Richard J. B. Dobson,Vaibhav A. Narayan,Matthew Hotopf,Nicholas Cummins,The RADAR-CNS Consortium
Main category: cs.CL
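TL;DR: This paper analyzes multilingual spoken-language data to explore associations between interpretable lexical features and symptom severity in major depressive disorder (MDD), and tests the predictive performance of machine learning models. Predictive performance is near chance level, underscoring the need for improved methods and larger samples in future work.
Details
Motivation: Spoken-language data could provide an objective, more regular assessment of MDD symptom severity, whereas prior research has largely relied on complex machine learning approaches with limited interpretability.
Contribution: Identifies lexical features associated with MDD symptoms in multilingual spoken-language data and tests the predictive power of these features and of high-dimensional vector embeddings.
Method: Uses linear mixed-effects modelling on longitudinal speech data, and tests the prediction performance of four regression machine learning models on interpretable features and high-dimensional vector embeddings.
Result: In English data, MDD symptom severity is associated with 7 lexical features; in Dutch only a few associations are observed, and in Spanish none; predictive performance is near chance level.
Insight: Future work needs improved protocols and machine learning models that better capture within- and between-individual variation in language.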
Abstract: Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity with linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four regressor ML models. Results: In English data, MDD symptom severity was associated with 7 features including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may have also limited the effect sizes observable. A lack of NLP tools in languages other than English restricted our feature choice. Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages using improved protocols, and ML models that account for within- and between-individual variations in language.
[41] Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Yauhen Babakhin,Radek Osmulski,Ronay Ak,Gabriel Moreira,Mengyao Xu,Benedikt Schifferer,Bo Liu,Even Oldridge
Main category: cs.CL
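TL;DR: Llama-Embed-Nemotron-8B is a universal multilingual text embedding model that performs strongly on the MMTEB benchmark and comes with open weights and a detailed study of its data and methods.
Details
Motivation: The training data and methodologies of existing models are often not fully disclosed; the authors aim to close this gap by releasing the model, its weights, and the supporting research.
Contribution: 1) An open-source model with detailed ablation studies; 2) A novel data mix of public and synthetic datasets; 3) An instruction-aware embedding model that supports user-defined instructions.
Method: Trains on 16.1 million query-document pairs (public and synthetic data), compares contrastive loss implementations and synthetic data generation strategies, and studies model merging.
Result: Strong performance on retrieval, classification, and semantic textual similarity, with leading results in low-resource and cross-lingual settings.
Insight: Transparency about data and methods matters for progress on model performance, and well-chosen synthetic data can markedly improve generalization.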
Abstract: We introduce llama-embed-nemotron-8b, an open-weights text embedding model that achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark (MMTEB) leaderboard as of October 21, 2025. While recent models show strong performance, their training data or methodologies are often not fully disclosed. We aim to address this by developing a fully open-source model, publicly releasing its weights and detailed ablation studies, and planning to share the curated training datasets. Our model demonstrates superior performance across all major embedding tasks – including retrieval, classification and semantic textual similarity (STS) – and excels in challenging multilingual scenarios, such as low-resource languages and cross-lingual setups. This state-of-the-art performance is driven by a novel data mix of 16.1 million query-document pairs, split between 7.7 million samples from public datasets and 8.4 million synthetically generated examples from various open-weight LLMs. One of our key contributions is a detailed ablation study analyzing core design choices, including a comparison of contrastive loss implementations, an evaluation of synthetic data generation (SDG) strategies, and the impact of model merging. The llama-embed-nemotron-8b is an instruction-aware model, supporting user-defined instructions to enhance performance for specific use-cases. This combination of top-tier performance, broad applicability, and user-driven flexibility enables it to serve as a universal text embedding solution.
[42] Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
Khalil Hennara,Ahmad Bastati,Muhammad Hreden,Mohamed Motasim Hamed,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan
Main category: cs.CL
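TL;DR: This paper presents Wasm, a pipeline for building a structured Arabic multimodal corpus from the Common Crawl dataset. The resulting dataset preserves the integrity of web content, supports multimodal pre-training, and fills a gap in high-quality Arabic multimodal data.
Details
Motivation: The performance of large language models (LLMs) and large multimodal models (LMMs) depends on high-quality, large-scale pre-training datasets. Arabic, however, lacks high-quality multimodal datasets that preserve document structure, which has limited progress.
Contribution: Proposes the Wasm pipeline and creates an Arabic multimodal dataset that preserves document structure and uniquely provides Markdown output, filling a gap left by existing corpora.
Method: Processes the Common Crawl dataset while preserving the structural integrity of web content and supporting both text-only and multimodal pre-training scenarios. The pipeline is compared against those used for other major datasets.
Result: Releases an Arabic multimodal dataset and the processing pipeline to support future research.
Insight: Multimodal datasets that preserve document structure are important for model performance, especially for lower-resource languages such as Arabic.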
Abstract: The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents where images and text are interleaved outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre-trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.
[43] Think Consistently, Reason Efficiently: Energy-Based Calibration for Implicit Chain-of-Thought
Zhikang Chen,Sen Cui,Deheng Ye,Yu Zhang,Yatao Bian,Tingting Zhu
Main category: cs.CL
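TL;DR: This paper proposes EBM-CoT, a framework that calibrates implicit chain-of-thought with an energy-based model to improve the consistency and accuracy of multi-step LLM reasoning.
Details
Motivation: Explicit chain-of-thought (CoT) methods rely on discrete token-level reasoning, which propagates errors and limits expressiveness. Implicit reasoning alleviates these issues but lacks a consistency mechanism, leading to unstable reasoning paths.
Contribution: Proposes the EBM-CoT framework, which uses an energy model to dynamically adjust implicit reasoning paths, improving consistency and efficiency without modifying the base language model.
Method: Uses an energy-based model (EBM) to calibrate reasoning trajectories in latent space, steering them toward lower-energy, high-consistency regions.
Result: On mathematical, commonsense, and symbolic reasoning benchmarks, EBM-CoT significantly improves the consistency and accuracy of reasoning.
Insight: Implicit reasoning needs a consistency calibration mechanism, and energy models are an effective tool for it.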
Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which enables step-by-step intermediate reasoning. However, explicit CoT methods rely on discrete token-level reasoning processes that are prone to error propagation and limited by vocabulary expressiveness, often resulting in rigid and inconsistent reasoning trajectories. Recent research has explored implicit or continuous reasoning in latent spaces, allowing models to perform internal reasoning before generating explicit output. Although such approaches alleviate some limitations of discrete CoT, they generally lack explicit mechanisms to enforce consistency among reasoning steps, leading to divergent reasoning paths and unstable outcomes. To address this issue, we propose EBM-CoT, an Energy-Based Chain-of-Thought Calibration framework that refines latent thought representations through an energy-based model (EBM). Our method dynamically adjusts latent reasoning trajectories toward lower-energy, high-consistency regions in the embedding space, improving both reasoning accuracy and consistency without modifying the base language model. Extensive experiments across mathematical, commonsense, and symbolic reasoning benchmarks demonstrate that the proposed framework significantly enhances the consistency and efficiency of multi-step reasoning in LLMs.
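One way to picture "adjusting latent trajectories toward lower-energy regions" is plain gradient descent on an energy function. The sketch below is a toy under loud assumptions: the energy is a fixed quadratic rather than a learned EBM, the update rule is ordinary gradient descent, and every name is hypothetical. The abstract does not specify the paper's actual energy model or update procedure.

```python
def energy(z, center):
    """Toy quadratic energy 0.5 * ||z - center||^2 (assumption: stands in for a learned EBM)."""
    return 0.5 * sum((zi - ci) ** 2 for zi, ci in zip(z, center))


def energy_grad(z, center):
    """Gradient of the toy energy with respect to z."""
    return [zi - ci for zi, ci in zip(z, center)]


def calibrate(z, center, step_size=0.1, n_steps=20):
    """Move a latent vector toward lower energy with fixed-step gradient descent."""
    z = list(z)
    for _ in range(n_steps):
        grad = energy_grad(z, center)
        z = [zi - step_size * gi for zi, gi in zip(z, grad)]
    return z


# A latent "thought" vector is pulled toward the low-energy region around `center`.
z0 = [2.0, -1.0]
center = [0.0, 0.0]
z1 = calibrate(z0, center)
```

Because only the latent vector is updated, a scheme of this shape leaves the base model's weights untouched, which matches the abstract's claim that the base language model is not modified.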
[44] LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Seungeon Lee,Soumi Das,Manish Gupta,Krishna P. Gummadi
Main category: cs.CL
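TL;DR: LoRA on the Go (LoGo) is a training-free framework for dynamic LoRA selection and merging that adaptively combines adapters at the instance level, improving large language models on multi-task and multi-domain inputs.
Details
Motivation: Conventional LoRA adapters are typically trained for a single task, which limits their usefulness when inputs are diverse. Existing approaches require labeled data or additional training, which is costly.
Contribution: Proposes the LoGo framework, which dynamically selects and merges LoRA adapters from a single forward pass, with no additional training or labeled data, improving multi-task performance.
Method: Uses signals from a forward pass through the LoRA adapters to identify the most relevant adapters and determine their contributions on the fly.
Result: Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines by up to 3.6% on some tasks while remaining competitive on others and maintaining inference throughput.
Insight: LoGo demonstrates the potential of training-free dynamic adapter composition and offers a new route to efficient multi-task inference.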
Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by a margin of up to 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
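Instance-level selection and merging can be illustrated with a small scoring-and-weighting routine. This is a sketch under assumptions, not LoGo itself: the relevance signal here is the L2 norm of each adapter's output for the current input, top-k selection and score-proportional weights are illustrative choices, and all names are hypothetical. The abstract does not state which forward-pass signal the authors use.

```python
def select_and_merge(adapter_outputs, k=2):
    """Pick the k highest-scoring adapters for one input and merge their outputs.

    `adapter_outputs` maps adapter name -> output vector (list of floats) from a
    single forward pass; it is assumed non-empty with equal-length vectors.
    Returns (weights, merged_vector).
    """
    # Assumed relevance signal: L2 norm of each adapter's output.
    scores = {
        name: sum(v * v for v in vec) ** 0.5
        for name, vec in adapter_outputs.items()
    }
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    total = sum(scores[name] for name in top)
    if total == 0:
        # All selected adapters are silent: fall back to uniform weights.
        weights = {name: 1.0 / len(top) for name in top}
    else:
        weights = {name: scores[name] / total for name in top}
    dim = len(next(iter(adapter_outputs.values())))
    merged = [
        sum(weights[name] * adapter_outputs[name][i] for name in top)
        for i in range(dim)
    ]
    return weights, merged


# Three toy adapters; the two with the largest output norms are merged.
outputs = {"math": [3.0, 4.0], "code": [0.0, 1.0], "chat": [0.0, 0.0]}
weights, merged = select_and_merge(outputs, k=2)
```

Since selection happens per input, two consecutive requests can end up with different adapter mixtures without any retraining.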
[45] TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine
Zihao Cheng,Yuheng Lu,Huaiqian Ye,Zeming Liu,Minqi Wang,Jingjing Liu,Zihan Li,Wei Fan,Yuanfang Guo,Ruiji Fu,Shifeng She,Gang Wang,Yunhong Wang
Main category: cs.CL
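TL;DR: TCM-Eval is the first dynamic and extensible benchmark for Traditional Chinese Medicine (TCM), built from national medical licensing examinations and validated by experts; the authors also develop the ZMT model, which significantly exceeds the passing threshold for human practitioners.
Details
Motivation: Large language models perform well in modern medicine, but in TCM they are held back by the lack of standardized benchmarks and high-quality data.
Contribution: Introduces the TCM-Eval benchmark, constructs a large-scale training corpus, proposes the SI-CoTE method for enriching question-answer pairs, and develops the ZMT model.
Method: SI-CoTE (Self-Iterative Chain-of-Thought Enhancement) generates validated reasoning chains through rejection sampling, enabling data and model to co-evolve.
Result: The ZMT model significantly exceeds the examination passing threshold for human practitioners.
Insight: Co-evolution of data and model is key to improving performance in specialized domains.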
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains through rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.
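Rejection sampling for reasoning chains, as mentioned in this abstract, generally means sampling several candidate chains and keeping only those that reach the known answer. A minimal sketch under assumptions: `generate` is a hypothetical callable returning a `(chain, answer)` pair per call, and exact string match against the gold answer is the acceptance test. The paper's actual validation criterion is not given in the abstract.

```python
def rejection_sample_chains(question, gold_answer, generate, n_samples=4):
    """Keep sampled reasoning chains whose final answer matches the gold answer.

    `generate(question)` is assumed to return (reasoning_chain, final_answer).
    """
    kept = []
    for _ in range(n_samples):
        chain, answer = generate(question)
        if answer == gold_answer:  # accept only chains that reach the known answer
            kept.append(chain)
    return kept


# Usage with a stub generator that replays three canned samples.
samples = iter([("chain A", "x"), ("chain B", "y"), ("chain C", "x")])
kept = rejection_sample_chains("q", "x", lambda q: next(samples), n_samples=3)
```

The accepted chains can then be attached to the question-answer pair as training data, and the loop repeated with the retrained model, which is the self-iterative part of the idea.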
[46] Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?
Lynn Greschner,Meike Bauer,Sabine Weber,Roman Klinger
Main category: cs.CL
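TL;DR: The paper compares categorical emotion models with appraisal theories for predicting argument convincingness and finds that appraisals are better suited to analyzing emotion in arguments.
Details
Motivation: The convincingness of an argument depends not only on its structure and the credibility of the speaker but also on the emotion it evokes. Prior work has focused on emotion intensity or categories, while the subjectivity of emotion (such as the recipient's goals and stance) is often overlooked.
Contribution: Presents the first systematic comparison of categorical emotions and appraisal theories for convincingness prediction, demonstrating the advantage of appraisals.
Method: Using the annotations in the ContArgA corpus, runs zero-shot prompting experiments to assess how much emotion categories and appraisals (such as the importance and impact of an argument) contribute to predicting subjective convincingness.
Result: Categorical emotion information improves prediction, but the improvement from appraisals is more pronounced.
Insight: Subjective appraisal of emotion holds more promise for analyzing argument convincingness and opens new directions for theory and practice in computational argumentation.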
Abstract: The convincingness of an argument depends not only on its structure (logos) and the person who makes the argument (ethos), but also on the emotion that it causes in the recipient (pathos). While the overall intensity and categorical values of emotions in arguments have received considerable attention in the research community, we argue that the emotion an argument evokes in a recipient is subjective. It depends on the recipient’s goals, standards, prior knowledge, and stance. Appraisal theories lend themselves as a link between the subjective cognitive assessment of events and emotions. They have been used in event-centric emotion analysis, but their suitability for assessing argument convincingness remains unexplored. In this paper, we evaluate whether appraisal theories are suitable for emotion analysis in arguments by considering subjective cognitive evaluations of the importance and impact of an argument on its receiver. Based on the annotations in the recently published ContArgA corpus, we perform zero-shot prompting experiments to evaluate the importance of gold-annotated and predicted emotions and appraisals for the assessment of the subjective convincingness labels. We find that, while categorical emotion information does improve convincingness prediction, the improvement is more pronounced with appraisals. This work presents the first systematic comparison between emotion models for convincingness prediction, demonstrating the advantage of appraisals and providing insights for theoretical and practical applications in computational argumentation.
[47] AdaRec: Adaptive Recommendation with LLMs via Narrative Profiling and Dual-Channel Reasoning
Meiyun Wang,Charin Polpanumas
Main category: cs.CL
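TL;DR: AdaRec is an adaptive recommendation framework built on large language models (LLMs) that personalizes recommendations through narrative profiling and dual-channel reasoning, requires no manual feature engineering, and performs well in few-shot and zero-shot settings.
Details
Motivation: Traditional recommender systems depend on manual feature engineering and adapt poorly to diverse tasks and sparse data; AdaRec aims to use the semantic expressiveness of LLMs for adaptive recommendation.
Contribution: 1. Proposes narrative profiling, which turns user-item interactions into natural language representations; 2. Designs a dual-channel reasoning architecture (horizontal behavioral alignment and vertical causal attribution); 3. Outperforms baselines in few-shot and zero-shot settings.
Method: 1. Narrative profiling produces natural language representations; 2. Dual-channel reasoning combines horizontal behavioral alignment with vertical causal attribution; 3. Supports rapid cross-task adaptation and lightweight fine-tuning.
Result: On e-commerce datasets, AdaRec improves over machine learning and LLM-based baselines by up to 8% in few-shot settings and over expert-crafted profiling by up to 19% in zero-shot settings; lightweight fine-tuning matches fully fine-tuned models.
Insight: Semantic representations make recommender systems more adaptive and readable, and dual-channel reasoning surfaces multi-dimensional patterns in user behavior, offering a new approach to recommendation on sparse data.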
Abstract: We propose AdaRec, a few-shot in-context learning framework that leverages large language models for adaptive personalized recommendation. AdaRec introduces narrative profiling, transforming user-item interactions into natural language representations to enable unified task handling and enhance human readability. Centered on a bivariate reasoning paradigm, AdaRec employs a dual-channel architecture that integrates horizontal behavioral alignment, discovering peer-driven patterns, with vertical causal attribution, highlighting decisive factors behind user preferences. Unlike existing LLM-based approaches, AdaRec eliminates manual feature engineering through semantic representations and supports rapid cross-task adaptation with minimal supervision. Experiments on real e-commerce datasets demonstrate that AdaRec outperforms both machine learning models and LLM-based baselines by up to eight percent in few-shot settings. In zero-shot scenarios, it achieves up to a nineteen percent improvement over expert-crafted profiling, showing effectiveness for long-tail personalization with minimal interaction data. Furthermore, lightweight fine-tuning on synthetic data generated by AdaRec matches the performance of fully fine-tuned models, highlighting its efficiency and generalization across diverse tasks.
[48] EMODIS: A Benchmark for Context-Dependent Emoji Disambiguation in Large Language Models
Jiacheng Huang,Ning Yu,Xiaoyin Yi
Main category: cs.CL
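TL;DR: EMODIS is a new benchmark for evaluating how well large language models (LLMs) interpret ambiguous emoji under minimal but contrastive textual contexts. Experiments show that even the strongest models frequently fail when only subtle contextual cues are present, revealing systematic biases and limited sensitivity to pragmatic contrast.
Details
Motivation: LLMs are increasingly deployed in real-world communication settings, yet their ability to resolve context-dependent ambiguity is underexplored. The authors use the EMODIS benchmark to fill this gap and highlight the gap in semantic reasoning between LLMs and humans.
Contribution: 1. Proposes the EMODIS benchmark for evaluating LLMs on context-dependent interpretation of ambiguous emoji; 2. Reveals the systematic biases and limitations of existing models when handling subtle contextual cues.
Method: Each EMODIS instance contains an ambiguous sentence, two contrasting contexts, and a question that requires contextual reasoning. Open-source and API-based LLMs are compared to analyze their semantic reasoning.
Result: The strongest models still fail frequently when contextual cues are subtle, showing bias toward dominant interpretations and low sensitivity to pragmatic contrast.
Insight: LLMs lag clearly behind humans in context-dependent semantic reasoning and need further improvement in handling subtle cues and avoiding systematic bias.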
Abstract: Large language models (LLMs) are increasingly deployed in real-world communication settings, yet their ability to resolve context-dependent ambiguity remains underexplored. In this work, we present EMODIS, a new benchmark for evaluating LLMs’ capacity to interpret ambiguous emoji expressions under minimal but contrastive textual contexts. Each instance in EMODIS comprises an ambiguous sentence containing an emoji, two distinct disambiguating contexts that lead to divergent interpretations, and a specific question that requires contextual reasoning. We evaluate both open-source and API-based LLMs, and find that even the strongest models frequently fail to distinguish meanings when only subtle contextual cues are present. Further analysis reveals systematic biases toward dominant interpretations and limited sensitivity to pragmatic contrast. EMODIS provides a rigorous testbed for assessing contextual disambiguation, and highlights the gap in semantic reasoning between humans and LLMs.
[49] Who Is the Story About? Protagonist Entity Recognition in News
Jorge Gabín,M. Eduardo Ares,Javier Parapar
Main category: cs.CL
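TL;DR: The paper proposes the Protagonist Entity Recognition (PER) task, which aims to identify the core organizations that drive a news story, and validates its feasibility by comparing expert annotations with large language model (LLM) predictions.
Details
Motivation: Traditional named entity recognition (NER) fails to distinguish which organizations in the news are the core drivers of the story, limiting downstream tasks that depend on event salience, influence, or narrative focus. Contribution: 1. Introduces the PER task; 2. Annotates a gold corpus and verifies human-LLM agreement; 3. Uses LLMs to automatically label large-scale news data.
Method: 1. Validates PER through expert annotation; 2. Labels data automatically with LLMs via NER-guided prompting; 3. Evaluates the ability of LLMs to infer protagonists from limited context.
Result: PER is a feasible and meaningful task extension, and guided LLMs can approximate human judgments of narrative importance at scale.
Insight: Identifying the core organizations in news narratives helps improve downstream task performance, and with appropriate guidance LLMs can produce high-quality annotations.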
Abstract: News articles often reference numerous organizations, but traditional Named Entity Recognition (NER) treats all mentions equally, obscuring which entities genuinely drive the narrative. This limits downstream tasks that rely on understanding event salience, influence, or narrative focus. We introduce Protagonist Entity Recognition (PER), a task that identifies the organizations that anchor a news story and shape its main developments. To validate PER, we compare the predictions of Large Language Models (LLMs) against annotations from four expert annotators over a gold corpus, establishing both inter-annotator consistency and human-LLM agreement. Leveraging these findings, we use state-of-the-art LLMs to automatically label large-scale news collections through NER-guided prompting, generating scalable, high-quality supervision. We then evaluate whether other LLMs, given reduced context and without explicit candidate guidance, can still infer the correct protagonists. Our results demonstrate that PER is a feasible and meaningful extension to narrative-centered information extraction, and that guided LLMs can approximate human judgments of narrative importance at scale.
[50] RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
Zhiyuan Zeng,Hamish Ivison,Yiping Wang,Lifan Yuan,Shuyue Stella Li,Zhuorui Ye,Siting Li,Jacqueline He,Runlong Zhou,Tong Chen,Chenyang Zhao,Yulia Tsvetkov,Simon Shaolei Du,Natasha Jaques,Hao Peng,Pang Wei Koh,Hannaneh Hajishirzi
Main category: cs.CL
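TL;DR: RLVE is a method for scaling up reinforcement learning for language models via adaptive verifiable environments, dynamically adjusting the problem difficulty distribution to improve learning efficiency. RLVE-Gym is a large-scale suite of 400 verifiable environments; experiments show that it significantly improves model reasoning ability.
Details
Motivation: In conventional reinforcement learning, static problem distributions can cause learning signals to vanish when problems are too easy or too hard. RLVE addresses this with adaptive verifiable environments. Contribution: Proposes the RLVE method and the RLVE-Gym suite, which dynamically adjust environment difficulty and significantly improve the reasoning ability of language models.
Method: Uses the 400 verifiable environments in RLVE-Gym, which adapt dynamically to the capabilities of the policy model, and trains the model jointly across them.
Result: RLVE yields a 3.37% average absolute improvement across six reasoning benchmarks, far more than continuing the original RL training (only 0.49%).
Insight: Environment scaling and dynamic difficulty adjustment are key to improving reinforcement learning efficiency.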
Abstract: We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model’s capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM’s original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
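The adaptive-difficulty idea in this abstract can be illustrated with a toy sketch. This is not the RLVE implementation: the class name `AdaptiveEnv`, the addition task, the accuracy `threshold`, and the `window` size are all assumptions chosen for illustration. The sketch only shows the general pattern of a procedurally generated, algorithmically verifiable problem whose difficulty ceiling rises once the policy is reliably correct at the current ceiling.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AdaptiveEnv:
    """Toy verifiable environment: add two integers with `level` digits each.

    Hypothetical illustration only; names and the update rule are assumptions.
    """
    max_level: int = 1        # hardest difficulty level currently sampled
    threshold: float = 0.9    # accuracy at the top level needed to unlock the next
    window: int = 20          # number of top-level attempts per evaluation
    _results: list = field(default_factory=list)

    def sample(self, rng):
        """Procedurally generate a problem at a level in [1, max_level]."""
        level = rng.randint(1, self.max_level)
        lo, hi = 10 ** (level - 1), 10 ** level - 1
        return level, (rng.randint(lo, hi), rng.randint(lo, hi))

    def verify(self, problem, answer):
        """Algorithmically verifiable reward: True if the sum is correct."""
        a, b = problem
        return answer == a + b

    def update(self, level, correct):
        """Track accuracy at the top level; raise the ceiling when it is high."""
        if level != self.max_level:
            return
        self._results.append(bool(correct))
        if len(self._results) >= self.window:
            accuracy = sum(self._results) / len(self._results)
            if accuracy >= self.threshold:
                self.max_level += 1
            self._results.clear()
```

In a training loop, one would call `sample`, score the policy's answer with `verify`, and pass the outcome to `update`, so that problems that have become too easy stop dominating the distribution.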
[51] FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation
Song Jin,Shuqi Li,Shukun Zhang,Rui Yan
Main category: cs.CL
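TL;DR: This paper formulates the Equity Research Report (ERR) generation task for the first time and provides an open-source evaluation benchmark, FinRpt, comprising a dataset construction pipeline and 11 evaluation metrics. The authors also propose FinRpt-Gen, a multi-agent framework whose LLM-based agents are trained with supervised fine-tuning and reinforcement learning; experiments demonstrate its effectiveness.
Details
Motivation: Although LLMs perform well on financial tasks such as stock prediction and question answering, their use in automating equity research report generation remains unexplored. Data scarcity and the absence of evaluation metrics also hinder progress in this area. Contribution: 1. Formulates the ERR generation task for the first time; 2. Builds the open-source evaluation benchmark FinRpt, including a high-quality dataset and 11 evaluation metrics; 3. Proposes the multi-agent framework FinRpt-Gen.
Method: 1. Designs a dataset construction pipeline that integrates 7 financial data types; 2. Proposes 11 evaluation metrics; 3. Trains the LLM-based multi-agent framework FinRpt-Gen with supervised fine-tuning and reinforcement learning.
Result: Experiments demonstrate the quality of the FinRpt dataset, the effectiveness of the evaluation metrics, and the strong performance of FinRpt-Gen on ERR generation.
Insight: With a multi-agent framework and a comprehensive evaluation system, FinRpt-Gen demonstrates the potential of the ERR generation task and offers a new direction for research in this area.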
Abstract: While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address the data scarcity and the evaluation metrics absence, we present an open-source evaluation benchmark for ERR generation - FinRpt. We frame a Dataset Construction Pipeline that integrates 7 financial data types and produces a high-quality ERR dataset automatically, which could be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent framework specifically tailored to address this task, named FinRpt-Gen, and train several LLM-based agents on the proposed datasets using Supervised Fine-Tuning and Reinforcement Learning. Experimental results indicate the data quality and metrics effectiveness of the benchmark FinRpt and the strong performance of FinRpt-Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.
[52] Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Pingjie Wang,Hongcheng Liu,Yusheng Liao,Ziqing Fan,Yaxin Du,Shuo Tang,Yanfeng Wang,Yu Wang
Main category: cs.CL
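TL;DR: The paper proposes NTK-Selector, a framework based on neural tangent kernels (NTK) for efficiently selecting general-domain auxiliary data to improve performance in low-resource domains. It addresses the theoretical and computational problems of applying NTK directly to LLMs and significantly improves model performance in experiments across several low-resource domains.
Details
Motivation: Applying large language models in low-resource domains faces data scarcity and a high risk of overfitting. The authors find that general-domain data can serve as auxiliary supervision, but the key question is how to efficiently select the most valuable auxiliary data. Contribution: Proposes the NTK-Selector framework, which selects auxiliary data via NTK, addresses the theoretical and computational challenges of applying NTK to LLMs, and significantly improves performance in experiments.
Method: An NTK-based approach that includes verifying NTK-like behavior of LLMs during LoRA fine-tuning and proposing a Jacobian-free approximation.
Result: Across four low-resource domains (medical, financial, legal, and psychological), NTK-Selector significantly improves model performance; for example, Llama3-8B-Instruct and Qwen3-8B gain 8.7 and 5.1 points, respectively.
Insight: LLMs exhibit NTK-like behavior during LoRA fine-tuning, and well-chosen general-domain data can significantly improve model performance in low-resource domains.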
Abstract: Large language models (LLMs) have achieved remarkable success across widespread tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, there exist vast amounts of similar general-domain data, and our initial findings reveal that they could potentially serve as auxiliary supervision for domain enhancement. This observation leads us to our central research question: how to effectively select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable due to a lack of large in-domain data pools or validation sets. To address this, we propose NTK-Selector, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via neural tangent kernels (NTK). Our method tackles two challenges of directly applying NTK to LLMs, theoretical assumptions and prohibitive computational cost, by empirically demonstrating a stable NTK-like behavior in LLMs during LoRA fine-tuning and proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone only yielded +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led to substantial gains of +8.7 and +5.1 points, which corresponds to a 10.9x and 5.7x improvement over the domain-only setting.
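As a quick consistency check on the multipliers quoted in this abstract, and assuming they are simply the ratio of the gain with auxiliary data to the gain from in-domain data alone (the abstract does not spell out the formula), the arithmetic works out as follows:

```latex
\[
\text{Llama3-8B-Instruct: } \frac{8.7}{0.8} = 10.875 \approx 10.9,
\qquad
\text{Qwen3-8B: } \frac{5.1}{0.9} \approx 5.67 \approx 5.7
\]
```

Under that reading, both reported factors agree with the point gains stated in the same sentence.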
[53] Retriv at BLP-2025 Task 2: Test-Driven Feedback-Guided Framework for Bangla-to-Python Code Generation
K M Nafi Asib,Sourav Saha,Mohammed Moshiul Hoque
Main category: cs.CL
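TL;DR: The paper proposes a method that combines instruction prompting with test-driven, feedback-guided iterative refinement for Bangla-to-Python code generation, and placed 2nd in BLP-2025 Task 2.
Details
Motivation: Large language models have advanced natural-language-to-code generation, but low-resource languages such as Bangla still perform poorly because data and evaluation benchmarks are scarce. Contribution: Proposes a test-driven, feedback-guided iterative refinement framework built on Qwen2.5-14B, which significantly improves the accuracy of Bangla-to-code generation.
Method: Combines instruction prompting with unit-test feedback, refining the generated code over three iterations, with test feedback guiding each correction.
Result: Placed 2nd in BLP-2025 Task 2 with a Pass@1 score of 0.934.
Insight: Bangla instruction understanding and code generation pose distinct challenges, and targeted methods for low-resource languages are especially important.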
Abstract: Large Language Models (LLMs) have advanced the automated generation of code from natural language prompts. However, low-resource languages (LRLs) like Bangla remain underrepresented due to the limited availability of instruction-to-code datasets and evaluation benchmarks. To address this, the BLP Workshop at IJCNLP-AACL 2025 introduced a shared task on “Code Generation in Bangla”. In this work, we propose a method that combines instruction prompting with a test-driven, feedback-guided iterative refinement process using a fine-tuned Qwen2.5-14B model. The model generates code from Bangla instructions, tests it against unit tests, and iteratively refines any failing outputs through three evaluation passes, using test feedback to guide each step. This approach helped our team “Retriv” to secure 2nd place in the shared task with a Pass@1 score of 0.934. The analysis highlights challenges in Bangla instruction understanding and Python code generation, emphasizing the need for targeted methods in LRLs. We made experimental scripts publicly available for the community.
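The generate, test, and refine loop described in this abstract can be sketched roughly as below. This is an illustrative outline, not the team's pipeline: `run_tests`, `refine`, and the `generate` callable (standing in for the fine-tuned model) are hypothetical names, unit tests are assumed to be assert-style strings, and `max_passes=3` is one reading of the "three evaluation passes" mentioned above.

```python
def run_tests(code, tests):
    """Run candidate code, then each assert-style test; return failure messages.

    Note: exec on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append(f"{test} -> {exc!r}")
    return failures


def refine(instruction, tests, generate, max_passes=3):
    """Regenerate code until all tests pass or the pass budget is spent.

    `generate(instruction, feedback)` is a hypothetical stand-in for a model
    call; `feedback` holds the failure messages from the previous pass.
    """
    code, feedback = "", []
    for _ in range(max_passes):
        code = generate(instruction, feedback)
        feedback = run_tests(code, tests)
        if not feedback:  # every unit test passed
            break
    return code, feedback
```

The function returns the last candidate together with any remaining failures, so a caller can tell a passing solution (empty feedback) from one that exhausted its passes.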
[54] Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction
Hyeryun Park,Byung Mo Gu,Jun Hee Lee,Byeong Hyeon Choi,Sekeun Kim,Hyun Koo Kim,Kyungsang Kim
Main category: cs.CL
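TL;DR: The paper proposes a voice-directed Surgical Agent Orchestration Platform (SAOP) built on a multi-agent framework, addressing the problem that surgeons in da Vinci robotic surgery cannot access multimodal patient data without interrupting the procedure.
Details
Motivation: In da Vinci robotic surgery, the surgeon's hands and eyes are fully occupied with the procedure, making it difficult to access and manipulate multimodal patient data seamlessly. A data interaction method that does not interrupt the surgical workflow is therefore needed. Contribution: 1) Proposes the SAOP platform based on a hierarchical multi-agent framework; 2) Designs task-specific agents driven by large language models (LLMs); 3) Introduces the Multi-level Orchestration Evaluation Metric (MOEM) for comprehensive performance evaluation.
Method: The platform consists of one orchestration agent and three task-specific agents. The LLM-driven agents map voice commands to concrete tasks (such as retrieving information or manipulating CT scans) through autonomous planning, validation, and reasoning.
Result: SAOP achieves high accuracy and success rates across 240 voice commands, and the LLM-driven agents improve robustness to speech recognition errors and diverse free-form commands.
Insight: SAOP shows potential to support seamless data interaction in minimally invasive da Vinci robotic surgery, and LLM-driven agents handle complex commands well.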
Abstract: In da Vinci robotic surgery, surgeons’ hands and eyes are fully engaged in the procedure, making it difficult to access and manipulate multimodal patient data without interruption. We propose a voice-directed Surgical Agent Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework, consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to map voice commands into specific tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models on the surgical video. We also introduce a Multi-level Orchestration Evaluation Metric (MOEM) to comprehensively assess the performance and robustness from command-level and category-level perspectives. The SAOP achieves high accuracy and success rates across 240 voice commands, while LLM-based agents improve robustness against speech recognition errors and diverse or ambiguous free-form commands, demonstrating strong potential to support minimally invasive da Vinci robotic surgery.
[55] ConvFill: Model Collaboration for Responsive Conversational Voice Agents
Vidya Srinivas,Zachary Englhardt,Maximus Powers,Shwetak Patel,Vikram Iyer
Main category: cs.CL
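TL;DR: ConvFill proposes a model-collaboration approach in which a lightweight on-device model works together with a cloud model, resolving the trade-off between response speed and knowledge depth in conversational voice agents and delivering a low-latency yet rich conversational experience.
Details
Motivation: Large language models deployed in the cloud suffer from high latency, while on-device models respond quickly but have limited capability. ConvFill aims to combine the strengths of both to provide a fast and knowledgeable conversational experience. Contribution: 1) Proposes the conversational infill task, enabling collaboration between on-device and cloud models; 2) Designs the ConvFill model, which significantly improves the conversational ability of small models while keeping latency low.
Method: ConvFill is a 360M-parameter model trained on synthetic multi-domain conversation data. The on-device model generates the initial dialogue and incorporates streaming knowledge from the cloud model in real time, enabling seamless collaboration.
Result: Experiments show that ConvFill improves accuracy by 36-42% over on-device models of the same size, with response latency consistently below 200 ms.
Insight: Model collaboration is an effective way to resolve the trade-off between on-device and cloud models and could be extended to more complex multimodal conversational scenarios.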
Abstract: Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36-42% over standalone small models of the same size while consistently retaining sub-200ms response latencies. Our results demonstrate the promise of this approach for building on-device conversational agents that are both immediately responsive and knowledgeable.
cs.CV [Back]
[56] Randomized-MLP Regularization Improves Domain Adaptation and Interpretability in DINOv2
Joel Valdivia Ortega,Lorenz Lamm,Franziska Eckardt,Benedikt Schworm,Marion Jasnin,Tingying Peng
Main category: cs.CV
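TL;DR: The paper proposes Randomized-MLP (RMLP) regularization to improve the domain adaptation and interpretability of DINOv2, particularly in medical imaging.
Details
Motivation: Vision Transformers (ViTs) such as DINOv2 perform well across domains, but the way they handle low-information patch tokens can harm the interpretability of attention and feature maps. Domain shift in medical imaging further exacerbates this challenge. Contribution: 1. Proposes RMLP regularization, a contrastive-learning-based method that encourages semantically aligned representations; 2. Applies RMLP when fine-tuning DINOv2, improving domain adaptation and the interpretability of attention maps; 3. Provides a mathematical analysis of RMLP, deepening the understanding of contrastive learning.
Method: Fine-tunes DINOv2 with RMLP regularization combined with contrastive learning, targeting both medical and natural image domains.
Result: Experiments show that RMLP maintains or improves downstream performance while producing more interpretable attention maps.
Insight: By encouraging semantically aligned representations, RMLP regularization improves the domain adaptation and interpretability of ViTs, especially in tasks with pronounced domain shift such as medical imaging.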
Abstract: Vision Transformers (ViTs), such as DINOv2, achieve strong performance across domains but often repurpose low-informative patch tokens in ways that reduce the interpretability of attention and feature maps. This challenge is especially evident in medical imaging, where domain shifts can degrade both performance and transparency. In this paper, we introduce Randomized-MLP (RMLP) regularization, a contrastive learning-based method that encourages more semantically aligned representations. We use RMLPs when fine-tuning DINOv2 to both medical and natural image modalities, showing that it improves or maintains downstream performance while producing more interpretable attention maps. We also provide a mathematical analysis of RMLPs, offering insights into its role in enhancing ViT-based models and advancing our understanding of contrastive learning.
[57] Token Is All You Need: Cognitive Planning through Sparse Intent Alignment
Shiyao Sang
Main category: cs.CV
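TL;DR: The paper proposes a new planning approach for end-to-end autonomous driving that achieves high-performance planning with sparse semantic tokens rather than complex scene modeling; experiments show that it outperforms conventional methods.
Details
Motivation: Challenges the assumption that end-to-end autonomous driving depends on complex scene modeling, and explores whether efficient planning can be achieved with sparse, semantically rich tokens. Contribution: 1. Shows that sparse tokens suffice for high-performance driving planning; 2. Proposes token-conditioned trajectory decoding, which significantly improves performance; 3. Reveals the limitations of explicit reconstruction loss under reliable perception inputs.
Method: Uses perception-informed BEV representations, models task-relevant semantics with sparse semantic tokens, and conditions trajectory decoding on predicted future tokens to improve performance.
Result: On the nuPlan benchmark, ADE is reduced to 0.479 m, a 12.6% improvement over the baseline, and a temporal fuzziness phenomenon is observed.
Insight: Sparse semantic tokens adaptively attend to task-relevant semantics rather than aligning to fixed timestamps, providing a cognitive advantage for planning under uncertainty.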
Abstract: We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Unlike world-model approaches that rely on computationally intensive future scene generation or vision-language-action (VLA) systems constrained by Markov assumptions, we show that a minimal set of semantically rich tokens is sufficient for effective planning. Experiments on the nuPlan benchmark (720 scenarios, over 11,000 samples) using perception-informed BEV representations yield three key findings: (1) even without future prediction, our sparse representation achieves 0.548 m ADE, comparable to or surpassing prior methods reporting around 0.75 m on nuScenes; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.479 m, a 12.6% improvement over current-state baselines; and (3) explicit reconstruction loss offers no benefit and may degrade performance under reliable perception inputs. Notably, we observe the emergence of temporal fuzziness, where the model adaptively attends to task-relevant semantics rather than aligning rigidly to fixed timestamps, providing a cognitive advantage for planning under uncertainty. Our “token is all you need” principle marks a paradigm shift from reconstructing the world to understanding it, laying a foundation for cognitively inspired systems that plan through imagination rather than reaction.
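For readers unfamiliar with the metric, average displacement error (ADE) is conventionally the mean Euclidean distance between predicted and ground-truth waypoints. The sketch below uses that standard definition; the function names are illustrative, and treating the 0.548 m result as the baseline for the 12.6% figure is an assumption, since the abstract only says "current-state baselines".

```python
import math


def average_displacement_error(pred, target):
    """Mean Euclidean distance between matching predicted and true waypoints."""
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline value."""
    return (baseline - improved) / baseline


# Toy trajectory: the second waypoint is off by 1 m, the first is exact.
toy_ade = average_displacement_error([(0.0, 0.0), (1.0, 0.0)],
                                     [(0.0, 0.0), (1.0, 1.0)])

# Assumed reading of the abstract: 0.548 m (no future tokens) vs 0.479 m.
gain = relative_improvement(0.548, 0.479)
```

Under that assumed pairing, `gain` comes out at roughly 0.126, which is consistent with the 12.6% improvement quoted above.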
[58] In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing
Qiaojie Zheng,Jiucai Zhang,Xiaoli Zhang
Main category: cs.CV
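TL;DR: The paper proposes using vision-language models (VLMs) with in-context learning (ICL) for quality assessment in metal additive manufacturing, without requiring large datasets and with interpretable rationales for its decisions.
Details
Motivation: Conventional vision-based quality assessment requires large datasets and dedicated models, making data collection and training costly. This paper aims to address this with VLMs and ICL, improving efficiency and transparency. Contribution: 1. Combines ICL with VLMs, removing the need for large application-specific datasets; 2. Proposes sampling strategies to optimize the ICL configuration; 3. Introduces two metrics, knowledge relevance and rationale validity, to evaluate the interpretability of VLMs.
Method: Leverages the reasoning capabilities of VLMs (Gemini-2.5-flash and Gemma3:27b), using ICL to supply domain knowledge and demonstration samples, to perform quality classification with limited samples.
Result: On wire-laser direct energy deposition tasks, ICL-assisted VLMs reach classification accuracy comparable to traditional machine learning while generating interpretable rationales for their decisions.
Insight: VLMs combined with ICL perform well on data-limited tasks and provide a transparent decision process, with potential for wider adoption in manufacturing.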
Abstract: Vision-based quality assessment in additive manufacturing often requires dedicated machine learning models and application-specific datasets. However, data collection and model training can be expensive and time-consuming. In this paper, we leverage vision-language models’ (VLMs’) reasoning capabilities to assess the quality of printed parts and introduce in-context learning (ICL) to provide VLMs with necessary application-specific knowledge and demonstration samples. This method eliminates the requirement for large application-specific datasets for training models. We explored different sampling strategies for ICL to search for the optimal configuration that makes use of limited samples. We evaluated these strategies on two VLMs, Gemini-2.5-flash and Gemma3:27b, with quality assessment tasks in wire-laser direct energy deposition processes. The results show that ICL-assisted VLMs can reach quality classification accuracies similar to those of traditional machine learning models while requiring only a minimal number of samples. In addition, unlike traditional classification models that lack transparency, VLMs can generate human-interpretable rationales to enhance trust. Since there are no metrics to evaluate their interpretability in manufacturing applications, we propose two metrics, knowledge relevance and rationale validity, to evaluate the quality of VLMs’ supporting rationales. Our results show that ICL-assisted VLMs can address application-specific tasks with limited data, achieving relatively high accuracy while also providing valid supporting rationales for improved decision transparency.
[59] EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
Xinyan Cai,Shiguang Wu,Dafeng Chi,Yuzheng Zhuang,Xingyue Quan,Jianye Hao,Qiang Guan
Main category: cs.CV
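TL;DR: EVLP proposes a unified vision-language planning framework that jointly models linguistic reasoning and visual synthesis through dynamic pretraining and reinforced alignment training, addressing inconsistent multimodal planning in long-horizon tasks.
Details
Motivation: Current methods lack a unified generation framework for multimodal planning, which causes inconsistency between linguistic reasoning and visual-spatial imagination and hurts task decomposition and execution. EVLP aims to solve this through joint modeling. Contribution: 1) Proposes a unified multimodal generation framework that combines semantic information with spatial features; 2) Designs a dynamic perception pretraining strategy to strengthen multimodal correlations; 3) Introduces reinforced supervised fine-tuning to align the spatial logic between textual actions and generated images.
Method: 1) The unified generation framework models language and vision through cross-modal attention; 2) Dynamic pretraining uses a bidirectional dynamic alignment strategy; 3) Reinforced fine-tuning aligns spatial logic through a reinforce loss.
Result: EVLP demonstrates efficiency and accuracy in multimodal task planning, addressing the inconsistency between language and vision in long-horizon tasks.
Insight: Planning for multimodal tasks requires joint modeling of language and vision; dynamic alignment and reinforced fine-tuning are effective means to this end.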
Abstract: In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, leading to inconsistent multimodal plans. To address this challenge, we present EVLP (Embodied Vision-Language Planner), an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1) Unified Multimodal Generation Framework: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. 2) Dynamic Perception Pretraining: We propose a bidirectional dynamic alignment strategy employing inverse dynamics tasks and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. 3) Reinforced Supervised Fine-Tuning: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforce loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially aware multimodal planning capabilities.
[60] MCFCN: Multi-View Clustering via a Fusion-Consensus Graph Convolutional Network
Chenping Pei,Fadi Dornaika,Jingjun Bi
Main category: cs.CV
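TL;DR: MCFCN proposes a multi-view clustering method based on a fusion-consensus graph convolutional network. It learns the consensus graph of multi-view data end to end, combining a view feature fusion model with a unified graph structure adapter, to address the neglect of data topology and the sensitivity to noise in existing methods, and it significantly improves clustering performance.
Details
Motivation: Existing multi-view clustering (MVC) methods usually focus on consensus representation learning but neglect the inherent topological structure of the data. In addition, GNN-based MVC methods are susceptible to noise in their input graph structures, while multi-view graph refinement methods suffer from issues such as insufficient cross-view consistency. MCFCN aims to address these limitations through a fusion-consensus mechanism. Contribution: 1. Proposes the fusion-consensus graph convolutional network (MCFCN), which learns the consensus graph of multi-view data end to end; 2. Designs the Similarity Matrix Alignment Loss (SMAL) and Feature Representation Alignment Loss (FRAL) to optimize view-specific graphs; 3. Preserves cross-view topological consistency and achieves effective consensus representation learning.
Method: MCFCN learns effective consensus representations through a view feature fusion model and a Unified Graph Structure Adapter (UGA), and designs the SMAL and FRAL losses to optimize view-specific graphs. A GCN is used to improve clustering performance.
Result: MCFCN achieves state-of-the-art clustering performance on eight multi-view benchmark datasets, and its effectiveness is verified through extensive qualitative and quantitative experiments.
Insight: The fusion-consensus mechanism of MCFCN not only mitigates noise interference but also significantly improves multi-view clustering through cross-view consistency optimization, offering new ideas for follow-up research.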
Abstract: Existing Multi-view Clustering (MVC) methods based on subspace learning focus on consensus representation learning while neglecting the inherent topological structure of data. Despite the integration of Graph Neural Networks (GNNs) into MVC, their input graph structures remain susceptible to noise interference. Methods based on Multi-view Graph Refinement (MGRC) also have limitations such as insufficient consideration of cross-view consistency, difficulty in handling hard-to-distinguish samples in the feature space, and disjointed optimization processes caused by graph construction algorithms. To address these issues, a Multi-View Clustering method via a Fusion-Consensus Graph Convolutional Network (MCFCN) is proposed. The network learns the consensus graph of multi-view data in an end-to-end manner and learns effective consensus representations through a view feature fusion model and a Unified Graph Structure Adapter (UGA). It designs Similarity Matrix Alignment Loss (SMAL) and Feature Representation Alignment Loss (FRAL). With the guidance of consensus, it optimizes view-specific graphs, preserves cross-view topological consistency, promotes the construction of intra-class edges, and realizes effective consensus representation learning with the help of GCN to improve clustering performance. MCFCN demonstrates state-of-the-art performance on eight multi-view benchmark datasets, and its effectiveness is verified by extensive qualitative and quantitative implementations. The code will be provided at https://github.com/texttao/MCFCN.
[61] Compressing Multi-Task Model for Autonomous Driving via Pruning and Knowledge Distillation
Jiayuan Wang,Q. M. Jonathan Wu,Ning Zhang,Katsuya Suto,Lei Zhong
Main category: cs.CV
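TL;DR: This paper proposes a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation for panoptic perception in autonomous driving. The method substantially reduces the parameter count while maintaining high performance.
Details
Motivation: Panoptic perception in autonomous driving (object detection, drivable area segmentation, and lane line segmentation) typically relies on multi-task learning, but the resulting parameter count and complexity make deployment on on-board devices difficult. Contribution: Proposes a compression framework that combines task-aware safe pruning with feature-level knowledge distillation, uses Taylor-based importance and a gradient conflict penalty to handle redundant and conflicting channels, and designs a task head-agnostic knowledge distillation method.
Method: 1. Task-aware safe pruning: integrates Taylor-based importance with a gradient conflict penalty to retain important channels. 2. Task head-agnostic knowledge distillation: transfers intermediate features from the teacher model to guide the student model.
Result: On the BDD100K dataset, the compressed model has 32.7% fewer parameters, with almost no loss in segmentation performance and only a slight drop in detection (-1.2% Recall, -1.8% mAP50), while still running in real time at 32.7 FPS.
Insight: Combining pruning and knowledge distillation is an effective way to compress multi-task models, substantially reducing parameters while maintaining high task performance, and it suits resource-constrained autonomous driving settings.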
Abstract: Autonomous driving systems rely on panoptic perception to jointly handle object detection, drivable area segmentation, and lane line segmentation. Although multi-task learning is an effective way to integrate these tasks, its increasing model parameters and complexity make deployment on on-board devices difficult. To address this challenge, we propose a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation. Our safe pruning strategy integrates Taylor-based channel importance with gradient conflict penalty to keep important channels while removing redundant and conflicting channels. To mitigate performance degradation after pruning, we further design a task head-agnostic distillation method that transfers intermediate backbone and encoder features from a teacher to a student model as guidance. Experiments on the BDD100K dataset demonstrate that our compressed model achieves a 32.7% reduction in parameters while segmentation performance shows negligible accuracy loss and only a minor decrease in detection (-1.2% for Recall and -1.8% for mAP50) compared to the teacher. The compressed model still runs at 32.7 FPS in real-time. These results show that combining pruning and knowledge distillation provides an effective compression solution for multi-task panoptic perception.
[62] M2S2L: Mamba-based Multi-Scale Spatial-temporal Learning for Video Anomaly Detection
Yang Liu,Boan Chen,Xiaoguang Zhu,Jing Liu,Peng Sun,Wei Zhou
Main category: cs.CV
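TL;DR: The paper proposes M2S2L, a Mamba-based multi-scale spatial-temporal learning framework for video anomaly detection. Using hierarchical spatial encoders and multi-temporal encoders together with a feature decomposition mechanism, it balances high accuracy against low computational cost.
Details
Motivation: Video anomaly detection must balance accuracy against computational efficiency on complex video content; existing methods either lack comprehensive spatial-temporal modeling or consume excessive computational resources. Contribution: Proposes the M2S2L framework, which combines multi-scale spatial and temporal encoders, introduces a feature decomposition mechanism, optimizes the appearance and motion reconstruction tasks, and improves anomaly detection.
Method: Uses hierarchical spatial encoders and multi-temporal encoders with a feature decomposition mechanism to achieve multi-scale spatial-temporal modeling and task-specific optimization.
Result: Achieves frame-level AUCs of 98.5%, 92.1%, and 77.9% on three benchmark datasets, with a computational cost of 20.1G FLOPs and an inference speed of 45 FPS.
Insight: Through multi-scale modeling and task-specific decomposition, M2S2L significantly improves anomaly detection in complex scenes while remaining computationally efficient, making it suitable for practical deployment.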
Abstract: Video anomaly detection (VAD) is an essential task in the image processing community with prospects in video surveillance, which faces fundamental challenges in balancing detection accuracy with computational efficiency. As video content becomes increasingly complex with diverse behavioral patterns and contextual scenarios, traditional VAD approaches struggle to provide robust assessment for modern surveillance systems. Existing methods either lack comprehensive spatial-temporal modeling or require excessive computational resources for real-time applications. In this regard, we present a Mamba-based multi-scale spatial-temporal learning (M2S2L) framework in this paper. The proposed method employs hierarchical spatial encoders operating at multiple granularities and multi-temporal encoders capturing motion dynamics across different time scales. We also introduce a feature decomposition mechanism to enable task-specific optimization for appearance and motion reconstruction, facilitating more nuanced behavioral modeling and quality-aware anomaly assessment. Experiments on three benchmark datasets demonstrate that M2S2L framework achieves 98.5%, 92.1%, and 77.9% frame-level AUCs on UCSD Ped2, CUHK Avenue, and ShanghaiTech respectively, while maintaining efficiency with 20.1G FLOPs and 45 FPS inference speed, making it suitable for practical surveillance deployment.
[63] In-Context Adaptation of VLMs for Few-Shot Cell Detection in Optical Microscopy
Shreyan Ganguly,Angona Biswas,Jaydeep Rade,Md Hasibul Hasan Hasib,Nabila Masud,Nitish Singla,Abhipsa Dash,Ushashi Bhattacharjee,Aditya Balu,Anwesha Sarkar,Adarsh Krishnamurthy,Soumik Sarkar
Main category: cs.CV
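TL;DR: The paper explores how in-context learning enables vision-language models (VLMs) to perform well at few-shot object detection in optical microscopy images when large annotated datasets are unavailable. The authors introduce the Micro-OD benchmark and validate a hybrid approach that combines a detection head with a VLM, which significantly improves few-shot performance.
Details
Motivation: Foundation vision-language models (VLMs) excel on natural images, but their use in biomedical microscopy is underexplored. In particular, how to use VLMs for few-shot object detection when annotated data is scarce is a key question. Contribution: 1. Introduces the Micro-OD benchmark, with 252 images and annotations for 11 cell types. 2. Systematically evaluates eight VLMs under few-shot conditions. 3. Designs a hybrid Few-Shot Object Detection (FSOD) method that combines a detection head with a VLM classifier.
Method: 1. Runs experiments on the Micro-OD benchmark. 2. Compares variants with and without implicit test-time reasoning tokens. 3. Develops a hybrid FSOD method that combines a detection head with a VLM classifier.
Result: 1. Zero-shot performance is weak due to the domain gap, but few-shot support significantly improves detection. 2. Variants with reasoning tokens are more effective for end-to-end localization, whereas simpler variants are better suited to classifying pre-localized crops. 3. Performance gains level off after six shots.
Insight: In-context learning is a practical path for object detection in biomedical microscopy images, and the Micro-OD benchmark provides a reproducible testbed for open-vocabulary detection in biomedical imaging.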
Abstract: Foundation vision-language models (VLMs) excel on natural images, but their utility for biomedical microscopy remains underexplored. In this paper, we investigate how in-context learning enables state-of-the-art VLMs to perform few-shot object detection when large annotated datasets are unavailable, as is often the case with microscopic images. We introduce the Micro-OD benchmark, a collection of 252 images curated specifically for in-context learning, with bounding-box annotations spanning 11 cell types across four sources, including two in-lab expert-annotated sets. We systematically evaluate eight VLMs under few-shot conditions and compare variants with and without implicit test-time reasoning tokens. We further implement a hybrid Few-Shot Object Detection (FSOD) pipeline that combines a detection head with a VLM-based few-shot classifier, which enhances the few-shot performance of recent VLMs on our benchmark. Across datasets, we observe that zero-shot performance is weak due to the domain gap; however, few-shot support consistently improves detection, with marginal gains achieved after six shots. We observe that models with reasoning tokens are more effective for end-to-end localization, whereas simpler variants are more suitable for classifying pre-localized crops. Our results highlight in-context adaptation as a practical path for microscopy, and our benchmark provides a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.
[64] Efficient Online Continual Learning in Sensor-Based Human Activity Recognition
Yao Zhang,Souza Leite Clayton,Yu Xiao
Main category: cs.CV
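TL;DR: The paper proposes PTRN-HAR, the first successful application of pre-trained-model-based online continual learning (OCL) to sensor-based human activity recognition (HAR). It pre-trains a feature extractor with contrastive loss, freezes it, and pairs it with a relation module network, significantly reducing training resources and labeled-data requirements and outperforming existing methods in experiments.
Details
Motivation: Sensor-based human activity recognition (HAR) models need to adapt to new activities and variations after deployment, but existing online continual learning methods are computationally intensive and require large amounts of labeled data. Pre-trained models (PTMs) perform well in computer vision, but applying them directly to HAR is challenging because of data heterogeneity and label scarcity. Contribution: Proposes PTRN-HAR, the first PTM-based online continual learning method for HAR, which pre-trains and freezes a feature extractor with contrastive loss and combines it with a relation module network to achieve compute-efficient and data-efficient continual learning.
Method: Pre-trains the feature extractor with contrastive loss and freezes it, and replaces the conventional dense classification layer with a relation module network, significantly reducing training resources and labeled-data requirements.
Result: Experiments on three public datasets show that PTRN-HAR outperforms existing methods in both performance and efficiency.
Insight: PTRN-HAR shows that, in data-scarce and heterogeneous settings, careful pre-training and module design can enable efficient online continual learning, offering a new approach to model adaptation in HAR.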
Abstract: Machine learning models for sensor-based human activity recognition (HAR) are expected to adapt post-deployment to recognize new activities and different ways of performing existing ones. To address this need, Online Continual Learning (OCL) mechanisms have been proposed, allowing models to update their knowledge incrementally as new data become available while preserving previously acquired information. However, existing OCL approaches for sensor-based HAR are computationally intensive and require extensive labeled samples to represent new changes. Recently, pre-trained model-based (PTM-based) OCL approaches have shown significant improvements in performance and efficiency for computer vision applications. These methods achieve strong generalization capabilities by pre-training complex models on large datasets, followed by fine-tuning on downstream tasks for continual learning. However, applying PTM-based OCL approaches to sensor-based HAR poses significant challenges due to the inherent heterogeneity of HAR datasets and the scarcity of labeled data in post-deployment scenarios. This paper introduces PTRN-HAR, the first successful application of PTM-based OCL to sensor-based HAR. Unlike prior PTM-based OCL approaches, PTRN-HAR pre-trains the feature extractor using contrastive loss with a limited amount of data. This extractor is then frozen during the streaming stage. Furthermore, it replaces the conventional dense classification layer with a relation module network. Our design not only significantly reduces the resource consumption required for model training while maintaining high performance, but also improves data efficiency by reducing the amount of labeled data needed for effective continual learning, as demonstrated through experiments on three public datasets, outperforming the state-of-the-art. The code can be found here: https://anonymous.4open.science/r/PTRN-HAR-AF60/
[65] Video Text Preservation with Synthetic Text-Rich Videos
Ziyang Liu,Kevin Valencia,Justin Cui
Main category: cs.CV
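TL;DR: The paper proposes a lightweight method that uses synthetic supervision to improve text rendering in text-to-video (T2V) diffusion models: a text-to-image (T2I) model generates text-rich images, an image-to-video (I2V) model animates them, and the resulting videos are used to fine-tune a pre-trained T2V model.
Details
Motivation: Existing T2V models struggle to generate legible, coherent text and often fail even on short phrases; earlier remedies are computationally expensive and poorly suited to video generation.
Contribution: 1. A lightweight method that uses synthetic data supervision to improve text rendering in T2V models. 2. Evidence that synthetic data and weak supervision are practical for improving textual fidelity in video.
Method: 1. Generate text-rich images with a T2I model. 2. Animate the images into videos with a text-agnostic I2V model. 3. Fine-tune a pre-trained T2V model (Wan2.1) on the resulting synthetic video-prompt pairs.
Result: Short-text legibility and temporal consistency improve, and structural priors for longer text begin to emerge.
Insight: Synthetic data and weak supervision are a practical way to improve text rendering in T2V models without architectural changes.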
Abstract: While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to correctly render even short phrases or words, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improve T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvement in short-text legibility and temporal consistency with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.
[66] Elements of Active Continuous Learning and Uncertainty Self-Awareness: a Narrow Implementation for Face and Facial Expression Recognition
Stanislav Selitskiy
Main category: cs.CV
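TL;DR: The paper proposes an active continuous learning and uncertainty self-awareness mechanism in which a supervising artificial neural network (ANN) monitors the activation patterns of an underlying ANN to judge its uncertainty and the trustworthiness of its predictions.
Details
Motivation: To emulate the human ability to reflect and self-correct by implementing an intelligence-like self-awareness mechanism even in narrow machine learning tasks such as face and facial expression recognition.
Contribution: A supervising ANN architecture that estimates the uncertainty of an underlying CNN ensemble and, when uncertainty is high, triggers an active learning mode that asks for human help.
Method: A supervising ANN monitors the activation patterns of the CNN ensemble, stores its own past performance in a memory region, and adjusts its learnable parameters during training to optimize performance.
Result: The system gains an active learning capability under uncertainty, improving its robustness and the trustworthiness of its predictions.
Insight: High-level self-awareness and active learning can be realized even in narrow tasks, which offers ideas for research on artificial general intelligence.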
Abstract: Reflection on one’s thought process and making corrections to it if there exists dissatisfaction in its performance is, perhaps, one of the essential traits of intelligence. However, such high-level abstract concepts mandatory for Artificial General Intelligence can be modelled even at the low level of narrow Machine Learning algorithms. Here, we present the self-awareness mechanism emulation in the form of a supervising artificial neural network (ANN) observing patterns in activations of another underlying ANN in a search for indications of the high uncertainty of the underlying ANN and, therefore, the trustworthiness of its predictions. The underlying ANN is a convolutional neural network (CNN) ensemble employed for face recognition and facial expression tasks. The self-awareness ANN has a memory region where its past performance information is stored, and its learnable parameters are adjusted during the training to optimize the performance. The trustworthiness verdict triggers the active learning mode, giving elements of agency to the machine learning algorithm that asks for human help in high uncertainty and confusion conditions.
[67] Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps
Yoojin Oh,Junhyug Noh
Main category: cs.CV
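TL;DR: The paper proposes a dual-branch sigmoid architecture that improves the accuracy of CAM methods by addressing the additive logit shift and sign collapse caused by softmax, and reports higher explanation fidelity and localization accuracy.
Details
Motivation: Conventional CAM methods rely on a softmax classifier and suffer from additive logit shifts and sign collapse, which distort importance scores. The paper aims to improve CAM explanations without sacrificing classification performance.
Contribution: 1. A dual-branch sigmoid architecture that decouples localization from classification. 2. Softmax classification accuracy is retained while the sigmoid branch produces more accurate activation maps. 3. Seamless integration with most CAM variants at negligible computational overhead.
Method: Clone the pretrained model's classification head into a parallel sigmoid branch, freeze the original softmax branch, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, softmax handles classification and the sigmoid branch generates the activation maps.
Result: On fine-grained datasets (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K), the method shows higher explanation fidelity and consistent Top-1 localization gains.
Insight: Decoupling classification from localization lets the sigmoid branch preserve the magnitude and sign of feature contributions, which improves CAM accuracy.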
Abstract: Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features. We propose a simple, architecture-agnostic dual-branch sigmoid head that decouples localization from classification. Given any pretrained model, we clone its classification head into a parallel branch ending in per-class sigmoid outputs, freeze the original softmax head, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, softmax retains recognition accuracy, while class evidence maps are generated from the sigmoid branch – preserving both magnitude and sign of feature contributions. Our method integrates seamlessly with most CAM variants and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K) show improved explanation fidelity and consistent Top-1 Localization gains – without any drop in classification accuracy. Code is available at https://github.com/finallyupper/beyond-softmax.
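The additive logit shift named in the abstract can be illustrated with a small sketch. Softmax outputs do not change when the same constant is added to every logit, so scores derived through softmax cannot pin down the absolute level of a logit, whereas per-class sigmoid outputs do change. The function names and toy logits below are illustrative assumptions and are not taken from the paper's code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function (one independent output per class)."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])   # toy class logits (assumed values)
shift = 3.0                            # same constant added to every class

# Softmax ignores the shift; sigmoid does not.
softmax_same = np.allclose(softmax(logits), softmax(logits + shift))
sigmoid_same = np.allclose(sigmoid(logits), sigmoid(logits + shift))

# Sigmoid also keeps the sign of each logit relative to the 0.5 midpoint.
signs_kept = np.array_equal(np.sign(sigmoid(logits) - 0.5), np.sign(logits))
```

Here `softmax_same` is true and `sigmoid_same` is false, which is the property that motivates reading class evidence from a sigmoid branch.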
[68] Google-MedGemma Based Abnormality Detection in Musculoskeletal radiographs
Soumyajit Maity,Pranjal Kamboj,Sneha Maity,Rajat Singh,Sankhadeep Chatterjee
Main category: cs.CV
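TL;DR: The paper proposes a MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs. Unlike conventional approaches, it uses the MedGemma foundation model with a SigLIP-derived vision encoder, followed by a multilayer perceptron for classification.
Details
Motivation: Conventional autoencoder and neural network methods hit performance limits in abnormality detection on musculoskeletal radiographs, whereas MedGemma, a modern medical foundation model, offers stronger feature representations and transfer learning.
Contribution: 1. A MedGemma-based abnormality detection framework. 2. Feature extraction with a SigLIP-derived vision encoder. 3. Validation of MedGemma's strong performance on medical image classification.
Method: 1. Encode X-ray images with the MedGemma vision backbone. 2. Perform binary classification with a lightweight multilayer perceptron. 3. Selectively unfreeze encoder blocks for domain adaptation.
Result: Experiments show that the MedGemma-driven classifier outperforms conventional convolutional and autoencoder-based methods and benefits from stronger transfer learning.
Insight: MedGemma improves representation learning and also supports modular training strategies, such as selective unfreezing, for efficient adaptation to specific medical domains.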
Abstract: This paper proposes a MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs. Departing from conventional autoencoder and neural network pipelines, the proposed method leverages the MedGemma foundation model, incorporating a SigLIP-derived vision encoder pretrained on diverse medical imaging modalities. Preprocessed X-ray images are encoded into high-dimensional embeddings using the MedGemma vision backbone, which are subsequently passed through a lightweight multilayer perceptron for binary classification. Experimental assessment reveals that the MedGemma-driven classifier exhibits strong performance, exceeding conventional convolutional and autoencoder-based metrics. Additionally, the model leverages MedGemma’s transfer learning capabilities, enhancing generalization and optimizing feature engineering. The integration of a modern medical foundation model not only enhances representation learning but also facilitates modular training strategies such as selective encoder block unfreezing for efficient domain adaptation. The findings suggest that MedGemma-powered classification systems can advance clinical radiograph triage by providing scalable and accurate abnormality detection, with potential for broader applications in automated medical image analysis. Keywords: Google MedGemma, MURA, Medical Image, Classification.
[69] In-process 3D Deviation Mapping and Defect Monitoring (3D-DM2) in High Production-rate Robotic Additive Manufacturing
Subash Gautam,Alejandro Vargas-Uscategui,Peter King,Hans Lohr,Alireza Bab-Hadiashar,Ivan Cole,Ehsan Asadi
Main category: cs.CV
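TL;DR: The paper presents 3D-DM2, a real-time monitoring system for 3D deviation mapping and defect monitoring in high production-rate robotic additive manufacturing, intended to safeguard part shape accuracy through in-process detection and intervention.
Details
Motivation: High deposition rate robotic additive manufacturing (HDRRAM) struggles with shape accuracy, and process instabilities in open-loop systems make this worse; real-time monitoring is needed to prevent error propagation and quality problems.
Contribution: A real-time monitoring system that reconstructs the part as it is built and compares it directly with a reference model, enabling in-process shape deviation detection and region tracking.
Method: Acquire and reconstruct the part geometry in real time, compare it directly with a near-net reference model, and detect and segment deviation regions so that intervention can be timely.
Result: The system identifies shape inconsistencies early and tracks deviation regions, opening the way to consistent part quality and less post-processing.
Insight: Real-time monitoring with precise comparison against a reference is key to quality control in high production-rate additive manufacturing and lays the groundwork for closed-loop systems.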
Abstract: Additive manufacturing (AM) is an emerging digital manufacturing technology to produce complex and freeform objects through a layer-wise deposition. High deposition rate robotic AM (HDRRAM) processes, such as cold spray additive manufacturing (CSAM), offer significantly increased build speeds by delivering large volumes of material per unit time. However, maintaining shape accuracy remains a critical challenge, particularly due to process instabilities in current open-loop systems. Detecting these deviations as they occur is essential to prevent error propagation, ensure part quality, and minimize post-processing requirements. This study presents a real-time monitoring system to acquire and reconstruct the growing part and directly compares it with a near-net reference model to detect the shape deviation during the manufacturing process. The early identification of shape inconsistencies, followed by segmenting and tracking each deviation region, paves the way for timely intervention and compensation to achieve consistent part quality.
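The comparison step described above can be sketched as a nearest-neighbor deviation map between the reconstructed part and the reference model. This is a minimal illustration under stated assumptions: both point sets are already expressed in the same coordinate frame, the brute-force distance search and the tolerance `tol` are choices made here, and none of the function names come from the paper.

```python
import numpy as np

def deviation_map(scan, reference):
    """Unsigned distance from each scanned point to its nearest reference point."""
    diff = scan[:, None, :] - reference[None, :, :]   # (n_scan, n_ref, 3)
    return np.linalg.norm(diff, axis=2).min(axis=1)

def flag_deviations(scan, reference, tol):
    """Boolean mask of scanned points that deviate by more than tol."""
    return deviation_map(scan, reference) > tol

# Toy reference: a flat 5x5 grid at z = 0 (assumed geometry).
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0))
reference = np.stack([xs.ravel(), ys.ravel(), np.zeros(25)], axis=1)

# Toy scan: the same grid with a single over-built point.
scan = reference.copy()
scan[12, 2] = 0.8

mask = flag_deviations(scan, reference, tol=0.5)
```

Only the raised point is flagged. A practical system would replace the brute-force search with a spatial index and would group flagged points into regions before tracking them.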
[70] Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation
Ziying Li,Xuequan Lu,Xinkui Zhao,Guanjie Cheng,Shuiguang Deng,Jianwei Yin
Main category: cs.CV
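TL;DR: The paper proposes TraCe, a new text-to-3D generation method that models generation as a Schrödinger Bridge problem to address the over-saturation and over-smoothing caused by existing Score Distillation Sampling (SDS) methods. TraCe uses an explicit diffusion bridge trajectory and a LoRA-adapted model to achieve high-quality 3D generation.
Details
Motivation: Existing optimization-based text-to-3D methods such as SDS rely on pre-trained text-to-image diffusion models but tend to introduce artifacts such as over-saturation and over-smoothing. The paper aims to resolve these by modeling the trajectory directly within the Schrödinger Bridge framework.
Contribution: 1. A theoretical proof that SDS is a simplified instance of the Schrödinger Bridge. 2. The TraCe framework, which explicitly constructs a diffusion bridge trajectory and trains a LoRA-adapted model for higher-quality text-to-3D generation.
Method: TraCe formulates generation as a Schrödinger Bridge problem, explicitly constructs a diffusion bridge from the current rendering to the target distribution, and trains a LoRA-adapted model on this trajectory to optimize the 3D asset.
Result: Experiments show that TraCe outperforms existing techniques in quality and fidelity and achieves high-quality generation with smaller Classifier-free Guidance (CFG) values.
Insight: The Schrödinger Bridge framework gives text-to-3D generation a more direct trajectory optimization method that avoids the artifacts of conventional SDS and improves generation quality.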
Abstract: Recent advancements in optimization-based text-to-3D generation heavily rely on distilling knowledge from pre-trained text-to-image diffusion models using techniques like Score Distillation Sampling (SDS), which often introduce artifacts such as over-saturation and over-smoothing into the generated 3D assets. In this paper, we address this essential problem by formulating the generation process as learning an optimal, direct transport trajectory between the distribution of the current rendering and the desired target distribution, thereby enabling high-quality generation with smaller Classifier-free Guidance (CFG) values. First, we theoretically establish SDS as a simplified instance of the Schrödinger Bridge framework. We prove that SDS employs the reverse process of a Schrödinger Bridge, which, under specific conditions (e.g., a Gaussian noise as one end), collapses to SDS’s score function of the pre-trained diffusion model. Based upon this, we introduce Trajectory-Centric Distillation (TraCe), a novel text-to-3D generation framework, which reformulates the mathematically tractable framework of Schrödinger Bridge to explicitly construct a diffusion bridge from the current rendering to its text-conditioned, denoised target, and trains a LoRA-adapted model on this trajectory’s score dynamics for robust 3D optimization. Comprehensive experiments demonstrate that TraCe consistently achieves superior quality and fidelity compared with state-of-the-art techniques.
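For readers unfamiliar with the CFG value mentioned in the abstract, the standard classifier-free guidance rule combines an unconditional and a text-conditioned noise prediction with a scalar scale. The sketch below shows only that generic rule with toy vectors; it is not TraCe itself, and the variable names are assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional noise prediction
    toward (and, for scale > 1, past) the text-conditioned one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])    # toy unconditional prediction (assumed)
eps_cond = np.array([1.0, 1.0])      # toy text-conditioned prediction (assumed)

at_one = cfg_combine(eps_uncond, eps_cond, 1.0)     # equals eps_cond
at_large = cfg_combine(eps_uncond, eps_cond, 7.5)   # extrapolates well past eps_cond
```

A scale of 1 reproduces the conditional prediction, while larger scales extrapolate beyond it, which is why a method that works at smaller scales stays closer to what the diffusion model actually predicts.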
[71] Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment
Shuaikang Zhu,Yang Yang,Chen Sun
Main category: cs.CV
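TL;DR: The paper proposes a pose-aware multi-level motion parsing framework for action quality assessment (AQA). It combines multi-level motion parsing and condition parsing with a weight-adjust scoring module, and reports state-of-the-art action segmentation and scoring.
Details
Motivation: In high-level competition, subtle differences in action quality are decisive for scoring. Conventional AQA methods do not fully capture spatial-temporal pose variations or the effect of special conditions such as water splash.
Contribution: 1) A multi-level motion parsing framework. 2) An Action-Unit Parser and a Condition Parser. 3) A Weight-Adjust Scoring Module that accommodates diverse action requirements.
Method: 1) The Action-Unit Parser performs action segmentation and builds local-global pose representations. 2) The Motion Parser learns spatial-temporal features to capture pose changes. 3) The Condition Parser handles special conditions. 4) The Weight-Adjust Scoring Module refines the score.
Result: Experiments on large-scale diving datasets show state-of-the-art performance in both action segmentation and action scoring.
Insight: Subtle pose variations and special conditions strongly affect action quality scores, and a multi-level parsing framework improves AQA accuracy.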
Abstract: Human pose serves as a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. On the first level, the Action-Unit Parser is designed with the help of pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. On the second level, the Motion Parser uses spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Meanwhile, conditions unrelated to the body, such as water splash in diving, also affect action scoring. In this work, we design an additional Condition Parser to offer users more flexibility in their choices. Finally, a Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving sports datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.
[72] Personalized Image Editing in Text-to-Image Diffusion Models via Collaborative Direct Preference Optimization
Connor Dunlop,Matthew Zheng,Kavana Venkatesh,Pinar Yanardag
Main category: cs.CV
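TL;DR: Proposes C-DPO, a new method that uses a lightweight graph neural network and a dynamic preference graph to personalize image editing in text-to-image diffusion models according to user-specific preferences.
Details
Motivation: Existing text-to-image diffusion models cannot adapt to individual users' aesthetic preferences, so a method is needed that produces edits personalized to each user.
Contribution: 1. C-DPO, presented as the first framework for personalized image editing. 2. A dynamic preference graph and a lightweight graph neural network for learning user embeddings. 3. A DPO objective that jointly optimizes individual alignment and neighborhood coherence.
Method: Encode each user as a node in a dynamic preference graph, learn user embeddings with a graph neural network, and integrate the embeddings into the DPO objective to optimize personalized editing.
Result: In user studies and quantitative benchmarks, C-DPO consistently outperforms baselines and produces edits better aligned with user preferences.
Insight: Collaborative signals from users with similar preferences improve personalized editing, and a lightweight graph neural network is an effective way to learn user embeddings.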
Abstract: Text-to-image (T2I) diffusion models have made remarkable strides in generating and editing high-fidelity images from text. Yet, these models remain fundamentally generic, failing to adapt to the nuanced aesthetic preferences of individual users. In this work, we present the first framework for personalized image editing in diffusion models, introducing Collaborative Direct Preference Optimization (C-DPO), a novel method that aligns image edits with user-specific preferences while leveraging collaborative signals from like-minded individuals. Our approach encodes each user as a node in a dynamic preference graph and learns embeddings via a lightweight graph neural network, enabling information sharing across users with overlapping visual tastes. We enhance a diffusion model’s editing capabilities by integrating these personalized embeddings into a novel DPO objective, which jointly optimizes for individual alignment and neighborhood coherence. Comprehensive experiments, including user studies and quantitative benchmarks, demonstrate that our method consistently outperforms baselines in generating edits that are aligned with user preferences.
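As background for the DPO objective that C-DPO builds on, the standard per-pair DPO loss can be sketched as follows. This shows only the generic loss for one preferred/rejected pair; the user-embedding conditioning and the neighborhood coherence term described in the abstract are not reproduced, how log-likelihoods are obtained for a diffusion model is not covered, and the argument names and `beta` default are assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# No preference signal yet: the loss is log(2).
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)

# The model already favors the preferred sample more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below log(2) once the model raises the preferred sample relative to the rejected one compared with the reference, which is the behavior any personalized variant would inherit.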
[73] Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
Nicholas Babey,Tiffany Gu,Yiheng Li,Cristian Meo,Kevin Zhu
Main category: cs.CV
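TL;DR: The paper proposes an action recognition method that combines 3D human poses with a predictive vision model, aiming to improve recognition in complex, occluded scenes through spatial understanding rather than statistical pattern recognition.
Details
Motivation: Existing action recognition models rely mainly on RGB video and tend to learn superficial correlations between patterns and labels, so they struggle to capture physical interaction dynamics and human poses in complex scenes.
Contribution: A model architecture that fuses V-JEPA 2's predictive world dynamics with CoMotion's explicit human pose data, making action recognition more robust under occlusion.
Method: The model fuses V-JEPA 2's contextual, predictive representation with CoMotion's occlusion-tolerant pose data to ground action recognition in physical space.
Result: On the InHARD and UCF-19-Y-OCC benchmarks, the model outperforms three baselines, especially in complex, occluded scenes.
Insight: Action recognition should be grounded in spatial understanding rather than statistical pattern recognition, which matters for the development of future intelligent agents.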
Abstract: For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2’s contextual, predictive world dynamics and CoMotion’s explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially within complex, occlusive scenes. Our findings emphasize a need for action recognition to be supported by spatial understanding instead of statistical pattern recognition.
[74] Registration-Free Monitoring of Unstructured Point Cloud Data via Intrinsic Geometrical Properties
Mariafrancesca Patalano,Giovanna Capizzi,Kamran Paynabar
Main category: cs.CV
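TL;DR: Proposes a point cloud monitoring method that needs neither registration nor mesh reconstruction; it learns features from intrinsic geometric properties of the shape (captured via the Laplacian and geodesic distances) and uses thresholding to select the most sensitive monitoring features.
Details
Motivation: Conventional point cloud monitoring requires preprocessing such as registration and mesh reconstruction, which is error-prone and time-consuming. The paper removes these steps and monitors directly from intrinsic geometric properties.
Contribution: A registration-free, mesh-free point cloud monitoring framework, feature learning methods based on the Laplacian and geodesic distances, and thresholding-based selection of monitoring features.
Method: 1. Extract intrinsic geometric features of the shape using the Laplacian and geodesic distances. 2. Provide two alternative feature learning methods. 3. Apply thresholding to select the features most sensitive to out-of-control conditions.
Result: Numerical experiments and case studies show that the method identifies different types of defects effectively.
Insight: Intrinsic geometric properties can serve as the core features for point cloud monitoring, avoiding the complexity of conventional preprocessing while improving efficiency and accuracy.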
Abstract: Modern sensing technologies have enabled the collection of unstructured point cloud data (PCD) of varying sizes, which are used to monitor the geometric accuracy of 3D objects. PCD are widely applied in advanced manufacturing processes, including additive, subtractive, and hybrid manufacturing. To ensure the consistency of analysis and avoid false alarms, preprocessing steps such as registration and mesh reconstruction are commonly applied prior to monitoring. However, these steps are error-prone, time-consuming and may introduce artifacts, potentially affecting monitoring outcomes. In this paper, we present a novel registration-free approach for monitoring PCD of complex shapes, eliminating the need for both registration and mesh reconstruction. Our proposal consists of two alternative feature learning methods and a common monitoring scheme. Feature learning methods leverage intrinsic geometric properties of the shape, captured via the Laplacian and geodesic distances. In the monitoring scheme, thresholding techniques are used to further select intrinsic features most indicative of potential out-of-control conditions. Numerical experiments and case studies highlight the effectiveness of the proposed approach in identifying different types of defects.
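Why Laplacian-based features remove the need for registration can be seen in a small sketch: the spectrum of a graph Laplacian built from pairwise distances does not change when the point cloud is rotated, translated, or reordered. The Gaussian-weighted graph Laplacian, the bandwidth `sigma`, and the number of eigenvalues `k` below are assumptions chosen for illustration; the paper's actual Laplacian construction and its geodesic features are not reproduced.

```python
import numpy as np

def laplacian_spectrum(points, sigma=1.0, k=6):
    """Smallest k eigenvalues of a Gaussian-weighted graph Laplacian."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))   # edge weights from pairwise distances
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w       # L = D - W
    return np.linalg.eigvalsh(lap)[:k]     # eigvalsh returns ascending order

rng = np.random.default_rng(0)
cloud = rng.normal(size=(40, 3))           # toy unstructured point cloud

c, s = np.cos(0.7), np.sin(0.7)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
moved = cloud @ rot.T + np.array([5.0, -2.0, 1.0])   # rigid motion
moved = moved[rng.permutation(len(moved))]            # and reordered points

same_spectrum = np.allclose(laplacian_spectrum(cloud),
                            laplacian_spectrum(moved), atol=1e-8)
```

Because the two spectra agree, features derived from them can be compared across parts without first aligning the scans.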
[75] Culture in Action: Evaluating Text-to-Image Models through Social Activities
Sina Malakouti,Boqing Gong,Adriana Kovashka
Main category: cs.CV
TL;DR: CULTIVate is a benchmark introduced to evaluate text-to-image models on cross-cultural activities, revealing systematic biases favoring global north countries over the global south.
Details
Motivation: Existing cultural benchmarks for text-to-image models focus on object-centric categories, overlooking social activities that better reflect cultural norms. There is a lack of metrics for measuring cultural faithfulness.
Contribution: Proposes CULTIVate, a benchmark spanning 16 countries with 576 prompts and 19,000+ images, and introduces four metrics for cultural alignment, hallucination, exaggeration, and diversity.
Method: Uses an explainable descriptor-based evaluation framework across cultural dimensions (background, attire, objects, interactions) and compares human judgments with the proposed metrics.
Result: Reveals systematic disparities favoring global north countries and distinct failure modes in T2I systems.
Insight: Social activities are more reflective of cultural norms than object-centric categories, and current models struggle with underrepresented regions.
Abstract: Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.
[76] VMDT: Decoding the Trustworthiness of Video Foundation Models
Yujin Potter,Zhun Wang,Nicholas Crispino,Kyle Montgomery,Alexander Xiong,Ethan Y. Chang,Francesco Pinto,Yuqi Chen,Rahul Gupta,Morteza Ziyadi,Christos Christodoulopoulos,Bo Li,Chenguang Wang,Dawn Song
Main category: cs.CV
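TL;DR: The paper introduces VMDT, the first unified platform for benchmarking the trustworthiness of text-to-video (T2V) and video-to-text (V2T) models across five dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Open-source T2V models commonly generate harmful content and show unfairness; in V2T models, unfairness and privacy risks rise with scale, while overall performance remains low.
Details
Motivation: As foundation models grow more sophisticated, their trustworthiness matters more, yet the video modality lacks comprehensive trustworthiness benchmarks; a unified platform is needed to fill the gap.
Contribution: VMDT, the first unified platform for evaluating the trustworthiness of video foundation models, together with an evaluation of 7 T2V and 19 V2T models that reveals how trustworthiness relates to model scale and where current models fall short.
Method: The VMDT platform covers five trustworthiness dimensions and evaluates T2V and V2T models systematically with quantitative metrics.
Result: All evaluated open-source T2V models fail to recognize harmful queries and often generate harmful videos, and they are more unfair than image-modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve. Safety shows no correlation with model size.
Insight: Video foundation models need substantial trustworthiness improvements; safety in particular must be addressed by means other than scale. VMDT provides a systematic evaluation framework for future work.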
Abstract: As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve – though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.
[77] Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
David Acuna,Chao-Han Huck Yang,Yuntian Deng,Jaehun Jung,Ximing Lu,Prithviraj Ammanabrolu,Hyunwoo Kim,Yuan-Hong Liao,Yejin Choi
Main category: cs.CV
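TL;DR: The paper presents a reasoning data generation framework spanning diverse skills and complexity levels, yielding a synthetic dataset of over 1M high-quality vision-centric questions that supports both offline and online RL. The synthesized CoT traces improve model performance on multimodal tasks.
Details
Motivation: Multimodal reasoning research currently depends on undisclosed datasets and proprietary synthesis recipes, and lacks a systematic way to build large-scale visual reasoning datasets.
Contribution: 1. A two-stage (scale, then complexity) reasoning data generation framework. 2. A large-scale dataset that includes preference data and instruction prompts. 3. Evidence that the data transfers across multimodal and cross-modal tasks.
Method: 1. Synthesize data in two stages (scale and complexity). 2. Generate CoT traces with VLMs and reasoning LLMs. 3. Combine SFT and RL to improve model performance.
Result: After fine-tuning on the data, Qwen2.5-VL-7B outperforms all open-data baselines on the evaluated vision-centric benchmarks, surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench, and MMStar-V, and transfers positively to text-only (MMLU-Pro) and audio (MMAU) reasoning.
Insight: 1. SFT on high-quality data with non-linear reasoning traces is essential for effective online RL. 2. Staged offline RL matches online RL's performance while reducing compute. 3. Careful SFT on high-quality data substantially improves out-of-domain, cross-modality transfer.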
Abstract: Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale; and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprising, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL’s performance while reducing compute demands, and (iii) careful SFT on high quality data can substantially improve out-of-domain, cross-modality transfer.
[78] Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical study on SAM2 Finetuning from Data Perspective
Xing Yao,Ahana Gangopadhyay,Hsi-Ming Chang,Ravi Soni
Main category: cs.CV
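TL;DR: The paper studies, from a data perspective, how to fine-tune SAM2 for ultrasound video segmentation; it analyzes the effect of training-set size, video duration, and augmentation schemes, and proposes ultrasound-specific augmentations.
Details
Motivation: Ultrasound video segmentation is challenging because of strong dataset variability, motion artifacts, and limited annotated data. SAM2 performs well on general segmentation, but its performance drops substantially in medical imaging. Existing work focuses on architectural modifications and lacks a systematic analysis of data characteristics and training regimes.
Contribution: 1) A data-centric study of SAM2 fine-tuning. 2) An analysis of how training-set size, video duration, and augmentation schemes affect performance. 3) Six ultrasound-specific augmentations. 4) Evidence that joint training balances modality alignment and task specialization.
Method: Evaluate five SAM2 variants and multiple prompting modes under three paradigms (task-specific fine-tuning, intermediate adaptation, and multi-task joint training). Design ultrasound-specific augmentations and compare them with generic strategies.
Result: Experiments show that data scale and temporal context matter more than model architecture or initialization. Joint training offers an efficient compromise between modality alignment and task specialization.
Insight: 1) Data characteristics are crucial for adapting SAM2 to medical imaging. 2) Joint training is an effective strategy for efficient adaptation. 3) Ultrasound-specific augmentation can improve performance.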
Abstract: Ultrasound (US) video segmentation remains a challenging problem due to strong inter- and intra-dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero-shot and prompt-guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data-centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training-set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task-specific fine-tuning, intermediate adaptation, and multi-task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound-specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis.
[79] Sign language recognition from skeletal data using graph and recurrent neural networks
B. Mederos,J. Mejía,A. Medina-Reyes,Y. Espinosa-Almeyda,J. D. Díaz-Roman,I. Rodríguez-Mederos,M. Mejía-Carreon,F. Gonzalez-Lopez
Main category: cs.CV
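TL;DR: The paper proposes a method that combines skeleton data with a graph neural network and a recurrent neural network to recognize isolated sign language gestures, achieving high accuracy on the AUTSL dataset.
Details
Motivation: Video-based methods for sign language recognition face high computational cost and limited accuracy. Using skeleton data reduces the computational burden and makes the model more efficient.
Contribution: A Graph-GRU temporal network that models spatial and temporal dependencies jointly to classify sign language gestures accurately.
Method: Skeleton-based pose data is processed by a graph neural network that captures spatial relations, combined with a gated recurrent unit (GRU) that models temporal dynamics. The input is the skeletal keypoints extracted from video sequences.
Result: High accuracy on the AUTSL dataset, supporting the combination of graph structure with temporal modeling.
Insight: Pose-driven methods are promising for sign language recognition, especially when spatial and temporal modeling are combined.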
Abstract: This work presents an approach for recognizing isolated sign language gestures using skeleton-based pose data extracted from video sequences. A Graph-GRU temporal network is proposed to model both spatial and temporal dependencies between frames, enabling accurate classification. The model is trained and evaluated on the AUTSL (Ankara University Turkish Sign Language) dataset, achieving high accuracy. Experimental results demonstrate the effectiveness of integrating graph-based spatial representations with temporal modeling, providing a scalable framework for sign language recognition. The results of this approach highlight the potential of pose-driven methods for sign language understanding.
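The general idea of pairing a graph layer with a GRU can be sketched in plain NumPy: each frame's keypoints pass through one graph convolution over the skeleton, the joint features are pooled, and a GRU carries the result across frames. Everything concrete here is an assumption made for illustration (the layer sizes, mean pooling over joints, the omitted biases, the GRU gate convention, and random untrained weights); the abstract does not specify the paper's architecture at this level.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a skeleton graph."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, p):
    """One GRU update (biases omitted for brevity)."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])            # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])            # reset gate
    h_new = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])  # candidate state
    return (1.0 - z) * h + z * h_new

def graph_gru_logits(frames, adj, p):
    """frames: (T, joints, in_dim) keypoints -> class logits."""
    a_hat = normalize_adjacency(adj)
    h = np.zeros(p["Uz"].shape[0])
    for x in frames:
        spatial = np.maximum(a_hat @ x @ p["Wg"], 0.0)  # graph conv + ReLU
        h = gru_step(spatial.mean(axis=0), h, p)        # pool joints, then GRU
    return h @ p["Wout"]

rng = np.random.default_rng(0)
T, J, IN, G, H, C = 4, 3, 2, 4, 5, 3   # toy sizes (assumed)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-joint chain
shapes = {"Wg": (IN, G), "Wz": (G, H), "Wr": (G, H), "Wh": (G, H),
          "Uz": (H, H), "Ur": (H, H), "Uh": (H, H), "Wout": (H, C)}
params = {k: rng.normal(scale=0.5, size=s) for k, s in shapes.items()}
logits = graph_gru_logits(rng.normal(size=(T, J, IN)), adj, params)
```

The output is one logit per class for the whole clip. Working on a few keypoints per frame rather than full images is what keeps the per-frame cost small.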
[80] TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation
Lalit Maurya,Honghai Liu,Reyer Zwiggelaar
Main category: cs.CV
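TL;DR: TCSA-UDA is a text-driven cross-semantic alignment framework that uses domain-invariant textual class descriptions to guide visual representation learning for unsupervised domain adaptation in medical image segmentation.
Details
Motivation: Unsupervised domain adaptation for medical image segmentation is difficult, especially under domain shift between imaging modalities such as CT and MRI. The potential of vision-language representation learning for this task remains underexplored.
Contribution: 1. The TCSA-UDA framework for text-driven cross-semantic alignment. 2. A vision-language covariance cosine loss that aligns image encoder features directly with inter-class textual semantic relations. 3. A prototype alignment module that improves cross-modal consistency.
Method: 1. Domain-invariant textual class descriptions. 2. A vision-language covariance cosine loss. 3. A prototype alignment module.
Result: On cardiac, abdominal, and brain tumor segmentation tasks, TCSA-UDA significantly reduces domain shift and outperforms state-of-the-art UDA methods.
Insight: By integrating language-driven semantics, TCSA-UDA offers a new paradigm for domain adaptation in medical image analysis.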
Abstract: Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.
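One plausible reading of aligning image features with inter-class textual relations is to compare two class-by-class relation matrices, one from visual class prototypes and one from text embeddings, using cosine similarity. The sketch below implements that reading only; the abstract does not give the paper's exact loss, so the function names, the mean-centering step, and the toy data are all assumptions.

```python
import numpy as np

def relation_matrix(class_vectors):
    """Cosine similarities between mean-centered class vectors, shape (C, C)."""
    centered = class_vectors - class_vectors.mean(axis=0, keepdims=True)
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    return normed @ normed.T

def relation_cosine_loss(visual_protos, text_embeds):
    """1 - cosine similarity between the flattened inter-class relation matrices."""
    rv = relation_matrix(visual_protos).ravel()
    rt = relation_matrix(text_embeds).ravel()
    return 1.0 - rv @ rt / (np.linalg.norm(rv) * np.linalg.norm(rt))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 6))                 # 4 classes, 6-d text embeddings
q, _ = np.linalg.qr(rng.normal(size=(8, 6)))   # (8, 6) with orthonormal columns
aligned_visual = text @ q.T                    # same class relations, now in 8-d
random_visual = rng.normal(size=(4, 8))        # unrelated class prototypes

loss_aligned = relation_cosine_loss(aligned_visual, text)
loss_random = relation_cosine_loss(random_visual, text)
```

Because only class-to-class relations are compared, the visual and text embeddings may have different dimensions, and the loss is near zero whenever the visual prototypes reproduce the textual class relations.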
[81] Position-Prior-Guided Network for System Matrix Super-Resolution in Magnetic Particle Imaging
Xuqing Geng,Lei Su,Zhongwei Bian,Zewen Sun,Jiaxuan Wen,Jie Tian,Yang Du
Main category: cs.CV
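TL;DR: The paper proposes a position-prior-guided deep learning method for system matrix super-resolution in magnetic particle imaging (MPI), aiming to speed up calibration by exploiting physical symmetry priors.
Details
Motivation: Calibrating the MPI system matrix (SM) is time-consuming and must be repeated whenever system parameters change; existing deep learning methods do not fully exploit the SM's physical symmetry priors.
Contribution: The paper integrates positional priors into SM super-resolution frameworks and validates their benefit theoretically and experimentally.
Method: Deep learning with symmetric positional priors for super-resolution of 2D and 3D SMs.
Result: Experiments with 2D and 3D SM super-resolution methods confirm the benefit of incorporating positional priors.
Insight: Appropriate use of physical prior knowledge can improve deep learning methods in medical imaging.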
Abstract: Magnetic Particle Imaging (MPI) is a novel medical imaging modality. One of the established methods for MPI reconstruction is based on the System Matrix (SM). However, the calibration of the SM is often time-consuming and requires repeated measurements whenever the system parameters change. Current methodologies utilize deep learning-based super-resolution (SR) techniques to expedite SM calibration; nevertheless, these strategies do not fully exploit physical prior knowledge associated with the SM, such as symmetric positional priors. Consequently, we integrated positional priors into existing frameworks for SM calibration. Underpinned by theoretical justification, we empirically validated the efficacy of incorporating positional priors through experiments involving both 2D and 3D SM SR methods.
[82] LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting
Yuchen Su,Zhineng Chen,Yongkun Du,Zuxuan Wu,Hongtao Xie,Yu-Gang Jiang
Main category: cs.CV
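TL;DR: LRANet++ proposes a parameterized text-shape method based on low-rank approximation, combined with a triple assignment detection head and a lightweight recognition branch, for efficient and accurate spotting of arbitrary-shaped text.
Details
Motivation: Existing end-to-end text spotting methods are bottlenecked on arbitrary-shaped text by the lack of an efficient and reliable detection method. The authors optimize the shape representation through low-rank approximation to improve detection accuracy and efficiency.
Contribution: 1. A data-driven low-rank approximation method that learns a compact shape representation from labeled text boundaries; 2. A triple assignment detection head that balances training stability and inference speed; 3. An end-to-end framework, LRANet++, that performs strongly on several benchmarks.
Method: 1. A robust recovery method based on the ℓ₁-norm extracts key orthogonal vectors from noisy annotations; 2. A triple assignment detection head is introduced (deep sparse branch, ultra-lightweight sparse branch, and dense branch); 3. A lightweight recognition module completes the framework.
Result: On several challenging datasets, LRANet++ outperforms existing methods while maintaining efficient inference.
Insight: Low-rank approximation effectively captures the inherent correlation among text shapes, while the triple assignment design addresses the speed-accuracy trade-off through joint training.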
Abstract: End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains largely unsolved. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape method based on low-rank approximation for precise detection and a triple assignment detection head to enable fast inference. Specifically, unlike other shape representation methods that employ data-irrelevant parameterization, our data-driven approach derives a low-rank subspace directly from labeled text boundaries. To ensure this process is robust against the inherent annotation noise in this data, we utilize a specialized recovery method based on an $\ell_1$-norm formulation, which accurately reconstructs the text shape with only a few key orthogonal vectors. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. Next, the triple assignment scheme introduces a novel architecture where a deep sparse branch (for stabilized training) is used to guide the learning of an ultra-lightweight sparse branch (for accelerated inference), while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on several challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code will be available at: https://github.com/ychensu/LRANet-PP.git
[83] Hilbert-Guided Block-Sparse Local Attention
Yunge Li,Lanyu Xu
Main category: cs.CV
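TL;DR: This paper proposes a Hilbert-curve-based local attention method that reorders image tokens and forms block-sparse attention windows, significantly improving the efficiency of 2D local attention.
Details
Motivation: The quadratic compute and memory cost of global self-attention limits its use on high-resolution images. Conventional local attention lowers complexity but often fails to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence.
Contribution: A Hilbert-curve-based method for constructing windows and neighborhoods that significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention.
Method: Image tokens are reordered along a Hilbert curve, windows and neighborhoods are formed on the reordered 1D sequence, and block-sparse kernels accelerate the resulting local attention.
Result: Experiments show that Hilbert Window Attention and Hilbert Slide Attention achieve speedups of about 4× and 18×, respectively, with minimal accuracy loss.
Insight: The Hilbert curve is highly effective at increasing block sparsity, and combining it with block-sparse kernels is a general and practical way to optimize 2D local attention.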
Abstract: The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention can accelerate window attention and slide attention by about $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to enhancing the efficiency of 2D local attention for images. The code is available at https://github.com/Yunge6666/Hilbert-Local-Attention.
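The reordering step described in this abstract can be illustrated with a minimal sketch. It assumes a square token grid whose side is a power of two and uses the standard coordinate-to-distance Hilbert mapping; the function names (`hilbert_index`, `hilbert_order`, `windows`) are hypothetical and this is not the authors' implementation, which additionally relies on block-sparse attention kernels.

```python
def hilbert_index(n, x, y):
    """Distance of grid cell (x, y) along the Hilbert curve on an n x n grid.

    Assumes n is a power of two (an assumption of this sketch).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """Return all (x, y) token positions sorted along the Hilbert curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))


def windows(seq, size):
    """Split the reordered 1D sequence into contiguous windows."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


if __name__ == "__main__":
    order = hilbert_order(4)
    for win in windows(order, 4):
        print(win)
```

On a 4 x 4 grid, each contiguous window of four tokens covers a compact 2 x 2 patch, which is the locality property that lets windows on the 1D sequence stand in for 2D neighborhoods.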
[84] TYrPPG: Uncomplicated and Enhanced Learning Capability rPPG for Remote Heart Rate Estimation
Taixi Chen,Yiu-ming Cheung
Main category: cs.CV
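TL;DR: This paper proposes TYrPPG, a new rPPG algorithm built on the Mambaout module for remote heart rate estimation. With a novel gated video understanding block (GVB) and a comprehensive supervised loss function (CSL), TYrPPG reaches state-of-the-art performance.
Details
Motivation: Existing rPPG models are usually transformer-based and computationally inefficient. The Mamba model is efficient on NLP tasks, but its SSM module may be unnecessary for vision tasks. The authors therefore explore an rPPG method based on the Mambaout module.
Contribution: 1. Proposes the TYrPPG algorithm, which combines the Mambaout structure with 2D/3D-CNNs and designs the GVB to strengthen video analysis; 2. Proposes the CSL loss function and its weakly supervised variants to improve the model's learning capability.
Method: 1. A GVB designed on the Mambaout structure that fuses 2D and 3D CNNs; 2. The CSL loss function to optimize training.
Result: Experiments show that TYrPPG reaches state-of-the-art performance on mainstream datasets, supporting its advantages for remote heart rate estimation.
Insight: The Mambaout structure shows potential for vision tasks, particularly efficient video analysis; combining 2D/3D-CNNs with a tailored loss function can further improve rPPG performance.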
Abstract: Remote photoplethysmography (rPPG) extracts physiological signals from RGB video and offers advantages for heart rate detection such as low cost and non-invasiveness. Existing rPPG models are usually based on transformer modules, which have low computational efficiency. Recently, the Mamba model has garnered increasing attention due to its efficient performance in natural language processing tasks, demonstrating potential as a substitute for transformer-based algorithms. However, the Mambaout model and its variants show that the SSM module, which is the core component of the Mamba model, is unnecessary for vision tasks. Therefore, we aim to demonstrate the feasibility of using a Mambaout-based module to remotely estimate heart rate. Specifically, we propose a novel rPPG algorithm called uncomplicated and enhanced learning capability rPPG (TYrPPG). This paper introduces an innovative gated video understanding block (GVB) designed for efficient analysis of RGB videos. Based on the Mambaout structure, this block integrates 2D-CNN and 3D-CNN to enhance video understanding. In addition, we propose a comprehensive supervised loss function (CSL) to improve the model’s learning capability, along with its weakly supervised variants. The experiments show that TYrPPG achieves state-of-the-art performance on commonly used datasets, indicating its prospects and superiority in remote heart rate estimation. The source code is available at https://github.com/Taixi-CHEN/TYrPPG.
[85] Understanding Cross Task Generalization in Handwriting-Based Alzheimer’s Screening via Vision Language Adaptation
Changqing Gong,Huafeng Qin,Mounim A. El-Yacoubi
Main category: cs.CV
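TL;DR: The paper proposes a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP for handwriting-based Alzheimer's disease screening, and shows experimentally how task type and writing pattern affect diagnostic performance.
Details
Motivation: Early detection of Alzheimer's disease (AD) is critical, and handwriting analysis is a promising tool because it is non-invasive and low-cost. Existing studies rely mostly on online trajectories and hand-crafted features and have not systematically examined how task type affects diagnostic performance and cross-task generalization. Large-scale vision-language models perform well on other medical modalities but remain underexplored for handwriting analysis.
Contribution: 1. Proposes the CLFA framework, which uses multi-level fusion adapters to align CLIP representations with handwriting-specific medical cues. 2. Systematically studies cross-task generalization, revealing how effective different task types and writing patterns are for AD diagnosis. 3. Provides diagnostic insights and a benchmark.
Method: CLFA implants multi-level fusion adapters in the CLIP visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Cross-task generalization is studied by training on a specific handwriting task and evaluating on unseen ones.
Result: Experiments show that CLFA effectively distinguishes the handwriting of AD and non-AD subjects and reveals which task types and writing patterns carry the most diagnostic value.
Insight: 1. Task type significantly affects AD diagnostic performance, and some writing patterns better reflect early AD symptoms. 2. Large-scale vision-language models can be transferred successfully to handwriting analysis, opening new directions for medical diagnosis.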
Abstract: Alzheimer’s disease is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting, often disrupted in prodromal AD, provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision language models have demonstrated remarkable zero- or few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization (training on a specific handwriting task and evaluating on unseen ones) to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
[86] CGCE: Classifier-Guided Concept Erasure in Generative Models
Viet Nguyen,Vishal M. Patel
Main category: cs.CV
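TL;DR: Proposes Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play concept erasure framework that uses lightweight classifiers to detect and refine text embeddings, enabling multi-concept erasure while preserving generation quality.
Details
Motivation: Advances in generative models raise safety concerns. Existing concept erasure methods are vulnerable to adversarial attacks and sacrifice generation quality to ensure safety. This paper aims to resolve that trade-off.
Contribution: Proposes the CGCE framework, which introduces classifier-guided refinement of text embeddings to achieve efficient and robust concept erasure without affecting the generative model's original performance.
Method: A lightweight classifier detects harmful concepts in text embeddings, and those embeddings are refined to erase multiple concepts without modifying model weights.
Result: CGCE is robust against a variety of adversarial attacks and significantly outperforms existing methods while preserving generation quality.
Insight: Classifier-guided embedding refinement is an effective plug-and-play approach that applies broadly to T2I and T2V models, offering a practical solution for the safety of generative models.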
Abstract: Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model’s generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model’s original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
[87] Light-Field Dataset for Disparity Based Depth Estimation
Suresh Nehra,Aupendu Kar,Jayanta Mukhopadhyay,Prabir Kumar Biswas
Main category: cs.CV
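TL;DR: This paper introduces a publicly available light field image dataset to support the design and testing of disparity-based depth estimation algorithms, and analyzes how the focal position of a light field camera affects disparity.
Details
Motivation: Suitable light field datasets for designing and testing disparity-based depth estimation algorithms are lacking, so the authors build and release a dataset containing real and synthetic light field images.
Contribution: A dataset of 285 real and 13 synthetic light field images, together with real and synthetic stereo light field datasets created with a mechanical gantry system and Blender.
Method: Real images are captured with a Lytro Illum light field camera and synthetic data is generated with Blender; the effect of the camera's focal position on disparity is also analyzed.
Result: The dataset is publicly available; the paper demonstrates its usefulness for designing depth estimation algorithms and points out the limitations of existing light field datasets.
Insight: Light field dataset design needs to account for the effect of focal position on disparity, and combining real and synthetic data helps test algorithm robustness.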
Abstract: A Light Field (LF) camera consists of an additional two-dimensional array of micro-lenses placed between the main lens and sensor, compared to a conventional camera. The sensor pixels under each micro-lens receive light from a sub-aperture of the main lens. This enables the image sensor to capture both spatial information and the angular resolution of a scene point. This additional angular information is used to estimate the depth of a 3-D scene. The continuum of virtual viewpoints in light field data enables efficient depth estimation using Epipolar Line Images (EPIs) with robust occlusion handling. However, the trade-off between angular information and spatial information is very critical and depends on the focal position of the camera. To design, develop, implement, and test novel disparity-based light field depth estimation algorithms, the availability of suitable light field image datasets is essential. In this paper, a publicly available light field image dataset is introduced and thoroughly described. We have also demonstrated the effect of focal position on the disparity of a 3-D point as well as the shortcomings of the currently available light field dataset. The proposed dataset contains 285 light field images captured using a Lytro Illum LF camera and 13 synthetic LF images. The proposed dataset also comprises a synthetic dataset with similar disparity characteristics to those of a real light field camera. A real and synthetic stereo light field dataset is also created by using a mechanical gantry system and Blender. The dataset is available at https://github.com/aupendu/light-field-dataset.
[88] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering
Jian Zhu,Xin Zou,Jun Sun,Cheng Luo,Lei Liu,Lingfang Zeng,Ning Zhang,Bian Wu,Chang Tang,Lirong Dai
Main category: cs.CV
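TL;DR: MoEGCL proposes a new multi-view clustering method that achieves more effective representation learning through fine-grained ego-graph fusion and contrastive learning.
Details
Motivation: Existing multi-view clustering methods typically use coarse-grained, view-level graph fusion, which limits performance. MoEGCL aims to improve on this with fine-grained, sample-level fusion.
Contribution: Proposes the MoEGF module, which uses a Mixture-of-Experts network to fuse ego graphs at the sample level, and designs the EGCL module, which strengthens representation consistency through contrastive learning.
Method: Combines two modules, MoEGF (fine-grained ego-graph fusion) and EGCL (ego-graph contrastive learning), to optimize representation learning for multi-view clustering.
Result: Experiments show that MoEGCL achieves state-of-the-art performance on multi-view clustering tasks.
Insight: Sample-level fine-grained graph fusion and alignment of multi-view representations are key to improving multi-view clustering performance.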
Abstract: In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.
[89] Towards Frequency-Adaptive Learning for SAR Despeckling
Ziqing Ma,Chang Yang,Zhichang Guo,Yao Li
Main category: cs.CV
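TL;DR: SAR-FAH proposes a frequency-adaptive SAR despeckling method that uses wavelet decomposition to split the image into frequency sub-bands and designs dedicated sub-networks for each, improving edge and texture preservation.
Details
Motivation: The speckle noise inherent in SAR images limits their use in high-precision applications. Existing deep learning methods cannot adapt their processing to different spatial physical characteristics, which leads to artifacts and texture distortion.
Contribution: SAR-FAH is a despeckling model based on a frequency-adaptive heterogeneous network; through wavelet decomposition and dedicated sub-network design, it improves both noise suppression and structure preservation.
Method: 1. Wavelet decomposition splits the image into frequency sub-bands; 2. A neural-ODE-based dynamic system handles the low-frequency part; 3. An enhanced U-Net handles the high-frequency part.
Result: Experiments on synthetic and real SAR images validate the superior performance of SAR-FAH in noise removal and structure preservation.
Insight: Designing dedicated networks for different frequency sub-bands can significantly improve SAR despeckling, and frequency adaptivity is key to resolving texture distortion.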
Abstract: Synthetic Aperture Radar (SAR) images are inherently corrupted by speckle noise, limiting their utility in high-precision applications. While deep learning methods have shown promise in SAR despeckling, most methods employ a single unified network to process the entire image, failing to account for the distinct speckle statistics associated with different spatial physical characteristics. This often leads to artifacts, blurred edges, and texture distortion. To address these issues, we propose SAR-FAH, a frequency-adaptive heterogeneous despeckling model based on a divide-and-conquer architecture. First, wavelet decomposition is used to separate the image into frequency sub-bands carrying different intrinsic characteristics. Motivated by their differing noise characteristics, we design specialized sub-networks for different frequency components. This tailored approach leverages statistical variations across frequencies, improving edge and texture preservation while suppressing noise. Specifically, for the low-frequency part, denoising is formulated as a continuous dynamic system via neural ordinary differential equations, ensuring structural fidelity and sufficient smoothness that prevents artifacts. For high-frequency sub-bands rich in edges and textures, we introduce an enhanced U-Net with deformable convolutions for noise suppression and feature enhancement. Extensive experiments on synthetic and real SAR images validate the superior performance of the proposed model in noise removal and structural preservation.
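To make the first step concrete, the following is a minimal sketch of a one-level 2D wavelet decomposition into one low-frequency and three high-frequency sub-bands. It assumes the orthonormal Haar wavelet, which the abstract does not specify, and the function name `haar_dwt2` is hypothetical; the paper's sub-networks (neural ODE, enhanced U-Net) are not modeled here.

```python
def haar_dwt2(img):
    """One-level 2D Haar decomposition of a list-of-lists image.

    Assumes even height and width. Returns (ll, lh, hl, hh), where ll is the
    low-frequency approximation and the other three hold detail coefficients.
    The Haar wavelet is an assumption of this sketch, not the paper's choice.
    """
    h, w = len(img), len(img[0])
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even dimensions"
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_ll.append((a + b + d + e) / 2)  # local average (low frequency)
            row_lh.append((a + b - d - e) / 2)  # difference between rows
            row_hl.append((a - b + d - e) / 2)  # difference between columns
            row_hh.append((a - b - d + e) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


if __name__ == "__main__":
    ll, lh, hl, hh = haar_dwt2([[1, 2], [3, 4]])
    print(ll, lh, hl, hh)
```

In a divide-and-conquer design like the one described, `ll` would go to the low-frequency branch and the three detail bands to the high-frequency branch; because the transform is orthonormal, the total signal energy is preserved across the four bands.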
[90] Hybrid second-order gradient histogram based global low-rank sparse regression for robust face recognition
Hongxia Li,Ying Ji,Yongxin Dong,Yuehua Feng
Main category: cs.CV
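TL;DR: This paper proposes a Hybrid Second-Order Gradient Histogram based Global Low-Rank Sparse Regression model (H2H-GLRSR) for robust face recognition; a new feature descriptor and a global low-rank constraint significantly improve performance under occlusion, illumination changes, and other challenging conditions.
Details
Motivation: Low-rank sparse regression models are widely used in face recognition but still struggle under complex occlusion and illumination variation. The paper aims to improve robustness through a more effective feature descriptor and a global constraint.
Contribution: 1) Proposes a new Hybrid Second-Order Gradient Histogram (H2H) feature descriptor that better characterizes the local structure of facial images; 2) Combines H2H with Sparse Regularized Nuclear Norm based Matrix Regression (SR_NMR); 3) Introduces a global low-rank constraint that better models the global correlations of structured noise.
Method: 1) Design the H2H feature descriptor; 2) Integrate it with the SR_NMR model; 3) Impose a global low-rank constraint on the residual matrix. Together these steps let the model capture both local structural features and global noise correlations.
Result: Experiments show that the method significantly outperforms existing regression-based classification methods under occlusion, illumination changes, and unconstrained environments.
Insight: Combining a local feature descriptor with a global constraint effectively improves the robustness of face recognition models in complex scenarios, especially under occlusion and illumination changes.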
Abstract: Low-rank sparse regression models have been widely applied in the field of face recognition. To further address the challenges caused by complex occlusions and illumination variations, this paper proposes a Hybrid Second-Order Gradient Histogram based Global Low-Rank Sparse Regression (H2H-GLRSR) model. Specifically, a novel feature descriptor called the Hybrid Second-Order Gradient Histogram (H2H) is first designed to more effectively characterize the local structural features of facial images. Then, this descriptor is integrated with the Sparse Regularized Nuclear Norm based Matrix Regression (SR_NMR). Moreover, a global low-rank constraint is imposed on the residual matrix, enabling the model to better capture the global correlations inherent in structured noise. Experimental results demonstrate that the proposed method significantly outperforms existing regression-based classification approaches under challenging scenarios involving occlusions, illumination changes, and unconstrained environments.
[91] Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning
Fei Yu,Quan Deng,Shengeng Tang,Yuehua Li,Lechao Cheng
Main category: cs.CV
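TL;DR: The paper proposes an open-world 3D scene graph generation method combined with retrieval-augmented reasoning; it supports multimodal exploration and language-guided interaction and generalizes well across diverse tasks.
Details
Motivation: To address the limitations of closed-vocabulary supervision and static annotations in open-world 3D scene understanding, and to make scene understanding more generalizable and interactive.
Contribution: A unified framework for open-world 3D scene graph generation with retrieval-augmented reasoning that supports language and multimodal interaction.
Method: 1. A dynamic scene graph generation module (no fixed label set); 2. A retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support queries.
Result: Strong performance on the 3DSSG and Replica benchmarks across tasks including scene question answering and visual grounding.
Insight: Combining open-vocabulary perception with retrieval-based reasoning is an effective way to scale 3D scene understanding.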
Abstract: Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. The framework comprises two key components: (1) a dynamic scene graph generation module that detects objects and infers semantic relationships without fixed label sets, and (2) a retrieval-augmented reasoning pipeline that encodes scene graphs into a vector database to support text/image-conditioned queries. We evaluate our method on 3DSSG and Replica benchmarks across four tasks (scene question answering, visual grounding, instance retrieval, and task planning), demonstrating robust generalization and superior performance in diverse environments. Our results highlight the effectiveness of combining open-vocabulary perception with retrieval-based reasoning for scalable 3D scene understanding.
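The second component, encoding scene graphs into a vector store and answering queries by similarity, can be sketched as follows. This is a toy illustration under loud assumptions: the bag-of-words `embed` function stands in for a real vision-language embedding, the in-memory list stands in for a vector database, and the triples and all names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a learned VLM embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical scene graph as (subject, relation, object) triples.
TRIPLES = [
    ("mug", "on", "table"),
    ("chair", "next to", "table"),
    ("lamp", "above", "sofa"),
]

# "Vector database": each triple stored alongside its embedding.
INDEX = [(t, embed(" ".join(t))) for t in TRIPLES]


def retrieve(query, k=1):
    """Return the k triples most similar to a text query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [t for t, _ in ranked[:k]]


if __name__ == "__main__":
    print(retrieve("what is on the table"))
```

The retrieved triples would then be passed to a language model as context for question answering or planning; that reasoning step is outside this sketch.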
[92] Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Qiming Li,Zekai Ye,Xiaocheng Feng,Weihong Zhong,Weitao Ma,Xiachong Feng
Main category: cs.CV
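TL;DR: The paper proposes the FCCT framework, which systematically quantifies causal effects on visual object perception and reveals the roles of MHSA and FFN in cross-modal information processing; building on this, it proposes the IRI technique, which significantly improves perception and mitigates hallucination.
Details
Motivation: Despite the remarkable progress of large vision-language models (LVLMs), their mechanistic interpretability remains underexplored. Existing analyses are not comprehensive and lack a systematic study covering visual and textual tokens, model components, and all layers, which limits actionable insights for improving output faithfulness and downstream tasks such as hallucination mitigation.
Contribution: 1) Proposes the FCCT framework for systematic analysis of cross-modal causal effects; 2) Reveals the key roles of MHSA and FFN in cross-modal information processing; 3) Proposes the IRI technique, which enhances perception by intervening on representations at specific components and layers.
Method: The FCCT framework comprehensively analyzes visual and textual tokens, MHSA, FFN, and hidden states. The IRI technique strengthens information flow by intervening on cross-modal representations at specific layers.
Result: Experiments show that IRI achieves state-of-the-art performance across five benchmarks and multiple LVLMs while preserving inference speed and other foundational performance.
Insight: MHSAs of the last token in middle layers play a key role, while FFNs show a three-stage hierarchical progression in storing and transferring visual object representations.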
Abstract: Despite the remarkable advancements of Large Vision-Language Models (LVLMs), their mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce the Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. Consistent improvements across five widely used benchmarks and LVLMs demonstrate that IRI achieves state-of-the-art performance while preserving inference speed and other foundational performance.
[93] CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework
Jiaxuan Li,Qing Xu,Xiangjian He,Ziyu Liu,Chang Xing,Zhen Chen,Daokun Zhang,Rong Qu,Chang Wen Chen
Main category: cs.CV
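TL;DR: CoMA improves the training efficiency and downstream adaptability of MAE through a complementary masking strategy and a hierarchical dynamic multi-window self-attention mechanism.
Details
Motivation: MAE and its variants use random masking and generally need more pre-training epochs, and ViT uses parameters inefficiently at a fixed spatial resolution, which limits pre-training efficiency and adaptability.
Contribution: 1. Proposes a complementary masking strategy (CoMA) that ensures uniform sampling over all pixels and improves feature learning; 2. Introduces a hierarchical dynamic ViT (DyViT) with Dynamic Multi-Window Self-Attention (DM-MSA), reducing parameters and compute while improving fine-grained feature learning.
Method: 1. CoMA covers all pixels through complementary masks; 2. DyViT uses DM-MSA to adjust window sizes dynamically and processes features at different scales hierarchically.
Result: Pre-trained on ImageNet-1K, DyViT matches MAE's downstream performance using only 12% of the pre-training epochs, with a 10% reduction in pre-training time per epoch.
Insight: Complementary masking and the dynamic multi-window mechanism significantly improve pre-training efficiency and model adaptability, offering new directions for masking strategies and architecture design in self-supervised learning.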
Abstract: Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pretraining efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to maintain adaptability. Meanwhile, ViT in MAE suffers from inefficient parameter use due to fixed spatial resolution across layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy to ensure uniform sampling across all pixels, thereby improving effective learning of all features and enhancing the model’s adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that employs a Dynamic Multi-Window Self-Attention (DM-MSA), significantly reducing the parameters and FLOPs while improving fine-grained feature learning. Pre-trained on ImageNet-1K with CoMA, DyViT matches the downstream performance of MAE using only 12% of the pre-training epochs, demonstrating more effective learning. It also attains a 10% reduction in pre-training time per epoch, further underscoring its superior pre-training efficiency.
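One plausible reading of "complementary masking" is that the patches of an image are partitioned into disjoint visible subsets, so that across the resulting views every patch is visible exactly once. The sketch below illustrates that reading only; the partitioning scheme, the number of views, and the name `complementary_masks` are assumptions, and the paper's actual procedure may differ.

```python
import random


def complementary_masks(num_patches, num_views, seed=0):
    """Build num_views boolean masks (True = masked) over num_patches patches.

    The visible sets of the views are disjoint and together cover every
    patch, so each patch is visible in exactly one view. This is a sketch
    of one possible complementary scheme, not the paper's implementation.
    """
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    masks = []
    for v in range(num_views):
        visible = set(idx[v::num_views])  # every num_views-th shuffled patch
        masks.append([i not in visible for i in range(num_patches)])
    return masks


if __name__ == "__main__":
    # 16 patches, 4 views: each view keeps 4 patches visible (75% masked).
    for m in complementary_masks(16, 4):
        print("".join("#" if masked else "." for masked in m))
```

With four views the mask ratio per view is 75%, matching the ratio commonly used by MAE, while the union of visible patches covers the whole image; plain random masking gives no such coverage guarantee.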
[94] AD-DAE: Unsupervised Modeling of Longitudinal Alzheimer’s Disease Progression with Diffusion Auto-Encoder
Ayantika Das,Arunima Sarkar,Keerthi Ram,Mohanasankar Sivaprakasam
Main category: cs.CV
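TL;DR: The paper proposes AD-DAE, an unsupervised generative model that uses a diffusion auto-encoder framework to generate longitudinal Alzheimer's disease progression images from a baseline image without subject-specific longitudinal supervision.
Details
Motivation: When existing generative models learn longitudinal disease progression without supervision, the constraints they place on distribution learning leave the latent space with limited controllability. AD-DAE aims to provide a more controllable latent representation.
Contribution: AD-DAE introduces a conditionable diffusion auto-encoder framework whose explicit encoding mechanism forms a compact latent space, from which progression-related information is disentangled, enabling unsupervised and controllable image generation.
Method: Using a diffusion auto-encoder, AD-DAE defines a subspace of the latent space that isolates progression-related factors. Controllability is imposed by restricting shifts to this subspace, and the shifts are implicitly tied to disease progression attributes.
Result: On Alzheimer's disease datasets from two different sources, AD-DAE is validated through image quality metrics, volumetric progression analysis, and downstream classification.
Insight: AD-DAE's unsupervised approach offers a new direction for longitudinal disease modeling; its compact latent space helps disentangle progression-related information while preserving subject identity.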
Abstract: Generative modeling frameworks have emerged as an effective approach to capture high-dimensional image distributions from large datasets without requiring domain-specific knowledge, a capability essential for longitudinal disease progression modeling. Recent generative modeling approaches have attempted to capture progression by mapping images into a latent representational space and then controlling and guiding the representations to generate follow-up images from a baseline image. However, existing approaches impose constraints on distribution learning, leading to latent spaces with limited controllability to generate follow-up images without explicit supervision from subject-specific longitudinal images. In order to enable controlled movements in the latent representational space and generate progression images from a baseline image in an unsupervised manner, we introduce a conditionable Diffusion Auto-encoder framework. The explicit encoding mechanism of image-diffusion auto-encoders forms a compact latent space capturing high-level semantics, providing means to disentangle information relevant for progression. Our approach leverages this latent space to condition and apply controlled shifts to baseline representations for generating follow-up. Controllability is induced by restricting these shifts to a subspace, thereby isolating progression-related factors from subject identity-preserving components. The shifts are implicitly guided by correlating with progression attributes, without requiring subject-specific longitudinal supervision. We validate the generations through image quality metrics, volumetric progression analysis, and downstream classification in Alzheimer’s disease datasets from two different sources and disease categories. This demonstrates the effectiveness of our approach for Alzheimer’s progression modeling and longitudinal image generation.
[95] Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation
Lin Li,Chuhan Zhang,Dong Zhang,Chong Sun,Chen Li,Long Chen
Main category: cs.CV
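TL;DR: The paper proposes ACC, an end-to-end open-vocabulary scene graph generation framework centered on interaction-driven knowledge infusion and transfer, to address the weak interaction modeling of existing methods.
Details
Motivation: Existing open-vocabulary scene graph generation (OVSGG) methods lack explicit interaction modeling, which causes noisy pseudo-supervision during knowledge infusion and ambiguous query matching during knowledge transfer.
Contribution: Proposes the ACC framework, which significantly improves OVSGG performance through a bidirectional interaction prompt (knowledge infusion) and interaction-guided query selection with knowledge distillation (knowledge transfer).
Method: 1) A bidirectional interaction prompt generates robust pseudo-supervision; 2) Interaction-guided query selection prioritizes pairing interacting objects; 3) Interaction-consistent knowledge distillation strengthens robustness.
Result: State-of-the-art performance on three benchmark datasets.
Insight: An interaction-centric paradigm effectively reduces noise and ambiguous matching, improving the model's potential for real-world applications.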
Abstract: Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) infusing knowledge into large-scale models via pre-training on large datasets; 2) transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) in an interaction-driven paradigm to minimize these mismatches. For interaction-centric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model’s interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.
[96] Global Multiple Extraction Network for Low-Resolution Facial Expression Recognition
Jingyi Shi
Main category: cs.CV
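TL;DR: The paper proposes GME-Net, a new network for low-resolution facial expression recognition that uses a hybrid attention module and a multi-scale global feature extraction module to address missing detail and weak global modeling in low-resolution images.
Details
Motivation: Facial expression recognition performs poorly on low-resolution images because they lack detail and existing methods have weak global modeling.
Contribution: Proposes GME-Net, which combines hybrid attention-based local feature extraction with multi-scale global feature extraction to improve expression recognition on low-resolution images.
Method: 1) A hybrid attention-based local feature extraction module; 2) A multi-scale global feature extraction module with a quasi-symmetric structure.
Result: GME-Net outperforms existing methods on multiple datasets.
Insight: Extracting both local detail and global features is key to improving low-resolution expression recognition.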
Abstract: Facial expression recognition, as a vital computer vision task, is garnering significant attention and undergoing extensive research. Although facial expression recognition algorithms demonstrate impressive performance on high-resolution images, their effectiveness tends to degrade when confronted with low-resolution images. We attribute this to two factors: 1) low-resolution images lack detailed information; 2) current methods have weak global modeling, which makes it difficult to extract discriminative features. To alleviate these issues, we propose a novel global multiple extraction network (GME-Net) for low-resolution facial expression recognition, which incorporates 1) a hybrid attention-based local feature extraction module with attention similarity knowledge distillation to learn image details from a high-resolution network; 2) a multi-scale global feature extraction module with a quasi-symmetric structure to mitigate the influence of local image noise and facilitate capturing global image features. As a result, our GME-Net is capable of extracting expression-related discriminative features. Extensive experiments conducted on several widely used datasets demonstrate that the proposed GME-Net recognizes low-resolution facial expressions better and outperforms existing solutions.
[97] Reperio-rPPG: Relational Temporal Graph Neural Networks for Periodicity Learning in Remote Physiological Measurement
Ba-Thinh Nguyen,Thach-Ha Ngoc Pham,Hoang-Long Duc Nguyen,Thi-Duyen Ngo,Thanh-Ha Le
Main category: cs.CV
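TL;DR: Reperio-rPPG is a new rPPG framework that combines Relational Convolutional Networks with a Graph Transformer to capture the periodicity of physiological signals, uses a tailored CutMix augmentation to improve generalization, and performs strongly on several datasets.
Details
Motivation: Existing rPPG methods model the periodic nature of physiological signals insufficiently, which limits their performance in complex real-world conditions.
Contribution: Proposes the Reperio-rPPG framework, which integrates Relational Convolutional Networks with a Graph Transformer to strengthen periodicity modeling, and introduces a tailored CutMix augmentation to improve generalization.
Method: Combines Relational Convolutional Networks with a Graph Transformer to capture periodicity; uses CutMix to increase data diversity.
Result: Achieves state-of-the-art results on PURE, UBFC-rPPG, and MMPD and remains robust under varied motion and illumination conditions.
Insight: Periodicity modeling and data augmentation are both important for rPPG.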
Abstract: Remote photoplethysmography (rPPG) is an emerging contactless physiological sensing technique that leverages subtle color variations in facial videos to estimate vital signs such as heart rate and respiratory rate. This non-invasive method has gained traction across diverse domains, including telemedicine, affective computing, driver fatigue detection, and health monitoring, owing to its scalability and convenience. Despite significant progress in remote physiological signal measurement, a crucial characteristic - the intrinsic periodicity - has often been underexplored or insufficiently modeled in previous approaches, limiting their ability to capture fine-grained temporal dynamics under real-world conditions. To bridge this gap, we propose Reperio-rPPG, a novel framework that strategically integrates Relational Convolutional Networks with a Graph Transformer to effectively capture the periodic structure inherent in physiological signals. Additionally, recognizing the limited diversity of existing rPPG datasets, we further introduce a tailored CutMix augmentation to enhance the model’s generalizability. Extensive experiments conducted on three widely used benchmark datasets - PURE, UBFC-rPPG, and MMPD - demonstrate that Reperio-rPPG not only achieves state-of-the-art performance but also exhibits remarkable robustness under various motion (e.g., stationary, rotation, talking, walking) and illumination conditions (e.g., nature, low LED, high LED). The code is publicly available at https://github.com/deconasser/Reperio-rPPG.
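To make the role of periodicity concrete, the sketch below estimates a heart rate from a synthetic pulse trace by picking the strongest frequency of a plain discrete Fourier transform inside a plausible heart-rate band. It is only an illustration of the signal property the abstract refers to, not the Reperio-rPPG model; the synthetic trace, the 30 fps sampling rate, the band limits, and the function name `dominant_bpm` are all assumptions.

```python
import cmath
import math


def dominant_bpm(signal, fps, lo_bpm=40.0, hi_bpm=180.0):
    """Return the strongest frequency (beats per minute) inside a heart-rate band."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC offset
    best_bpm, best_power = None, -1.0
    for k in range(1, n // 2):
        bpm = 60.0 * k * fps / n  # frequency of DFT bin k, in beats per minute
        if not (lo_bpm <= bpm <= hi_bpm):
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm


# Hypothetical 10-second trace sampled at 30 fps with a 72 bpm pulse.
fps = 30.0
true_bpm = 72.0
trace = [math.sin(2 * math.pi * (true_bpm / 60.0) * t / fps) for t in range(300)]
estimate = dominant_bpm(trace, fps)
print(estimate)
```

Real facial-video traces are far noisier than this sine wave, which is one reason a learned model of periodic structure may help where a simple spectral peak fails.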
[98] U(PM)$^2$:Unsupervised polygon matching with pre-trained models for challenging stereo images
Chang Li,Xingtao Peng
Main category: cs.CV
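TL;DR: The paper proposes U(PM)$^2$, a training-free, low-cost unsupervised polygon matching method that combines pre-trained models with handcrafted features to handle scale variation, disparity discontinuity, and topology inconsistency in stereo polygon matching.
Details
Motivation: Stereo image matching is a fundamental task in computer vision, photogrammetry, and remote sensing, yet polygon matching is almost unexplored and faces challenges in disparity discontinuity, scale variation, training requirements, and generalization.
Contribution: Proposes U(PM)$^2$, a low-cost unsupervised polygon matching framework that addresses these challenges by combining pre-trained models with handcrafted features.
Method: The pipeline has four stages: 1) a detector obtains masks with the pre-trained SAM model; 2) a vectorizer converts the masks into polygons and a graphic structure; 3) a global matcher uses a bidirectional-pyramid strategy with pre-trained LoFTR to handle viewpoint and scale changes; 4) a local matcher uses the Hungarian algorithm to handle local disparity discontinuity and topology inconsistency.
Result: On ScanNet and SceneFlow, U(PM)$^2$ achieves state-of-the-art accuracy at competitive speed and generalizes well without any training.
Insight: Combining pre-trained models with an unsupervised pipeline can solve complex polygon matching without training, which is attractive for low-resource applications.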
Abstract: Stereo image matching is a fundamental task in computer vision, photogrammetry and remote sensing, but there is an almost unexplored field, i.e., polygon matching, which faces the following challenges: disparity discontinuity, scale variation, training requirement, and generalization. To address the above-mentioned issues, this paper proposes a novel U(PM)$^2$: low-cost unsupervised polygon matching with pre-trained models by uniting automatically learned and handcrafted features, whose pipeline is as follows: first, the detector leverages the pre-trained segment anything model to obtain masks; second, the vectorizer converts the masks to polygons and graphic structure; third, the global matcher addresses challenges from global viewpoint changes and scale variation based on a bidirectional-pyramid strategy with pre-trained LoFTR; finally, the local matcher further overcomes local disparity discontinuity and topology inconsistency of polygon matching by a local-joint geometry and multi-feature matching strategy with the Hungarian algorithm. We benchmark our U(PM)$^2$ on the ScanNet and SceneFlow datasets using our proposed new metric, achieving state-of-the-art accuracy at a competitive speed and satisfactory generalization performance at low cost without any training requirement.
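The local matching step amounts to a minimum-cost one-to-one assignment between polygons in the two views. The sketch below shows that problem on a hypothetical 3x3 cost matrix, solved by exhaustive search as a stand-in: the Hungarian algorithm (available, for example, as `scipy.optimize.linear_sum_assignment`) reaches the same optimum in polynomial time. The cost values and the function name `best_assignment` are assumptions, and the paper's actual cost combines several geometric and feature terms not modeled here.

```python
from itertools import permutations


def best_assignment(cost):
    """Exhaustive minimum-cost one-to-one assignment for a small square cost matrix.

    cost[i][j] is the dissimilarity between left polygon i and right polygon j.
    Returns (assignment, total), where assignment[i] is the column matched to row i.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical costs. Picking the cheapest column row by row would match
# row 0 to column 0 and force row 1 onto an expensive match.
cost = [
    [0.1, 0.2, 0.9],
    [0.1, 0.8, 0.9],
    [0.9, 0.3, 0.2],
]
matches, total = best_assignment(cost)
print(matches, round(total, 3))
```

The example is chosen so that a greedy row-by-row choice is suboptimal, which is the usual argument for solving the assignment jointly.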
[99] CSGaze: Context-aware Social Gaze Prediction
Surbhi Madan,Shreya Ghosh,Ramanathan Subramanian,Abhinav Dhall,Tom Gedeon
Main category: cs.CV
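TL;DR: CSGaze is a multimodal approach that uses contextual, scene, and facial information to predict the social gaze patterns of multiple people in conversation. It performs competitively on several datasets, and its attention mechanism adds interpretability.
Details
Motivation: Social gaze patterns are important for understanding attention and social engagement, but existing methods often ignore contextual information. CSGaze combines multimodal data to improve gaze prediction accuracy.
Contribution: 1. Proposes CSGaze, a multimodal social gaze prediction method that combines facial, scene, and contextual cues; 2. Introduces a fine-grained attention mechanism centered on the principal speaker; 3. Demonstrates generalization on open-set datasets.
Method: CSGaze takes multimodal input (facial and scene information) and uses an attention mechanism centered on the principal speaker to better model social gaze dynamics.
Result: Competitive with state-of-the-art methods on GP-Static, UCO-LAEO, and AVA-LAEO, with attention scores providing initial explainability of the model's decisions.
Insight: Contextual cues improve social gaze prediction, and the attention mechanism both helps performance and makes the model more interpretable.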
Abstract: A person’s gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model’s decision-making process. We also demonstrate our model’s generalizability by testing it on open-set datasets, demonstrating its robustness across diverse scenarios.
[100] Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration
Zhixin Cheng,Xiaotian Yin,Jiacheng Deng,Bohao Liao,Yujia Chen,Xu Zhou,Baoqun Yin,Tianzhu Zhang
Main category: cs.CV
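TL;DR: This paper proposes a new cross-modal registration framework with Iterative Agents Selection (IAS) and Reliable Agents Interaction (RAI) modules that improves the robustness and accuracy of image-to-point cloud registration.
Details
Motivation: Existing detection-free methods handle noise and cross-modal feature selection poorly, which leads to incorrect correspondences. The paper addresses this with agent selection and interaction mechanisms.
Contribution: 1) Designs the IAS module, which selects reliable agents using phase maps and reinforcement learning principles; 2) introduces the RAI module, which uses these agents to guide cross-modal interaction and reduce mismatches.
Method: IAS enhances structural feature awareness, and RAI refines cross-modal interaction.
Result: Outperforms prior methods on the RGB-D Scenes v2 and 7-Scenes benchmarks.
Insight: Agent selection and interaction are an effective route to better cross-modal registration.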
Abstract: Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
[101] Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory
Yuxuan Lin,Hanjing Yan,Xuan Tong,Yang Chang,Huanzhen Wang,Ziheng Zhou,Shuyong Gao,Yan Wang,Wenqiang Zhang
Main category: cs.CV
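TL;DR: The paper proposes CIF, a new few-shot unsupervised multimodal industrial anomaly detection method that uses hypergraphs to capture the structural commonality of training samples and a memory bank to store it, improving anomaly detection in few-shot settings.
Details
Motivation: In few-shot industrial scenarios, the few training samples cannot cover the diverse patterns seen at test time, so a method is needed that extracts structural commonality from a small number of samples.
Contribution: 1. Proposes CIF, which uses hypergraphs to capture the structural commonality of training samples; 2. Designs a semantic-aware hypergraph construction module and a training-free hypergraph message passing module; 3. Introduces a hyperedge-guided memory search module to reduce the false positive rate.
Method: 1. Hypergraphs model higher-order correlations to extract structural commonality; 2. a memory bank stores the structural prior; 3. hypergraph message passing reduces the distribution gap between test features and memory bank features; 4. hyperedge-guided search refines memory retrieval.
Result: On MVTec 3D-AD and Eyecandies, CIF outperforms existing methods in few-shot settings.
Insight: Hypergraphs can model the structural commonality of few-shot data, and combining them with a memory bank and training-free modules improves the robustness and accuracy of anomaly detection.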
Abstract: Few-shot multimodal industrial anomaly detection is a critical yet underexplored task, offering the ability to quickly adapt to complex industrial scenarios. In few-shot settings, insufficient training samples often fail to cover the diverse patterns present in test samples. This challenge can be mitigated by extracting structural commonality from a small number of training samples. In this paper, we propose a novel few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality, CIF (Commonality In Few). To extract intra-class structural information, we employ hypergraphs, which are capable of modeling higher-order correlations, to capture the structural commonality within training samples, and use a memory bank to store this intra-class structural prior. Firstly, we design a semantic-aware hypergraph construction module tailored for single-semantic industrial images, from which we extract common structures to guide the construction of the memory bank. Secondly, we use a training-free hypergraph message passing module to update the visual features of test samples, reducing the distribution gap between test features and features in the memory bank. We further propose a hyperedge-guided memory search module, which utilizes structural information to assist the memory search process and reduce the false positive rate. Experimental results on the MVTec 3D-AD dataset and the Eyecandies dataset show that our method outperforms the state-of-the-art (SOTA) methods in few-shot settings. Code is available at https://github.com/Sunny5250/CIF.
[102] Adapted Foundation Models for Breast MRI Triaging in Contrast-Enhanced and Non-Contrast Enhanced Protocols
Tri-Thien Nguyen,Lorenz A. Kapsner,Tobias Hepp,Shirin Heidarikahkesh,Hannes Schreiter,Luise Brock,Dominika Skwierawska,Dominique Hadler,Julian Hossbach,Evelyn Wenkel,Sabine Ohlmeyer,Frederik B. Laun,Andrzej Liebert,Andreas Maier,Michael Uder,Sebastian Bickelhaupt
Main category: cs.CV
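TL;DR: The paper evaluates the DINOv2-based Medical Slice Transformer (MST) framework for triaging significant findings (BI-RADS ≥4) in breast MRI under both contrast-enhanced and non-contrast-enhanced protocols.
Details
Motivation: Breast MRI is highly sensitive for cancer detection, but interpretation is time-consuming; AI-assisted pre-screening could improve efficiency.
Contribution: Shows that the MST framework can rule out a share of cases without significant findings at high sensitivity, and that this is feasible under both contrast-enhanced and non-contrast-enhanced protocols.
Method: Uses the DINOv2-based MST model and evaluates four abbreviated protocols (T1sub, DWI1500, DWI1500+T2w, T1sub+T2w) with five-fold cross-validation and AUC analysis.
Result: T1sub+T2w performed best, with an AUC of 0.77 and 19% specificity at 97.5% sensitivity. External validation gave an AUC of 0.77, and 88% of attention maps were rated good or moderate.
Insight: The MST framework shows promise for breast MRI triage under high sensitivity requirements, but further research is needed before clinical implementation.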
Abstract: Background: Magnetic resonance imaging (MRI) has high sensitivity for breast cancer detection, but interpretation is time-consuming. Artificial intelligence may aid in pre-screening. Purpose: To evaluate the DINOv2-based Medical Slice Transformer (MST) for ruling out significant findings (Breast Imaging Reporting and Data System [BI-RADS] >=4) in contrast-enhanced and non-contrast-enhanced abbreviated breast MRI. Materials and Methods: This institutional review board approved retrospective study included 1,847 single-breast MRI examinations (377 BI-RADS >=4) from an in-house dataset and 924 from an external validation dataset (Duke). Four abbreviated protocols were tested: T1-weighted early subtraction (T1sub), diffusion-weighted imaging with b=1500 s/mm2 (DWI1500), DWI1500+T2-weighted (T2w), and T1sub+T2w. Performance was assessed at 90%, 95%, and 97.5% sensitivity using five-fold cross-validation and area under the receiver operating characteristic curve (AUC) analysis. AUC differences were compared with the DeLong test. False negatives were characterized, and attention maps of true positives were rated in the external dataset. Results: A total of 1,448 female patients (mean age, 49 +/- 12 years) were included. T1sub+T2w achieved an AUC of 0.77 +/- 0.04; DWI1500+T2w, 0.74 +/- 0.04 (p=0.15). At 97.5% sensitivity, T1sub+T2w had the highest specificity (19% +/- 7%), followed by DWI1500+T2w (17% +/- 11%). Missed lesions had a mean diameter <10 mm at 95% and 97.5% thresholds for both T1sub and DWI1500, predominantly non-mass enhancements. External validation yielded an AUC of 0.77, with 88% of attention maps rated good or moderate. Conclusion: At 97.5% sensitivity, the MST framework correctly triaged cases without BI-RADS >=4, achieving 19% specificity for contrast-enhanced and 17% for non-contrast-enhanced MRI. Further research is warranted before clinical implementation.
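The abstract reports specificity at fixed sensitivity operating points (90%, 95%, 97.5%). The sketch below shows one common way to obtain such a number: choose the highest score threshold that still flags the required fraction of positive cases, then count how many negative cases fall below it. The scores, labels, and the function name `specificity_at_sensitivity` are hypothetical and are not taken from the study.

```python
import math


def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Return (threshold, sensitivity, specificity) at a target sensitivity.

    A case is flagged for reading when score >= threshold; label 1 marks a
    significant finding and label 0 marks a case that could be ruled out.
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Number of positives that must be flagged; the small epsilon guards
    # against floating-point round-up in the product.
    needed = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[needed - 1]
    sensitivity = sum(s >= threshold for s in pos) / len(pos)
    specificity = sum(s < threshold for s in neg) / len(neg)
    return threshold, sensitivity, specificity


# Hypothetical model scores: 10 positive and 10 negative examinations.
pos_scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.2]
neg_scores = [0.1, 0.3, 0.5, 0.58, 0.62, 0.15, 0.25, 0.35, 0.45, 0.05]
scores = pos_scores + neg_scores
labels = [1] * 10 + [0] * 10
threshold, sensitivity, specificity = specificity_at_sensitivity(scores, labels, 0.9)
print(threshold, sensitivity, specificity)
```

Raising the target sensitivity pushes the threshold down, so fewer negatives fall below it; that trade-off is why specificity at 97.5% sensitivity can be modest even when the AUC is reasonable.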
[103] DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities
Nagur Shareef Shaik,Teja Krishna Cherukuri,Adnan Masood,Dong Hye Ye
Main category: cs.CV
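TL;DR: The paper proposes DiA-gnostic VLVAE, which uses disentangled alignment to handle missing modalities and feature entanglement when fusing medical images with clinical text, enabling robust radiology report generation.
Details
Motivation: Existing methods rely on large language models or static knowledge graphs and struggle with missing modalities and feature entanglement in clinical data.
Contribution: 1. Designs a Mixture-of-Experts based Vision-Language VAE framework; 2. Achieves feature disentanglement and alignment through an orthogonality-constrained objective; 3. Uses a compact LLaMA-X decoder to generate reports efficiently.
Method: A disentangled-alignment VLVAE framework consisting of an MoE module for feature disentanglement, a constrained optimization objective that enforces orthogonality and alignment to prevent suboptimal fusion, and a LLaMA-X decoder for report generation.
Result: BLEU@4 scores of 0.266 on IU X-Ray and 0.134 on MIMIC-CXR, which the authors report as significantly better than state-of-the-art models.
Insight: Feature disentanglement and alignment can cope with incomplete multimodal data and improve the robustness and clinical accuracy of generated reports.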
Abstract: The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context, and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA has achieved competitive BLEU@4 scores of 0.266 and 0.134, respectively. Experimental results show that the proposed method significantly outperforms state-of-the-art models.
[104] A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation
Prateek Singh,Moumita Dholey,P. K. Vinod
Main category: cs.CV
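TL;DR: The paper proposes a dual-mode framework based on a ViT and a diffusion model for breast cancer ultrasound segmentation. An adaptive conditioning bridge for multi-scale feature fusion and a topological consistency loss yield high-accuracy segmentations.
Details
Motivation: Low contrast, speckle noise, and unclear boundaries make precise segmentation of breast ultrasound images difficult. Standard convolutional architectures capture too little global context, which leads to anatomically inconsistent segmentations.
Contribution: Proposes an Adaptive Conditioning Bridge (ACB), a Topological Denoising Consistency (TDC) loss, and a dual-head architecture for high-performance breast ultrasound segmentation.
Method: Combines a ViT encoder with a UNet-based decoder, uses multi-scale feature fusion and the topological consistency loss during training, and relies on the dual-head architecture for fast inference.
Result: Reaches state-of-the-art performance on public datasets, with Dice scores of 0.96 (BUSI), 0.90 (BrEaST), and 0.97 (BUS-UCLM).
Insight: Combining global feature extraction with preservation of local detail, together with a topological consistency constraint, is key to segmentation that is both accurate and anatomically plausible.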
Abstract: In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Even though deep learning models have demonstrated potential, standard convolutional architectures frequently fall short in capturing enough global context, resulting in segmentations that are anatomically inconsistent. To overcome these drawbacks, we suggest a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture that leverages the denoising objective as a powerful regularizer, enabling a lightweight auxiliary head to perform rapid and accurate inference on smaller datasets and a noise prediction head. Our framework establishes a new state-of-the-art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.
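For readers unfamiliar with the metric behind the reported scores, the sketch below computes the Dice coefficient, 2|P ∩ T| / (|P| + |T|), for two small binary masks. The masks and the function name `dice_score` are hypothetical; the convention of returning 1.0 when both masks are empty is one common choice, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as nested lists of 0/1."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    intersection = sum(p * t for p, t in zip(flat_p, flat_t))
    total = sum(flat_p) + sum(flat_t)
    # Two empty masks agree perfectly, so report 1.0 instead of dividing by zero.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 3x4 masks: 4 predicted pixels, 4 true pixels, 3 in common.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
target = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
score = dice_score(pred, target)
print(score)
```

Dice rewards overlap only, so two masks with equal scores can still differ in shape or connectivity, which is the gap a topology-aware loss is meant to address.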
[105] Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds
Xianhui Meng,Yukang Huo,Li Zhang,Liu Liu,Haonan Jiang,Yan Zhong,Pingrui Zhang,Cewu Lu,Jun Liu
Main category: cs.CV
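TL;DR: This paper proposes PPF-Tracker, a point-pair-feature-based framework for articulated object pose tracking on the SE(3) manifold that uses SE(3) invariance and unified kinematic constraints for multi-frame pose tracking.
Details
Motivation: Articulated objects are common in daily life and robotic manipulation, but their kinematic constraints make pose tracking difficult, and the problem remains underexplored.
Contribution: Proposes the PPF-Tracker framework, which performs quasi-canonicalization of point clouds in the SE(3) Lie group space and combines Point Pair Features with kinematic constraints to track articulated object poses.
Method: 1. Quasi-canonicalize point clouds in SE(3) space; 2. use Point Pair Features (PPF) to predict pose voting parameters; 3. add semantic information about joint axes to impose unified kinematic constraints.
Result: Evaluation on synthetic datasets and real-world scenes shows that PPF-Tracker generalizes well and is robust and effective in multi-frame pose tracking.
Insight: Combining SE(3) invariance with kinematic constraints offers a new approach to articulated object pose tracking, with applications in robotics and augmented reality.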
Abstract: Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed \textbf{PPF-Tracker}. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. Codes are available at https://github.com/mengxh20/PPFTracker.
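The invariance the abstract relies on can be checked numerically. The sketch below computes the classic four-dimensional point pair feature (distance between two points, the two angles between each normal and the connecting vector, and the angle between the normals) and confirms that it is unchanged by a rigid motion. This is the textbook feature, used here as an assumption; the exact feature and voting scheme in PPF-Tracker may differ, and the points, normals, and transform are hypothetical.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _angle(a, b):
    """Angle in radians between two non-zero 3D vectors."""
    cos = _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))
    return math.acos(max(-1.0, min(1.0, cos)))


def point_pair_feature(p1, n1, p2, n2):
    """Classic PPF: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)), with d = p2 - p1."""
    d = _sub(p2, p1)
    return (math.sqrt(_dot(d, d)), _angle(n1, d), _angle(n2, d), _angle(n1, n2))


def rotate_z(v, yaw):
    """Rotate a 3D vector about the z axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]


def move_point(p, yaw, t):
    """Apply a rigid motion (rotation about z, then translation) to a point."""
    return [a + b for a, b in zip(rotate_z(p, yaw), t)]


# Hypothetical oriented points; normals are rotated but not translated.
p1, n1 = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]
p2, n2 = [1.0, 2.0, 0.5], [0.0, 1.0, 0.0]
yaw, t = 0.7, [0.3, -1.2, 2.0]

before = point_pair_feature(p1, n1, p2, n2)
after = point_pair_feature(move_point(p1, yaw, t), rotate_z(n1, yaw),
                           move_point(p2, yaw, t), rotate_z(n2, yaw))
max_change = max(abs(a - b) for a, b in zip(before, after))
print(max_change)
```

Because the feature depends only on relative geometry, it describes a part the same way wherever that part sits in the scene, which is what makes it usable for voting on pose.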
[106] MALeR: Improving Compositional Fidelity in Layout-Guided Generation
Shivank Saxena,Dhruv Srivastava,Makarand Tapaswi
Main category: cs.CV
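TL;DR: MALeR is a layout-guided generation method that addresses subjects appearing outside their layouts, attribute leakage, and related problems in multi-subject, multi-attribute scenes, improving the compositional accuracy and consistency of generated images.
Details
Motivation: Text-to-image models struggle with compositional scenes containing multiple subjects and attributes: subjects appear outside their layouts, images drift out of distribution, and attributes leak across subjects, while users need control over subject placement.
Contribution: Proposes MALeR, which keeps generated subjects inside their layouts while staying in-distribution, and a masked, attribute-aware binding mechanism that prevents attribute leakage.
Method: Takes a text prompt with layout information, uses masking to keep each subject within its assigned layout, and applies attribute-aware binding to stop attributes leaking between subjects.
Result: Qualitative and quantitative experiments show that MALeR outperforms prior work in compositional accuracy, generation consistency, and attribute binding in multi-subject, multi-attribute scenes.
Insight: Layout constraints combined with attribute binding address key problems in compositional generation and make complex scene generation more controllable.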
Abstract: Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
[107] How Reasoning Influences Intersectional Biases in Vision Language Models
Adit Desai,Sudipta Roy,Mohna Chakraborty
Main category: cs.CV
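TL;DR: This paper studies how reasoning in vision language models (VLMs) affects their intersectional biases. It analyzes the reasoning patterns of five open-source VLMs on an occupation prediction task and finds that they are systematically linked to social biases.
Details
Motivation: VLM training data often encode social biases, and these biases surface in the models' reasoning and outputs in ways that diverge from human values. The authors analyze VLM reasoning to understand how biases are perpetuated and how they affect downstream tasks.
Contribution: Systematically analyzes the reasoning patterns of five open-source VLMs on an occupation prediction task, shows how they relate to intersectional social biases, and stresses the need to align VLM reasoning with human values before deployment.
Method: Uses the FairFace dataset for occupation prediction over 32 occupations, eliciting both predictions and reasoning under three prompting styles.
Result: Biased reasoning patterns systematically underlie intersectional disparities in the occupation prediction task.
Insight: VLM reasoning rests on statistical associations and lacks sensitivity to social context. Future work should focus on the transparency of model reasoning and on methods for aligning it with human values.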
Abstract: Vision Language Models (VLMs) are increasingly deployed across downstream tasks, yet their training data often encode social biases that surface in outputs. Unlike humans, who interpret images through contextual and social cues, VLMs process them through statistical associations, often leading to reasoning that diverges from human reasoning. By analyzing how a VLM reasons, we can understand how inherent biases are perpetuated and can adversely affect downstream performance. To examine this gap, we systematically analyze social biases in five open-source VLMs for an occupation prediction task, on the FairFace dataset. Across 32 occupations and three different prompting styles, we elicit both predictions and reasoning. Our findings reveal that the biased reasoning patterns systematically underlie intersectional disparities, highlighting the need to align VLM reasoning with human values prior to its downstream deployment.
[108] MiVID: Multi-Strategic Self-Supervision for Video Frame Interpolation using Diffusion Model
Priyansh Srivastava,Romit Chatterjee,Abir Sen,Aradhana Behura,Ratnakar Dash
Main category: cs.CV
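TL;DR: MiVID is a lightweight, self-supervised, diffusion-based framework for video frame interpolation that needs no explicit motion estimation. It combines a 3D U-Net with temporal attention and gives competitive results under low-resource conditions.
Details
Motivation: Classical methods rely on optical flow and learning-based methods assume dense ground truth; both struggle with occlusions, domain shifts, and ambiguous motion. MiVID addresses these challenges with self-supervised learning.
Contribution: Proposes MiVID, a lightweight self-supervised diffusion model that needs no high-frame-rate supervision and learns robust spatiotemporal representations through a hybrid masking strategy and adaptive loss scheduling.
Method: Uses a 3D U-Net with transformer-style temporal attention, trained with cosine-based progressive masking and adaptive loss scheduling.
Result: Evaluated on UCF101-7 and DAVIS-7; after only 50 epochs of training, the results are competitive with several supervised baselines.
Insight: Self-supervised diffusion priors are effective for temporally coherent frame synthesis and offer a scalable path toward general-purpose video frame interpolation.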
Abstract: Video Frame Interpolation (VFI) remains a cornerstone in video enhancement, enabling temporal upscaling for tasks like slow-motion rendering, frame rate conversion, and video restoration. While classical methods rely on optical flow and learning-based models assume access to dense ground-truth, both struggle with occlusions, domain shifts, and ambiguous motion. This article introduces MiVID, a lightweight, self-supervised, diffusion-based framework for video interpolation. Our model eliminates the need for explicit motion estimation by combining a 3D U-Net backbone with transformer-style temporal attention, trained under a hybrid masking regime that simulates occlusions and motion uncertainty. The use of cosine-based progressive masking and adaptive loss scheduling allows our network to learn robust spatiotemporal representations without any high-frame-rate supervision. Our framework is evaluated on UCF101-7 and DAVIS-7 datasets. MiVID is trained entirely on CPU using the datasets and 9-frame video segments, making it a low-resource yet highly effective pipeline. Despite these constraints, our model achieves optimal results at just 50 epochs, competitive with several supervised baselines. This work demonstrates the power of self-supervised diffusion priors for temporally coherent frame synthesis and provides a scalable path toward accessible and generalizable VFI systems.
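The abstract mentions cosine-based progressive masking without giving its form. One plausible reading, sketched below, is a mask ratio that grows from a small value to a larger one along a half-cosine curve over training, so that the task becomes harder gradually. The formula, the ratio bounds `r_min` and `r_max`, and the function name `cosine_mask_ratio` are assumptions for illustration and are not taken from the paper; only the 50-epoch horizon comes from the abstract.

```python
import math


def cosine_mask_ratio(epoch, total_epochs, r_min=0.1, r_max=0.6):
    """Hypothetical progressive schedule: the mask ratio rises from r_min to
    r_max along a half-cosine curve as training proceeds."""
    if total_epochs < 2:
        return r_max
    progress = epoch / (total_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return r_min + 0.5 * (r_max - r_min) * (1.0 - math.cos(math.pi * progress))


# Schedule over the 50 epochs mentioned in the abstract.
schedule = [cosine_mask_ratio(e, 50) for e in range(50)]
print(round(schedule[0], 3), round(schedule[25], 3), round(schedule[-1], 3))
```

A half-cosine changes slowly at both ends and fastest in the middle, which avoids an abrupt jump in difficulty at the start of training.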
[109] Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era
Feng Lu,Tong Jin,Canming Ye,Yunpeng Liu,Xiangyuan Lan,Chun Yuan
Main category: cs.CV
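TL;DR: This paper proposes a visual place recognition method for the transformer era that needs no explicit aggregator. Learnable aggregation tokens perform implicit aggregation, improving both efficiency and performance.
Details
Motivation: Visual place recognition has traditionally relied on an explicit aggregator (e.g., NetVLAD), but this design may no longer be necessary with transformers. The authors propose implicit aggregation, which simplifies the model while improving performance.
Contribution: 1. Proposes implicit aggregation, which produces robust global descriptors using only the backbone; 2. Designs a token insertion strategy and an initialization method derived from empirical studies; 3. Outperforms state-of-the-art methods on several datasets and ranks 1st on the MSLS challenge leaderboard.
Method: 1. Prepend learnable aggregation tokens to the patch tokens before a particular transformer block; 2. let self-attention implicitly aggregate patch token information into them; 3. take the aggregation tokens from the output tokens and concatenate them as the global descriptor.
Result: Outperforms existing methods on several visual place recognition datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard.
Insight: Self-attention in transformers already aggregates information, so a separate explicit aggregator is unnecessary; dropping it simplifies the design and improves performance.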
Abstract: Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
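The mechanism in the abstract (prepend aggregation tokens, let self-attention mix them with patch tokens, then read out only the aggregation tokens) can be sketched in a few lines. The toy below uses a single attention layer with identity query, key, and value projections and fixed token values, so nothing is learned; the final L2 normalization is a common retrieval convention added here as an assumption. The function names and all numbers are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(logits)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)
        outputs.append([sum(w / norm * value[d] for w, value in zip(weights, tokens))
                        for d in range(dim)])
    return outputs


def global_descriptor(patch_tokens, agg_tokens):
    """Prepend aggregation tokens, mix with attention, and keep only their outputs."""
    mixed = self_attention(agg_tokens + patch_tokens)
    descriptor = [x for token in mixed[:len(agg_tokens)] for x in token]
    length = math.sqrt(sum(x * x for x in descriptor)) or 1.0
    return [x / length for x in descriptor]  # L2-normalize (assumed convention)


# Hypothetical values: 2 aggregation tokens and 3 patch tokens of dimension 4.
agg_tokens = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
patch_tokens = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.3, 0.0], [0.2, 0.1, 1.0, 0.4]]
descriptor = global_descriptor(patch_tokens, agg_tokens)
print(len(descriptor))
```

The descriptor length is the number of aggregation tokens times the token dimension, so the output size is set by how many tokens are inserted instead of by a separate aggregation head.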
[110] S2ML: Spatio-Spectral Mutual Learning for Depth Completion
Zihui Zhao,Yifei Zhang,Zheng Wang,Yang Li,Kui Jiang,Zihan Geng,Chia-Wen Lin
Main category: cs.CV
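TL;DR: The paper proposes S2ML, a Spatio-Spectral Mutual Learning framework for depth completion that combines spatial-domain and frequency-domain properties to improve completion accuracy.
Details
Motivation: Existing depth completion methods work only in the image domain and overlook the physical characteristics and frequency-domain information of raw depth images, which limits their use in downstream tasks.
Contribution: Proposes the S2ML framework, which fuses spatial-domain and frequency-domain features through mutual learning, with a dedicated spectral fusion module and a unified embedding space.
Method: Designs a spectral fusion module based on the distinct properties of the amplitude and phase spectra, and computes local and global correlations between spatial-domain and frequency-domain features in a unified embedding space.
Result: On NYU-Depth V2 and SUN RGB-D, S2ML outperforms the state-of-the-art method CFormer by 0.828 dB and 0.834 dB, respectively.
Insight: Frequency-domain information matters for depth completion, and combining spatial and frequency features improves completion accuracy.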
Abstract: The raw depth images captured by RGB-D cameras using Time-of-Flight (TOF) or structured light often suffer from incomplete depth values due to weak reflections, boundary shadows, and artifacts, which limit their applications in downstream vision tasks. Existing methods address this problem through depth completion in the image domain, but they overlook the physical characteristics of raw depth images. It has been observed that the presence of invalid depth areas alters the frequency distribution pattern. In this work, we propose a Spatio-Spectral Mutual Learning framework (S2ML) to harmonize the advantages of both spatial and frequency domains for depth completion. Specifically, we consider the distinct properties of amplitude and phase spectra and devise a dedicated spectral fusion module. Meanwhile, the local and global correlations between spatial-domain and frequency-domain features are calculated in a unified embedding space. The gradual mutual representation and refinement encourage the network to fully explore complementary physical characteristics and priors for more accurate depth completion. Extensive experiments demonstrate the effectiveness of our proposed S2ML method, outperforming the state-of-the-art method CFormer by 0.828 dB and 0.834 dB on the NYU-Depth V2 and SUN RGB-D datasets, respectively.
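As background for the amplitude and phase spectra mentioned above, the sketch below splits a one-dimensional depth row into its two spectra with a plain discrete Fourier transform, rebuilds the row from them, and shows that zeroing a few pixels (standing in for invalid depth) changes the amplitude spectrum. It is a minimal illustration and not the S2ML spectral fusion module; the depth values and function names are hypothetical, and real depth maps would use a 2D FFT.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def split_spectrum(x):
    """Return the amplitude and phase spectra of x."""
    coeffs = dft(x)
    return [abs(c) for c in coeffs], [cmath.phase(c) for c in coeffs]


def merge_spectrum(amplitude, phase):
    """Inverse transform: rebuild the signal from amplitude and phase."""
    coeffs = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


# Hypothetical depth row in metres, and the same row with two invalid (zero) pixels.
full_row = [2.0, 2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 2.4]
holed_row = [2.0, 2.0, 2.1, 0.0, 0.0, 2.3, 2.4, 2.4]

amp_full, pha_full = split_spectrum(full_row)
amp_holed, _ = split_spectrum(holed_row)
rebuilt = merge_spectrum(amp_full, pha_full)
rebuild_error = max(abs(a - b) for a, b in zip(rebuilt, full_row))
print(round(amp_full[0], 3), round(amp_holed[0], 3), rebuild_error < 1e-9)
```

Amplitude and phase together carry the full signal, so the two can be processed by separate branches and recombined without loss, which is the kind of property a spectral fusion design can build on.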
[111] StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video
Zhihui Ke,Yuyang Liu,Xiaobo Zhou,Tie Qiu
Main category: cs.CV
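TL;DR: StreamSTGS is a free-viewpoint video representation designed for real-time streaming. It compresses 3D Gaussian attributes into 2D images and video, supports adaptive bitrate control, and improves quality while sharply reducing storage per frame.
Details
Motivation: Free-viewpoint video methods based on 3D Gaussian Splatting (3DGS) train and render well, but they can need up to 10 MB per frame, which rules out real-time streaming. StreamSTGS targets efficient real-time streaming of free-viewpoint video.
Contribution: 1. Proposes StreamSTGS, a free-viewpoint video representation designed for real-time streaming; 2. Compresses efficiently by encoding 3D Gaussian attributes as 2D images and temporal features as video; 3. Introduces a sliding window scheme and a transformer-guided auxiliary training module to learn local and global motion, respectively.
Method: 1. Represent a dynamic scene with canonical 3D Gaussians, temporal features, and a deformation field; 2. compress Gaussian attributes and temporal features through 2D image and video encoding; 3. aggregate adjacent temporal features in a sliding window to learn local motion, and use a transformer module to learn global motion.
Result: On diverse free-viewpoint video benchmarks, StreamSTGS is competitive on all metrics, raising PSNR by 1 dB on average while reducing the average frame size to 170 KB.
Insight: Image and video encoding of Gaussian attributes gives both compact storage and adaptive bitrate control without extra training, and the sliding window and transformer modules improve motion modeling, making real-time free-viewpoint video streaming practical.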
Abstract: Streaming free-viewpoint video (FVV) in real-time still faces significant challenges, particularly in training, rendering, and transmission efficiency. Harnessing the superior performance of 3D Gaussian Splatting (3DGS), recent 3DGS-based FVV methods have achieved notable breakthroughs in both training and rendering. However, the storage requirements of these methods can reach up to 10 MB per frame, making real-time FVV streaming impossible. To address this problem, we propose a novel FVV representation, dubbed StreamSTGS, designed for real-time streaming. StreamSTGS represents a dynamic scene using canonical 3D Gaussians, temporal features, and a deformation field. For high compression efficiency, we encode canonical Gaussian attributes as 2D images and temporal features as a video. This design not only enables real-time streaming, but also inherently supports adaptive bitrate control based on network condition without any extra training. Moreover, we propose a sliding window scheme to aggregate adjacent temporal features to learn local motions, and then introduce a transformer-guided auxiliary training module to learn global motions. On diverse FVV benchmarks, StreamSTGS demonstrates competitive performance on all metrics compared to state-of-the-art methods. Notably, StreamSTGS increases the PSNR by an average of 1 dB while reducing the average frame size to just 170 KB. The code is publicly available on https://github.com/kkkzh/StreamSTGS.
[112] Neodragon: Mobile Video Generation using Diffusion Transformer
Animesh Karnewar,Denis Korzhenkov,Ioannis Lelekas,Adil Karjauv,Noor Fathima,Hanwen Xiong,Vancheeswaran Vaidyanathan,Will Zeng,Rafael Esteves,Tushar Singhal,Fatih Porikli,Mohsen Ghafoorian,Amirhossein Habibian
Main category: cs.CV
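TL;DR: Neodragon is a diffusion-transformer-based text-to-video system that runs on mobile hardware with efficient, high-fidelity video synthesis. The authors describe it as the first such system specifically optimised for mobile hardware.
Details
Motivation: Existing text-to-video models mostly depend on the cloud and large hardware and cannot run efficiently on mobile devices. Neodragon aims for low-cost, private, on-device video generation through model optimisation and hardware adaptation.
Contribution: 1) Replaces the large T5 text encoder with a small DT5; 2) proposes Asymmetric Decoder Distillation; 3) prunes MMDiT blocks by importance and recovers performance with two-stage distillation; 4) reduces the denoiser's NFE requirement through step distillation.
Method: Combines text-encoder distillation, asymmetric decoder distillation, block pruning, and step distillation to optimise model structure and compute for mobile hardware.
Result: Generates a 2 s video (49 frames at 640x1024) in 6.7 s on a Hexagon NPU, with a VBench total score of 81.61, 3.5 GB peak RAM usage, and 4.945B parameters for the full model.
Insight: With hardware adaptation and model optimisation, high-quality video generation becomes feasible on mobile devices, which broadens access to AI-based content creation.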
Abstract: We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at the 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Differing from existing transformer-based offline text-to-video generation models, Neodragon is the first to have been specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach allowing us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the generation pipeline. (3) Pruning of MMDiT blocks within the denoiser backbone based on their relative importance, with recovery of original performance through a two-stage distillation process. (4) Reducing the NFE (Neural Functional Evaluation) requirement of the denoiser by performing step distillation using DMD adapted for pyramidal flow-matching, thereby substantially accelerating video generation. When paired with an optimised SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, our end-to-end Neodragon system becomes a highly parameter (4.945B full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency) efficient mobile-friendly model, while achieving a VBench total score of 81.61. By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website: https://qualcomm-ai-research.github.io/neodragon
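The headline numbers in the abstract can be cross-checked with simple arithmetic. The sketch below derives the clip length, the generation throughput behind the quoted "7 FPS", the ratio of generation time to playback time, and the text-encoder size reduction, all from figures stated in the abstract. The variable names are hypothetical, and the derived values are back-of-the-envelope estimates.

```python
# Figures quoted in the abstract.
frames = 49            # frames per generated clip
playback_fps = 24      # playback frame rate
latency_s = 6.7        # end-to-end generation latency in seconds
t5_params_b = 4.762    # original T5xxl text encoder, billions of parameters
dt5_params_b = 0.2     # distilled DT5 text encoder, billions of parameters

clip_seconds = frames / playback_fps           # duration of the clip when played back
generation_fps = frames / latency_s            # frames produced per second of compute
slowdown = latency_s / clip_seconds            # generation time relative to playback time
encoder_shrink = t5_params_b / dt5_params_b    # text-encoder size reduction factor

print(round(clip_seconds, 2), round(generation_fps, 1),
      round(slowdown, 2), round(encoder_shrink, 1))
```

Under these figures generation runs at roughly 7 frames per second, about three times slower than 24 fps playback, so the quoted "7 FPS" describes generation throughput and not real-time playback speed.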
[113] LoopExpose: An Unsupervised Framework for Arbitrary-Length Exposure Correction
Ao Li,Chen Chen,Zhenyu Wang,Tao Huang,Fangfang Wu,Weisheng Dong
Main category: cs.CV
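TL;DR: LoopExpose is a pseudo-label-based unsupervised exposure correction framework. It generates pseudo-labels through a nested-loop optimization strategy and multi-exposure fusion, and introduces a Luminance Ranking Loss as a self-supervised constraint, removing the need for large-scale labeled data.
Details
Motivation: Exposure correction is essential for improving image quality under challenging lighting, but conventional supervised methods depend on large-scale labeled data that is hard to obtain. The authors therefore propose an unsupervised framework.
Contribution: 1. Proposes LoopExpose, a pseudo-label-based unsupervised exposure correction framework; 2. Generates pseudo-labels through a nested-loop optimization strategy and multi-exposure fusion; 3. Introduces a Luminance Ranking Loss as a self-supervised constraint.
Method: 1. A two-level nested-loop optimization in which the upper level trains the correction model and the lower level generates pseudo-labels through multi-exposure fusion; 2. A feedback mechanism that feeds corrected images back into the fusion process to refine the pseudo-labels; 3. A Luminance Ranking Loss that uses the relative luminance ordering of the input sequence as a self-supervised constraint.
Result: Experiments on several benchmark datasets show that LoopExpose outperforms existing unsupervised methods in both exposure correction and fusion.
Insight: By combining pseudo-labels with self-supervised constraints, LoopExpose shows the potential of unsupervised methods for exposure correction without depending on annotated data.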
Abstract: Exposure correction is essential for enhancing image quality under challenging lighting conditions. While supervised learning has achieved significant progress in this area, it relies heavily on large-scale labeled datasets, which are difficult to obtain in practical scenarios. To address this limitation, we propose a pseudo label-based unsupervised method called LoopExpose for arbitrary-length exposure correction. A nested loop optimization strategy is proposed to address the exposure correction problem, where the correction model and pseudo-supervised information are jointly optimized in a two-level framework. Specifically, the upper-level trains a correction model using pseudo-labels generated through multi-exposure fusion at the lower level. A feedback mechanism is introduced where corrected images are fed back into the fusion process to refine the pseudo-labels, creating a self-reinforcing learning loop. Considering the dominant role of luminance calibration in exposure correction, a Luminance Ranking Loss is introduced to leverage the relative luminance ordering across the input sequence as a self-supervised constraint. Extensive experiments on different benchmark datasets demonstrate that LoopExpose achieves superior exposure correction and fusion performance, outperforming existing state-of-the-art unsupervised methods. Code is available at https://github.com/FALALAS/LoopExpose.
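The abstract does not give the formula for the Luminance Ranking Loss, so the following is only a minimal sketch of one plausible form: a pairwise hinge penalty that fires when two corrected outputs are ordered differently from their inputs. The function names, the hinge form, and the `margin` parameter are assumptions, not the authors' definition.

```python
from itertools import combinations


def mean_luminance(image):
    """Mean of a grayscale image given as a list of rows with values in [0, 1]."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def luminance_ranking_loss(input_lums, output_lums, margin=0.0):
    """Hypothetical pairwise hinge loss on luminance ordering.

    For every pair of frames whose inputs have different mean luminance,
    penalize outputs whose ordering disagrees with the input ordering.
    """
    loss, pairs = 0.0, 0
    for i, j in combinations(range(len(input_lums)), 2):
        if input_lums[i] == input_lums[j]:
            continue  # ties carry no ordering to enforce
        sign = 1.0 if input_lums[i] > input_lums[j] else -1.0
        loss += max(0.0, margin - sign * (output_lums[i] - output_lums[j]))
        pairs += 1
    return loss / pairs if pairs else 0.0


# Outputs that keep the input ordering incur no penalty.
print(luminance_ranking_loss([0.1, 0.5, 0.9], [0.4, 0.5, 0.6]))
# Outputs that reverse the ordering are penalized.
print(luminance_ranking_loss([0.1, 0.5, 0.9], [0.6, 0.5, 0.4]))
```

A ranking constraint of this kind needs no ground-truth image, only the relative brightness of the inputs, which is what makes it usable as a self-supervised signal.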
[114] An Artificial Intelligence-based Assistant for the Visually Impaired
Luis Marquez-Carpintero,Francisco Gomez-Donoso,Zuria Bauer,Bessie Dominguez-Dager,Alvaro Belmonte-Baeza,Mónica Pina-Navarro,Francisco Morillas-Espejo,Felix Escalona,Miguel Cazorla
Main category: cs.CV
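TL;DR: The paper presents AIDEN, an AI assistant that uses state-of-the-art machine learning to improve the quality of life of visually impaired people by helping them identify objects, read text, and ask questions about their environment.
Details
Motivation: Visually impaired people face daily challenges in identifying objects, reading text, and navigating unfamiliar environments, and existing solutions such as Braille and audio books do not cover every situation. An effective AI assistant is therefore needed.
Contribution: Presents AIDEN, an AI-based assistant that combines YOLO architectures with a large language-and-vision model to improve the independence and quality of life of visually impaired users.
Method: Uses You Only Look Once (YOLO) architectures and a Large Language and Vision Assistant (LLaVA) for object recognition, text reading, and question answering about the environment.
Result: User feedback supports the claim that AIDEN improves user autonomy and access to information.
Insight: An AI assistant that combines computer vision with natural language processing can give visually impaired users a more complete assistive tool and cover gaps left by existing solutions.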
Abstract: This paper describes an artificial intelligence-based assistant application, AIDEN, developed during 2023 and 2024, aimed at improving the quality of life for visually impaired individuals. Visually impaired individuals face challenges in identifying objects, reading text, and navigating unfamiliar environments, which can limit their independence and reduce their quality of life. Although solutions such as Braille, audio books, and screen readers exist, they may not be effective in all situations. This application leverages state-of-the-art machine learning algorithms to identify and describe objects, read text, and answer questions about the environment. Specifically, it uses You Only Look Once architectures and a Large Language and Vision Assistant. The system incorporates several methods to facilitate the user’s interaction with the system and access to textual and visual information in an appropriate manner. AIDEN aims to enhance user autonomy and access to information, contributing to an improved perception of daily usability, as supported by user feedback.
[115] Hybrid CNN-ViT Framework for Motion-Blurred Scene Text Restoration
Umar Rashid,Muhammad Arslan Arshad,Ghulam Ahmad,Muhammad Zeeshan Anjum,Rizwan Khan,Muhammad Akmal
Main category: cs.CV
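TL;DR: The paper proposes a hybrid CNN-ViT framework for restoring motion-blurred scene-text images. It combines the local feature extraction of CNNs with the global dependency modeling of ViTs and delivers strong restoration quality in a lightweight, efficient model.
Details
Motivation: Motion blur severely degrades the readability of scene-text images and the reliability of computer vision tasks. Conventional methods struggle with spatially varying blur and with modeling long-range dependencies, so a new approach is needed.
Contribution: Proposes a hybrid CNN-ViT framework that combines local features with global contextual reasoning and is optimized with a multi-term loss to improve the restoration of motion-blurred text.
Method: A CNN encoder-decoder preserves structural details, while a ViT module strengthens global awareness through self-attention. The model is trained on synthetically blurred data derived from TextOCR with a composite loss combining MAE, MSE, perceptual similarity, and SSIM.
Result: The method reaches 32.20 dB PSNR and 0.934 SSIM with a lightweight model (2.83M parameters, 61 ms average inference time).
Insight: A hybrid CNN-ViT design balances local detail against global dependencies and is suitable for text restoration in real-world scenes.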
Abstract: Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), mean squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms. These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.
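To make the loss and metric mentioned in the abstract concrete, here is a minimal sketch of the pixel-wise part of a composite loss (MAE plus MSE) and of PSNR, using flat lists of pixel values in [0, 1]. The perceptual and SSIM terms are omitted for brevity, and the weights `w_mae` and `w_mse` are assumptions, since the abstract does not state how the terms are weighted.

```python
import math


def mae(pred, target):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse(pred, target):
    """Mean squared error between two equal-length pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(pred, target, w_mae=1.0, w_mse=1.0):
    """Weighted sum of MAE and MSE (hypothetical weights; perceptual and
    SSIM terms from the abstract are left out of this sketch)."""
    return w_mae * mae(pred, target) + w_mse * mse(pred, target)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


restored = [0.5, 0.5]
sharp = [0.4, 0.6]
print(composite_loss(restored, sharp))
print(psnr(restored, sharp))
```

Because PSNR is a logarithm of the inverse MSE, every tenfold reduction in MSE adds 10 dB.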
[116] MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution
Hua Chang,Xin Xu,Wei Liu,Wei Wang,Xin Yuan,Kui Jiang
Main category: cs.CV
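TL;DR: The paper proposes MambaOVSR for space-time super-resolution of Chinese opera videos. It combines multiscale fusion with global motion modeling to address the weakness of existing STVSR methods on large motions, and performs strongly on the new COVC dataset.
Details
Motivation: Early recordings of Chinese opera have low resolution and low frame rates because of equipment limitations. Existing STVSR methods lack global modeling capability and struggle with the large motions characteristic of opera.
Contribution: 1. Builds a pioneering large-scale Chinese Opera Video Clip (COVC) dataset; 2. Proposes MambaOVSR with a Global Fusion Module (GFM) and a Multiscale Synergistic Mamba Module (MSMM) to improve motion modeling; 3. Designs the MambaVR block to resolve artifacts that arise during feature alignment.
Method: 1. GFM models global motion through a multiscale alternating scanning mechanism; 2. MSMM aligns sequences of different lengths across scales; 3. The MambaVR block refines feature alignment.
Result: On the COVC dataset, MambaOVSR outperforms the state-of-the-art STVSR method by 1.86 dB PSNR on average.
Insight: Global motion modeling and multiscale fusion are key to handling the large motions in opera videos, an application that STVSR had not previously addressed.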
Abstract: Chinese opera is celebrated for preserving classical art. However, early filming equipment limitations have degraded videos of last-century performances by renowned artists (e.g., low frame rates and resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high-frequency details, and existing STVSR methods lack global modeling capabilities, compromising visual quality when handling opera’s characteristic large motions. To address these challenges, we pioneer a large-scale Chinese Opera Video Clip (COVC) dataset and propose the Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM) for motion modeling through a multiscale alternating scanning mechanism, the Multiscale Synergistic Mamba Module (MSMM) for alignment across different sequence lengths, and the MambaVR block, which resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR significantly outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR. Dataset and Code will be publicly released.
[117] Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models
Rodrigo Gallardo,Oz Fishman,Alexander Htet Kyaw
Main category: cs.CV
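TL;DR: The paper proposes a human-in-the-loop computer vision framework that combines generative AI and a vision-language model to propose micro-scale design interventions in public space. The system detects urban objects, builds co-occurrence embeddings, recommends likely complementary objects, and finally suggests a more complex urban tactic.
Details
Motivation: Traditional urban planning tends to be top-down and overlooks everyday needs and spatial patterns. The paper aims to use AI to ground design in lived experience and to support more continuous local participation.
Contribution: 1. A framework that uses Grounding DINO and the ADE20K dataset to detect urban objects and build co-occurrence embeddings; 2. A recommender driven by a vision-language model that suggests more complex urban design tactics; 3. A human-in-the-loop workflow that keeps users in control of selection and refinement.
Method: 1. Detects urban objects with Grounding DINO on a curated subset of ADE20K and analyzes their spatial co-occurrence patterns; 2. Uses the co-occurrence embeddings to propose five statistically likely complements to a chosen anchor object; 3. A vision-language model (VLM) reasons over the scene image and the selected pair to recommend a third object that completes a more complex design.
Result: The system produces micro-scale design interventions consistent with observed spatial patterns and offers more complex tactics through the VLM, while keeping the user in control.
Insight: Combining AI with urban planning can better capture the spatial patterns of everyday life, and a human-in-the-loop design improves the practicality and acceptance of the proposals.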
Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
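The recommendation step can be illustrated with a simplified stand-in. The sketch below ranks complements of an anchor object by how often they appear in the same scene, using plain co-occurrence frequencies rather than the learned co-occurrence embeddings the paper describes. The scene labels and function names are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(scenes):
    """Count, for each ordered pair (a, b), the scenes containing both labels."""
    counts = Counter()
    for labels in scenes:
        for a, b in permutations(set(labels), 2):
            counts[(a, b)] += 1
    return counts


def top_complements(scenes, anchor, k=5):
    """Return up to k labels ranked by P(label in scene | anchor in scene)."""
    counts = cooccurrence_counts(scenes)
    anchor_scenes = sum(1 for labels in scenes if anchor in labels)
    if anchor_scenes == 0:
        return []
    scored = [(b, n / anchor_scenes) for (a, b), n in counts.items() if a == anchor]
    # Highest probability first; break ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical detections, one list of object labels per scene image.
scenes = [
    ["bench", "tree", "lamp"],
    ["bench", "tree"],
    ["bench", "bin"],
    ["tree", "lamp"],
]
print(top_complements(scenes, "bench", k=2))
```

In the workflow described above, the user would pick one of the returned complements, and the scene image plus the selected pair would then be passed to the vision-language model.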
[118] MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition
Shu Zhao,Nilesh Ahuja,Tan Yu,Tianyi Shen,Vijaykrishnan Narayanan
Main category: cs.CV
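TL;DR: MoRA is a parameter-efficient fine-tuning method for visual recognition with missing modalities. It explicitly models cross-modal interactions and combines them with modality-specific parameters, improving both accuracy and efficiency.
Details
Motivation: In practice, multimodal data is often incomplete because of privacy constraints, collection difficulties, or resource limits. Existing methods fail to capture cross-modal relationships and incur computational overhead; MoRA targets both problems.
Contribution: Proposes MoRA, which introduces modality-common parameters for cross-modal knowledge transfer and combines them with modality-specific parameters to maintain inter-modality interaction and intra-modality flexibility.
Method: A parameter-efficient fine-tuning strategy that jointly optimizes modality-common and modality-specific parameters, reducing computational overhead.
Result: On standard benchmarks, MoRA improves average performance in missing-modality scenarios by 5.24%, uses only 25.90% of the inference time of the SOTA method, and trains only 0.11% of the parameters required for full fine-tuning.
Insight: MoRA shows the value of explicitly modeling cross-modal interactions and of parameter-efficient fine-tuning in missing-modality scenarios, and confirms that combining shared and modality-specific parameters is effective.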
Abstract: Pre-trained vision language models have shown remarkable performance on visual recognition tasks, but they typically assume the availability of complete multimodal inputs during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. While previous approaches have addressed this challenge using prompt learning techniques, they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and suffer from inevitable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between text and vision encoders, enabling bidirectional knowledge transfer. Additionally, combined with the modality-specific parameters, MoRA allows the backbone model to maintain inter-modality interaction and enable intra-modality flexibility. Extensive experiments on standard benchmarks demonstrate that MoRA improves average performance in missing-modality scenarios by 5.24% and uses only 25.90% of the inference time of the SOTA method, while requiring only 0.11% of the trainable parameters of full fine-tuning.
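The abstract does not spell out how the modality-common and modality-specific parameters enter the model. One way to picture the idea, borrowing the standard low-rank adaptation form, is a frozen weight matrix plus two low-rank updates, one shared by the text and vision encoders and one private to each. The sketch below is an assumption along those lines, not the authors' formulation, and all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def low_rank_delta(down, up, x):
    """Apply a rank-r update: up @ (down @ x), with down of shape (r, d_in)
    and up of shape (d_out, r)."""
    return matvec(up, matvec(down, x))


def adapted_forward(frozen_w, x, shared, specific):
    """Frozen projection plus a shared and a modality-specific low-rank update.

    `shared` is the same (down, up) pair in both encoders (assumed design);
    `specific` is a separate (down, up) pair for each modality.
    """
    base = matvec(frozen_w, x)
    d_shared = low_rank_delta(shared[0], shared[1], x)
    d_specific = low_rank_delta(specific[0], specific[1], x)
    return [b + s + p for b, s, p in zip(base, d_shared, d_specific)]


frozen_w = [[1.0, 0.0], [0.0, 1.0]]        # frozen backbone weight (identity here)
shared = ([[1.0, 0.0]], [[1.0], [0.0]])    # rank-1 update shared across modalities
vision_only = ([[0.0, 1.0]], [[0.0], [0.5]])  # rank-1 update for the vision encoder
print(adapted_forward(frozen_w, [1.0, 2.0], shared, vision_only))
```

Only the small `down` and `up` matrices would be trained in such a scheme, which is consistent with the very small trainable-parameter fraction reported in the abstract.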
[119] Temporal-Guided Visual Foundation Models for Event-Based Vision
Ruihao Xia,Junhong Cai,Luziwei Leng,Liuyi Wang,Chengju Liu,Ran Cheng,Yang Tang,Pan Zhou
Main category: cs.CV
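TL;DR: TGVFM is a framework that combines Visual Foundation Models (VFMs) with a temporal context fusion block to process event-camera data, achieving SOTA performance on semantic segmentation, depth estimation, and object detection.
Details
Motivation: Event cameras have unique advantages in challenging environments, but existing methods rely on specialized architectures or resource-intensive training, and the potential of image-pretrained VFMs for event-based vision remains under-explored.
Contribution: Proposes the TGVFM framework with Long-Range Temporal Attention, Dual Spatiotemporal Attention, and a Deep Feature Guidance Mechanism, integrating VFMs with temporal information to improve event-based vision tasks.
Method: A temporal context fusion block that combines Long-Range Temporal Attention, Dual Spatiotemporal Attention, and a Deep Feature Guidance Mechanism fuses pretrained VFMs with temporal information.
Result: Improvements of 16%, 21%, and 16% over existing methods on semantic segmentation, depth estimation, and object detection, respectively, reaching SOTA performance.
Insight: Combining image-based VFMs with temporal reasoning across modalities unlocks new potential for event-based vision.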
Abstract: Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that integrates VFMs with our temporal context fusion block seamlessly to bridge this gap. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.
[120] Physics-Informed Image Restoration via Progressive PDE Integration
Shamika Likhite,Santiago López-Tapia,Aggelos K. Katsaggelos
Main category: cs.CV
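TL;DR: The paper proposes a progressive, physics-informed PDE integration method for image restoration. Incorporating physical priors such as advection-diffusion equations captures the directional characteristics of motion blur and consistently improves existing deep learning methods.
Details
Motivation: Motion blur severely degrades image quality and downstream tasks such as object detection. Existing deep learning methods struggle to capture the long-range spatial dependencies of motion blur and rely on extremely deep networks to model global relationships, so physical priors are needed to guide feature evolution during restoration.
Contribution: 1. A progressive training framework that integrates physics-informed PDE dynamics into image restoration architectures; 2. Advection-diffusion equations that model feature evolution and capture the directional characteristics of motion blur; 3. Validation on several state-of-the-art architectures, with significant PSNR and SSIM gains.
Method: 1. A progressive training framework introduces PDE dynamics layer by layer; 2. Advection-diffusion equations model feature evolution, incorporating directional flow; 3. Global PDE layers are embedded in architectures such as FFTformer and NAFNet.
Result: On standard motion deblurring benchmarks, the PDE-enhanced models significantly improve PSNR and SSIM while adding only about 1% to inference GMACs, with consistent gains across four architectures (FFTformer, NAFNet, Restormer, and Stripformer).
Insight: Embedding physics-informed PDE layers in a neural network is an efficient way to model the global spatial dependencies of motion blur and points to a new direction for physics-informed network design in computer vision.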
Abstract: Motion blur, caused by relative movement between camera and scene during exposure, significantly degrades image quality and impairs downstream computer vision tasks such as object detection, tracking, and recognition in dynamic environments. While deep learning-based motion deblurring methods have achieved remarkable progress, existing approaches face fundamental challenges in capturing the long-range spatial dependencies inherent in motion blur patterns. Traditional convolutional methods rely on limited receptive fields and require extremely deep networks to model global spatial relationships. These limitations motivate the need for alternative approaches that incorporate physical priors to guide feature evolution during restoration. In this paper, we propose a progressive training framework that integrates physics-informed PDE dynamics into state-of-the-art restoration architectures. By leveraging advection-diffusion equations to model feature evolution, our approach naturally captures the directional flow characteristics of motion blur while enabling principled global spatial modeling. Our PDE-enhanced deblurring models achieve superior restoration quality with minimal overhead, adding only approximately 1% to inference GMACs while providing consistent improvements in perceptual quality across multiple state-of-the-art architectures. Comprehensive experiments on standard motion deblurring benchmarks demonstrate that our physics-informed approach improves PSNR and SSIM significantly across four diverse architectures, including FFTformer, NAFNet, Restormer, and Stripformer. These results validate that incorporating mathematical physics principles through PDE-based global layers can enhance deep learning-based image restoration, establishing a promising direction for physics-informed neural network design in computer vision applications.
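For readers unfamiliar with advection-diffusion dynamics, the sketch below takes explicit finite-difference steps of the 1D equation du/dt + v du/dx = D d²u/dx² on a periodic grid. It is a textbook illustration of the physics only: the paper applies such dynamics to learned feature maps inside restoration networks, and the grid, step sizes, and function name here are assumptions.

```python
def advection_diffusion_step(u, velocity, diffusivity, dx=1.0, dt=0.1):
    """One explicit step of du/dt + v * du/dx = D * d2u/dx2 (periodic boundary).

    Advection uses a first-order upwind difference; diffusion uses a
    central second difference.
    """
    n = len(u)
    nxt = []
    for i in range(n):
        left, right = u[(i - 1) % n], u[(i + 1) % n]
        # Upwind: difference taken on the side the flow comes from.
        if velocity >= 0:
            advect = velocity * (u[i] - left) / dx
        else:
            advect = velocity * (right - u[i]) / dx
        diffuse = diffusivity * (left - 2.0 * u[i] + right) / dx ** 2
        nxt.append(u[i] + dt * (diffuse - advect))
    return nxt


# A single spike is transported to the right and smoothed out.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
stepped = advection_diffusion_step(signal, velocity=1.0, diffusivity=0.1)
print(stepped)
```

The advection term moves content along a direction while the diffusion term spreads it, which is the intuition behind using this equation to describe directional blur. The small step size keeps the explicit scheme stable for these coefficients.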
[121] Gait Recognition via Collaborating Discriminative and Generative Diffusion Models
Haijun Xiong,Bin Feng,Bang Wang,Xinggang Wang,Wenyu Liu
Main category: cs.CV
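TL;DR: The paper proposes CoD², a new framework that combines a discriminative model with a generative diffusion model. A multi-level conditional control strategy extracts robust gait features and achieves SOTA performance on multiple datasets.
Details
Motivation: Gait recognition is a non-intrusive biometric technique. Discriminative models have been very successful, but the potential of generative models remains largely unexplored. The paper aims to combine the strengths of both to make gait features more robust.
Contribution: 1. Proposes the CoD² framework, which combines discriminative and generative diffusion models; 2. Designs a Multi-level Conditional Control strategy that fuses high-level semantics with low-level visual details; 3. Validates the method on multiple datasets.
Method: 1. A discriminative extractor captures high-level identity-aware semantics; 2. A diffusion model generates identity-consistent gait sequences while preserving low-level visual details; 3. The generated sequences in turn help the discriminative model learn better semantic features.
Result: Achieves SOTA performance on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) and integrates seamlessly with existing discriminative methods.
Insight: Combining generative and discriminative models improves feature robustness and suggests a direction for other biometric recognition tasks.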
Abstract: Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce CoD$^2$, a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor’s learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD$^2$ achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.
[122] AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
Ruifei Zhang,Junlin Xie,Wei Zhang,Weikai Chen,Xiao Tan,Xiang Wan,Guanbin Li
Main category: cs.CV
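TL;DR: AdaDrive is an adaptive slow-fast system for language-grounded autonomous driving. It activates the LLM dynamically in complex or critical scenarios and uses an adaptive fusion strategy to balance reasoning quality against real-time performance.
Details
Motivation: Existing approaches that use LLMs for autonomous driving either activate the LLM too often, causing heavy computational overhead, or follow a fixed schedule that cannot adapt to changing driving conditions. AdaDrive targets this problem.
Contribution: 1) Proposes an adaptive activation loss that decides dynamically when to invoke the LLM; 2) Introduces an adaptive fusion strategy that adjusts the LLM's influence according to scene complexity and prediction confidence.
Method: 1) A comparative learning mechanism triggers the LLM dynamically; 2) An adaptive fusion strategy scales the LLM's influence continuously.
Result: On language-grounded autonomous driving benchmarks, AdaDrive achieves state-of-the-art driving accuracy and computational efficiency.
Insight: Adjusting the LLM's level of participation dynamically is key to efficient autonomous driving. AdaDrive offers a flexible framework that exploits the LLM's high-level reasoning while meeting real-time requirements.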
Abstract: Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) When to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) How to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive achieves state-of-the-art performance in terms of both driving accuracy and computational efficiency. Code is available at https://github.com/ReaFly/AdaDrive.
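The contrast between binary activation and a continuous, scaled LLM influence can be sketched as a simple convex blend of the fast planner's output and the LLM's output. The weighting rule below (complexity times one minus confidence) is purely an assumed placeholder; the abstract does not give AdaDrive's actual fusion formula, and all function names are hypothetical.

```python
def llm_weight(scene_complexity, planner_confidence):
    """Hypothetical rule: rely more on the LLM when the scene is complex and
    the fast planner is unsure. Both inputs are assumed to lie in [0, 1]."""
    weight = scene_complexity * (1.0 - planner_confidence)
    return max(0.0, min(1.0, weight))


def fuse(planner_out, llm_out, weight):
    """Convex blend: weight 0 keeps the planner output, weight 1 uses the LLM."""
    return [(1.0 - weight) * p + weight * q for p, q in zip(planner_out, llm_out)]


planner_waypoint = [0.0, 0.0]   # made-up fast-planner prediction
llm_waypoint = [1.0, 2.0]       # made-up LLM-guided prediction

w = llm_weight(scene_complexity=0.8, planner_confidence=0.5)
print(w, fuse(planner_waypoint, llm_waypoint, w))
```

A binary scheme corresponds to restricting `weight` to 0 or 1; allowing intermediate values lets the slow branch nudge the plan without fully overriding the fast branch.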
[123] VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Ruifei Zhang,Wei Zhang,Xiao Tan,Sibei Yang,Xiang Wan,Xiaonan Luo,Guanbin Li
Main category: cs.CV
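TL;DR: VLDrive is a lightweight multimodal large language model (MLLM) architecture with enhanced vision components and efficient feature learning for language-grounded autonomous driving. It substantially reduces parameter count while improving driving performance.
Details
Motivation: Existing LLM-based driving approaches suffer from two problems, limited visual representations and very large parameter counts, which restrict robustness and practical deployment.
Contribution: 1. Introduces VLDrive, a lightweight MLLM architecture that cuts parameters by 81%. 2. Proposes cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation to produce compact visual tokens. 3. Designs a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning for long-range visual tokens.
Method: 1. Cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation generate compact visual tokens. 2. A distance-decoupled instruction attention mechanism improves joint visual-linguistic learning at long range. 3. A lightweight MLLM architecture.
Result: In closed-loop evaluation in the CARLA simulator, VLDrive improves driving scores by 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, while reducing parameters by 81% (from 7B to 1.3B).
Insight: Lightweight design combined with new visual representation learning can address both the parameter cost and the weak visual representations of LLMs in autonomous driving, making practical deployment more feasible.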
Abstract: Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive’s effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
[124] LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval
Jian Zhang,Junyi Guo,Junyi Yuan,Huanda Lu,Yanlin Zhou,Fangyu Wu,Qiufeng Wang,Dongming Lu
Main category: cs.CV
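TL;DR: LLM-generated descriptions used for cross-modal retrieval suffer from completeness and consistency problems. The $C^3$ framework assesses semantic completeness using visual cues and language-model outputs, and supervises Chain-of-Thought reasoning with a Markov Decision Process to improve consistency, which improves retrieval performance.
Details
Motivation: Cross-modal retrieval over cultural heritage data is often limited by incomplete or inconsistent textual descriptions. LLMs can enrich these descriptions, but their outputs may contain hallucinations or miss visually grounded details.
Contribution: Proposes the $C^3$ framework, which refines LLM-generated descriptions through a completeness evaluation module and a consistency supervision method to improve cross-modal retrieval.
Method: 1) A completeness evaluation module combines visual cues with language-model outputs; 2) A Markov Decision Process supervises Chain-of-Thought reasoning and controls queries adaptively to ensure consistency.
Result: On the CulTi, TimeTravel, MSCOCO, and Flickr30K datasets, $C^3$ achieves state-of-the-art results in both fine-tuned and zero-shot settings.
Insight: Completeness evaluation grounded in visual information, together with supervised reasoning, compensates for the weaknesses of LLM generation and improves cross-modal retrieval.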
Abstract: Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.
[125] RelightMaster: Precise Video Relighting with Multi-plane Light Images
Weikang Bian,Xiaoyu Shi,Zhaoyang Huang,Jianhong Bai,Qinghe Wang,Xintao Wang,Pengfei Wan,Kun Gai,Hongsheng Li
Main category: cs.CV
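TL;DR: RelightMaster is a framework for precise video relighting based on Multi-plane Light Images. It addresses the lack of lighting control in diffusion models for video relighting, using a newly built high-quality dataset and a new visual prompt to achieve precise lighting effects.
Details
Motivation: Text-to-video (T2V) diffusion models lack fine-grained lighting control, and high-quality relighting training data is scarce. RelightMaster aims to solve both problems and provide precise, controllable video relighting.
Contribution: 1) Builds RelightVideo, the first dataset with identical dynamic content under varying lighting; 2) Proposes the Multi-plane Light Image (MPLI) as a new visual prompt; 3) Designs a Light Image Adapter that is compatible with pre-trained video diffusion models.
Method: 1) Builds the RelightVideo dataset with Unreal Engine; 2) Uses MPLI to model 3D light-source information; 3) Compresses the MPLI with a pre-trained Video VAE and injects the resulting latent light features into a pre-trained video diffusion model.
Result: Experiments show that RelightMaster generates physically plausible lighting and shadows while preserving the original scene content.
Insight: As a visual prompt, MPLI extends the lighting control of diffusion models while avoiding catastrophic forgetting in the pre-trained model.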
Abstract: Recent advances in diffusion models enable high-quality video generation and editing, but precise relighting with consistent video contents, which is critical for shaping scene atmosphere and viewer attention, remains unexplored. Mainstream text-to-video (T2V) models lack fine-grained lighting control due to text’s inherent limitation in describing lighting details and insufficient pre-training on lighting-related prompts. Additionally, constructing high-quality relighting training data is challenging, as real-world controllable lighting data is scarce. To address these issues, we propose RelightMaster, a novel framework for accurate and controllable video relighting. First, we build RelightVideo, the first dataset with identical dynamic content under varying precise lighting conditions based on the Unreal Engine. Then, we introduce Multi-plane Light Image (MPLI), a novel visual prompt inspired by Multi-Plane Image (MPI). MPLI models lighting via K depth-aligned planes, representing 3D light source positions, intensities, and colors while supporting multi-source scenarios and generalizing to unseen light setups. Third, we design a Light Image Adapter that seamlessly injects MPLI into pre-trained Video Diffusion Transformers (DiT): it compresses MPLI via a pre-trained Video VAE and injects latent light features into DiT blocks, leveraging the base model’s generative prior without catastrophic forgetting. Experiments show that RelightMaster generates physically plausible lighting and shadows and preserves original scene content. Demos are available at https://wkbian.github.io/Projects/RelightMaster/.
[126] LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation
Zijie Wang,Weiming Zhang,Wei Zhang,Xiao Tan,Hongxing Liu,Yaowei Wang,Guanbin Li
Main category: cs.CV
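TL;DR: LaneDiffusion is a diffusion-based approach to centerline graph learning. It generates lane centerline priors at the BEV feature level, which significantly improves centerline learning.
Details
Motivation: Conventional deterministic methods for centerline graph learning lack spatial reasoning and struggle with occluded or invisible centerlines. Generative methods are promising but remain underexplored.
Contribution: 1) Proposes LaneDiffusion, which applies diffusion models to centerline graph learning; 2) Designs the LPIM and LPDM modules to inject priors and manage the diffusion process.
Method: 1) A diffusion model generates lane centerline priors at the BEV feature level; 2) The LPIM and LPDM modules construct the diffusion targets and manage the diffusion process; 3) Vectorized centerlines and topology are decoded from the prior-injected features.
Result: On nuScenes and Argoverse2, LaneDiffusion significantly outperforms existing methods, with gains of up to 6.4% on point-level metrics and up to 6.8% on segment-level metrics, reaching SOTA performance.
Insight: Generative models are promising for centerline graph learning, especially in complex scenes, and diffusion-based generation at the BEV feature level is an effective strategy.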
Abstract: Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird’s Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_cf, DET_l and TOP_ll). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task.
[127] VideoSSR: Video Self-Supervised Reinforcement Learning
Zefeng He,Xiaoye Qu,Yafu Li,Siyuan Huang,Daizong Liu,Yu Cheng
Main category: cs.CV
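TL;DR: The paper proposes VideoSSR, a video understanding framework based on self-supervised reinforcement learning. With three self-supervised pretext tasks and the VIUBench benchmark, it improves the video performance of multimodal large language models.
Details
Motivation: Multimodal large language models are advancing faster than the complexity of existing video datasets, and annotating high-quality data is expensive. The paper therefore explores how the intrinsic information in videos can be used to self-generate high-quality, verifiable training data.
Contribution: 1. Three self-supervised pretext tasks (Anomaly Grounding, Object Counting, and Temporal Jigsaw); 2. The VIUBench benchmark and the VideoSSR-30K dataset; 3. The VideoSSR framework, which consistently improves model performance.
Method: The pretext tasks generate verifiable training data, which is used with Reinforcement Learning with Verifiable Rewards (RLVR) to improve the video understanding of multimodal large language models.
Result: Across 17 benchmarks spanning four major video domains, average performance improves by more than 5%.
Insight: Self-supervision can exploit the intrinsic information in videos to generate high-quality training data, compensating for the scarcity of human annotation and providing a new foundation for video understanding.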
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at https://github.com/lcqysl/VideoSSR.
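To show how a pretext task can yield training data with an automatically checkable answer, here is a minimal sketch of a Temporal Jigsaw style task: a clip is cut into segments, the segments are shuffled, and the correct answer is the permutation that restores the original order. The task construction, the exact-match reward, and all names are assumptions for illustration; the paper's actual task design is not described in the abstract.

```python
import random


def make_temporal_jigsaw(frames, num_segments, seed=0):
    """Cut `frames` into equal contiguous segments and shuffle them.

    Returns the shuffled segments and the answer, where answer[k] is the
    position in the shuffled list of the k-th original segment. Frames left
    over when the length is not divisible by num_segments are dropped.
    """
    seg_len = len(frames) // num_segments
    segments = [frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    order = list(range(num_segments))
    random.Random(seed).shuffle(order)
    shuffled = [segments[i] for i in order]
    answer = [order.index(k) for k in range(num_segments)]
    return shuffled, answer


def restore(shuffled, answer):
    """Rebuild the original frame order from a (predicted) answer."""
    return [frame for pos in answer for frame in shuffled[pos]]


def jigsaw_reward(predicted, answer):
    """Verifiable reward: 1.0 only for an exact match with the known answer."""
    return 1.0 if list(predicted) == list(answer) else 0.0


frames = list(range(12))  # stand-in for 12 video frames
shuffled, answer = make_temporal_jigsaw(frames, num_segments=4, seed=1)
print(restore(shuffled, answer) == frames, jigsaw_reward(answer, answer))
```

Because the answer is known by construction, the reward can be computed without any human labels, which is the property RLVR relies on.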
[128] From ACR O-RADS 2022 to Explainable Deep Learning: Comparative Performance of Expert Radiologists, Convolutional Neural Networks, Vision Transformers, and Fusion Models in Ovarian Masses
Ali Abbasian Ardakani,Afshin Mohammadi,Alisa Mohebbi,Anushya Vijayananthan,Sook Sam Leong,Lim Yi Ting,Mohd Kamil Bin Mohamad Fabell,U Rajendra Acharya,Sepideh Hatamikia
Main category: cs.CV
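TL;DR: The paper compares expert radiologists, CNNs, ViTs, and hybrid models for diagnosing ovarian masses. ViT performs best, and a hybrid human-AI framework significantly improves CNN performance.
Details
Motivation: The study addresses the subjectivity and conservative thresholds of radiologist assessment of ovarian masses and evaluates the potential of deep learning models for this task.
Contribution: Compares radiologists, CNNs, ViTs, and hybrid models, showing that ViT performs best and that the hybrid framework significantly improves the diagnostic accuracy of CNNs.
Method: Using 512 adnexal mass images from 227 patients, sixteen DL models (including DenseNets and ViTs) were trained, and hybrid models combining radiologist O-RADS scores with DL-predicted probabilities were built.
Result: ViT16-384 performed best (AUC 0.941, accuracy 87.4%). Hybrid models significantly improved CNN performance, but the improvement for ViT models was not statistically significant.
Insight: Hybrid human-AI frameworks have substantial potential to standardize ultrasound interpretation and reduce misdiagnosis, and ViTs are highly competitive for medical image classification.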
Abstract: Background: The 2022 update of the Ovarian-Adnexal Reporting and Data System (O-RADS) ultrasound classification refines risk stratification for adnexal lesions, yet human interpretation remains subject to variability and conservative thresholds. Concurrently, deep learning (DL) models have demonstrated promise in image-based ovarian lesion characterization. This study evaluates radiologist performance applying O-RADS v2022, compares it to leading convolutional neural network (CNN) and Vision Transformer (ViT) models, and investigates the diagnostic gains achieved by hybrid human-AI frameworks. Methods: In this single-center, retrospective cohort study, a total of 512 adnexal mass images from 227 patients (110 with at least one malignant cyst) were included. Sixteen DL models, including DenseNets, EfficientNets, ResNets, VGGs, Xception, and ViTs, were trained and validated. A hybrid model integrating radiologist O-RADS scores with DL-predicted probabilities was also built for each scheme. Results: Radiologist-only O-RADS assessment achieved an AUC of 0.683 and an overall accuracy of 68.0%. CNN models yielded AUCs of 0.620 to 0.908 and accuracies of 59.2% to 86.4%, while ViT16-384 reached the best performance, with an AUC of 0.941 and an accuracy of 87.4%. Hybrid human-AI frameworks further significantly enhanced the performance of CNN models; however, the improvement for ViT models was not statistically significant (P-value >0.05). Conclusions: DL models markedly outperform radiologist-only O-RADS v2022 assessment, and the integration of expert scores with AI yields the highest diagnostic accuracy and discrimination. Hybrid human-AI paradigms hold substantial potential to standardize pelvic ultrasound interpretation, reduce false positives, and improve detection of high-risk lesions.
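The two metrics reported throughout this abstract, AUC and accuracy, can be computed as follows. This is a generic sketch using the rank-based (Mann-Whitney) definition of AUC; the labels and scores are made-up toy values, not data from the study, and the 0.5 decision threshold is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at a fixed score threshold."""
    correct = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return correct / len(labels)


# Toy example: 1 = malignant, 0 = benign; scores are predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores), accuracy(labels, scores))
```

AUC is threshold-free and measures ranking quality, while accuracy depends on the chosen cut-off, which is why the abstract reports both.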
[129] TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks
Xuanle Zhao,Shuxin Zeng,Yinyuan Cai,Xiang Cheng,Duzhen Zhang,Xiuyi Chen,Bo Xu
Main category: cs.CV
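TL;DR: TinyChemVL is an efficient yet powerful chemical vision-language model that improves efficiency and reasoning ability by reducing the number of visual tokens and introducing complex reaction tasks. It outperforms existing models with fewer parameters and faster speed.
Details
Motivation: Existing vision-language models have seen limited use in chemistry, mainly because of low computational efficiency and a narrow task scope that overlooks critical visual information such as molecular structures. TinyChemVL aims to address these problems. Contribution: 1. TinyChemVL, which improves efficiency and reasoning ability through visual token reduction and complex reaction tasks; 2. ChemRxn-V, a benchmark for evaluating vision-based reaction recognition and prediction.
Method: 1. Visual token reduction to improve computational efficiency; 2. Combining molecular images with reaction-level tasks to strengthen reasoning.
Result: With only 4B parameters, TinyChemVL performs strongly on molecular and reaction tasks, with faster inference and training, while using only 1/16 of the visual tokens of ChemVLM.
Insight: Co-designing model architecture and task complexity makes it possible to build efficient yet powerful vision-language models for chemistry.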
Abstract: While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) the computational inefficiency of processing entire chemical images with non-informative backgrounds, and (ii) a narrow focus on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. We also propose ChemRxn-V, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.
[130] Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective
Bing Wang,Ximing Li,Yanjun Wang,Changchun Li,Lin Yuanbo Wu,Buyu Wang,Shengsheng Wang
Main category: cs.CV
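TL;DR: The paper proposes RETSIMD, a new multimodal misinformation detection method that splits the text into segments and generates a corresponding image for each one, strengthening the contribution of the image modality and improving detection performance.
Details
Motivation: Text contributes more than images to multimodal misinformation detection, because an image typically shows only part of the scene. Strengthening the descriptive power of the images can improve detection. Contribution: RETSIMD, which generates an image sequence from segmented text and combines auxiliary objectives with graph-based feature fusion to improve multimodal misinformation detection.
Method: The text is split into segments and a text-to-image generator produces an image sequence; auxiliary objectives on text-image and image-label mutual information are added; a graph is built over the images and a graph neural network fuses the features.
Result: Experiments validate the effectiveness of RETSIMD.
Insight: Strengthening the descriptive power of images compensates for their weaker role in multimodal misinformation detection and improves overall performance.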
Abstract: Multimodal Misinformation Detection (MMD) refers to the task of detecting social media posts involving misinformation, where the post often contains text and image modalities. However, by observing MMD posts, we hold that the text modality may be much more informative than the image modality, because the text generally describes the whole event/story of the current post while the image often presents partial scenes only. Our preliminary empirical results indicate that the image modality indeed contributes less to MMD. Building on this idea, we propose a new MMD method named RETSIMD. Specifically, we suppose that each text can be divided into several segments, and each text segment describes a partial scene that can be presented by an image. Accordingly, we split the text into a sequence of segments, and feed these segments into a pre-trained text-to-image generator to produce an augmented sequence of images. We further incorporate two auxiliary objectives concerning text-image and image-label mutual information, and post-train the generator on an auxiliary text-to-image generation benchmark dataset. Additionally, we propose a graph structure by defining three heuristic relationships between images, and use a graph neural network to generate the fused features. Extensive empirical results validate the effectiveness of RETSIMD.
[131] Learning-Based Vision Systems for Semi-Autonomous Forklift Operation in Industrial Warehouse Environments
Vamshika Sutar,Mahek Maheshwari,Archak Mittal
Main category: cs.CV
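TL;DR: The paper presents a vision-based framework for semi-autonomous forklift operation in industrial warehouses, using YOLOv8 and YOLOv11 architectures to detect and map pallets and pallet holes.
Details
Motivation: Warehouse automation needs low-cost, robust visual perception systems to support forklifts and automated guided vehicles. Contribution: 1. A vision framework based on a single camera; 2. YOLOv8 and YOLOv11 architectures enhanced through hyperparameter optimization and spatial post-processing; 3. An innovative pallet hole mapping module.
Method: Detection uses YOLOv8 and YOLOv11 with Optuna-driven hyperparameter optimization and spatial post-processing; a pallet hole mapping module is developed.
Result: YOLOv8 achieves high detection accuracy, while YOLOv11 under optimized configurations offers superior precision and stable convergence. The system is low-cost and scalable.
Insight: Low-cost vision systems are feasible for warehouse automation, and optimized YOLO architectures can substantially improve performance.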
Abstract: The automation of material handling in warehouses increasingly relies on robust, low cost perception systems for forklifts and Automated Guided Vehicles (AGVs). This work presents a vision based framework for pallet and pallet hole detection and mapping using a single standard camera. We utilized YOLOv8 and YOLOv11 architectures, enhanced through Optuna driven hyperparameter optimization and spatial post processing. An innovative pallet hole mapping module converts the detections into actionable spatial representations, enabling accurate pallet and pallet hole association for forklift operation. Experiments on a custom dataset augmented with real warehouse imagery show that YOLOv8 achieves high pallet and pallet hole detection accuracy, while YOLOv11, particularly under optimized configurations, offers superior precision and stable convergence. The results demonstrate the feasibility of a cost effective, retrofittable visual perception module for forklifts. This study proposes a scalable approach to advancing warehouse automation, promoting safer, economical, and intelligent logistics operations.
[132] Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field
Haoqin Hong,Ding Fan,Fubin Dou,Zhi-Li Zhou,Haoran Sun,Congcong Zhu,Jingrun Chen
Main category: cs.CV
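TL;DR: The paper proposes PIDG, which combines physics constraints with data-driven 3D Gaussian particle modeling to achieve physically consistent novel-view synthesis in dynamic scenes.
Details
Motivation: Purely data-driven 3DGS methods struggle to capture the diverse physics-driven motion patterns in dynamic scenes, so a method that incorporates physics constraints is needed. Contribution: PIDG, which treats each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and improves dynamic reconstruction quality through 2D optical flow supervision and physics constraints.
Method: A static-dynamic decoupled 4D decomposed hash encoding reconstructs geometry and motion efficiently; the Cauchy momentum residual is imposed as a physics constraint; data fitting is supervised by matching Lagrangian particle flow to optical flow.
Result: Experiments show that PIDG significantly improves physical consistency and dynamic reconstruction quality on a custom physics-driven dataset and on standard datasets.
Insight: Combining physics constraints with data-driven methods can substantially improve the accuracy and generalization of dynamic scene modeling.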
Abstract: Recently, 3D Gaussian Splatting (3DGS), an explicit scene representation technique, has shown significant promise for dynamic novel-view synthesis from monocular video input. However, purely data-driven 3DGS often struggles to capture the diverse physics-driven motion patterns in dynamic scenes. To fill this gap, we propose Physics-Informed Deformable Gaussian Splatting (PIDG), which treats each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and is supervised by 2D optical flow via motion projection. Specifically, we adopt static-dynamic decoupled 4D decomposed hash encoding to reconstruct geometry and motion efficiently. Subsequently, we impose the Cauchy momentum residual as a physics constraint, enabling independent prediction of each particle’s velocity and constitutive stress via a time-evolving material field. Finally, we further supervise data fitting by matching Lagrangian particle flow to camera-compensated optical flow, which accelerates convergence and improves generalization. Experiments on a custom physics-driven dataset as well as on standard synthetic and real-world datasets demonstrate significant gains in physical consistency and monocular dynamic reconstruction quality.
[133] Adaptive 3D Reconstruction via Diffusion Priors and Forward Curvature-Matching Likelihood Updates
Seunghyeok Shin,Dabin Kim,Hongki Lim
Main category: cs.CV
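TL;DR: This paper proposes an adaptive 3D reconstruction method based on diffusion priors and Forward Curvature-Matching likelihood updates. It adjusts step sizes dynamically to improve reconstruction quality and supports multi-view input without retraining.
Details
Motivation: Existing generative-model-based 3D reconstruction methods (such as diffusion models) are inflexible: they require specific conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. In addition, recent methods rely on heuristic fixed step sizes, which leads to slow convergence and suboptimal reconstruction quality. Contribution: A novel Forward Curvature-Matching (FCM) update method that is combined with diffusion sampling to optimize step sizes dynamically, requiring only forward automatic differentiation and finite-difference curvature estimates. It supports multiple input modalities and multi-view input without retraining.
Method: FCM is integrated with diffusion sampling to compute optimal step sizes dynamically. Forward automatic differentiation and finite-difference curvature estimates are used to optimize the likelihood update, improving reconstruction quality and efficiency.
Result: Experiments on ShapeNet and CO3D show that the method achieves higher reconstruction quality at matched or lower NFEs, with higher F-score and lower CD and EMD.
Insight: Dynamically optimizing the step size is key to improving the flexibility and reconstruction quality of diffusion models. Support for multiple input modalities and multiple views makes the method more practical.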
Abstract: Reconstructing high-quality point clouds from images remains challenging in computer vision. Existing generative-model-based approaches, particularly diffusion-model approaches that directly learn the posterior, may suffer from inflexibility – they require conditioning signals during training, support only a fixed number of input views, and need complete retraining for different measurements. Recent diffusion-based methods have attempted to address this by combining prior models with likelihood updates, but they rely on heuristic fixed step sizes for the likelihood update that lead to slow convergence and suboptimal reconstruction quality. We advance this line of approach by integrating our novel Forward Curvature-Matching (FCM) update method with diffusion sampling. Our method dynamically determines optimal step sizes using only forward automatic differentiation and finite-difference curvature estimates, enabling precise optimization of the likelihood update. This formulation enables high-fidelity reconstruction from both single-view and multi-view inputs, and supports various input modalities through simple operator substitution – all without retraining. Experiments on ShapeNet and CO3D datasets demonstrate that our method achieves superior reconstruction quality at matched or lower NFEs, yielding higher F-score and lower CD and EMD, validating its efficiency and adaptability for practical applications. Code is available at https://github.com/Seunghyeok0715/FCM
[134] Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them
Gur Elkn,Ofir Itzhak Shahar,Ohad Ben-Shahar
Main category: cs.CV
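TL;DR: The paper proposes a novel approach that solves square jigsaw puzzles with language models (Seq2Seq) and no visual input. By converting puzzle pieces into token sequences, it matches or exceeds the performance of traditional vision-based methods.
Details
Motivation: Traditional jigsaw solvers rely on visual input. This work explores whether the problem can be solved by non-visual means (language models), challenging established assumptions about the problem domain. Contribution: A puzzle-solving method based on token sequences that demonstrates the potential of language models on non-natural-language tasks and achieves state-of-the-art results on multiple benchmarks.
Method: A specialized tokenizer converts each puzzle piece into a discrete token sequence, recasting puzzle reassembly as a sequence-to-sequence prediction problem solved with encoder-decoder transformers.
Result: Experiments show state-of-the-art results across multiple benchmarks, often outperforming vision-based methods.
Insight: Language models can solve visual problems by reasoning over token sequences, which suggests new directions for cross-domain problem solving.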
Abstract: Jigsaw puzzles are primarily visual objects, whose algorithmic solutions have traditionally been framed from a visual perspective. In this work, however, we explore a fundamentally different approach: solving square jigsaw puzzles using language models, without access to raw visual input. By introducing a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens, we reframe puzzle reassembly as a sequence-to-sequence prediction task. Treated as “blind” solvers, encoder-decoder transformers accurately reconstruct the original layout by reasoning over token sequences alone. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results across multiple benchmarks, often outperforming vision-based methods. These findings highlight the surprising capability of language models to solve problems beyond their native domain, and suggest that unconventional approaches can inspire promising directions for puzzle-solving research.
[135] Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection
Dingkang Yang,Mingcheng Li,Xuecheng Wu,Zhaoyu Chen,Kaixun Jiang,Keliang Liu,Peng Zhai,Lihua Zhang
Main category: cs.CV
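TL;DR: The MODS framework improves multimodal sentiment analysis through modality optimization and dynamic primary modality selection. GDC reduces redundant noise, MSelector dynamically selects the primary modality, and PCCA enhances the primary modality and cross-modal interaction, outperforming existing methods.
Details
Motivation: In multimodal sentiment analysis, fixed primary modality strategies cannot adapt to the way modality importance varies across samples, and redundancy and noise in non-language modalities hurt performance. MODS aims to address these problems. Contribution: 1. A GDC module that reduces redundancy and noise in the acoustic and visual modalities; 2. MSelector, which dynamically selects the primary modality; 3. A PCCA module that enhances the primary modality and cross-modal interaction.
Method: 1. GDC uses capsule networks and graph convolution to compress redundant sequences; 2. MSelector dynamically selects the primary modality per sample; 3. PCCA refines the primary modality representation through cross-attention.
Result: On four benchmark datasets, MODS outperforms existing methods by balancing modality contributions and eliminating redundant noise.
Insight: Dynamic primary modality selection and redundancy compression are key to improving multimodal sentiment analysis, and MODS offers an effective remedy for modality imbalance.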
Abstract: Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and eliminating redundant noise.
[136] Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Individual, Structural, and Species Analysis
Aldino Rizaldy,Fabian Ewald Fassnacht,Ahmed Jamal Afifi,Hua Jiang,Richard Gloaguen,Pedram Ghamisi
Main category: cs.CV
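TL;DR: The paper proposes a label-efficient approach to 3D forest mapping that uses self-supervised and transfer learning to reduce dependence on large annotated datasets, improves tree instance segmentation, semantic segmentation, and classification, and integrates the tasks into a unified framework.
Details
Motivation: Precision forestry and biodiversity conservation need detailed structural and species information for individual trees, but conventional deep learning methods depend on large amounts of annotated data, and annotating 3D point clouds is time-consuming and hard to scale. Contribution: 1. Self-supervised learning combined with domain adaptation improves instance segmentation (AP50 +16.98%); 2. Self-supervised learning effectively supports semantic segmentation (mIoU +1.79%); 3. Hierarchical transfer learning enables accurate classification of unseen species (Jaccard +6.07%).
Method: Self-supervised and transfer learning architectures are used, with instance segmentation, semantic segmentation, and classification integrated into a unified framework.
Result: Pretrained models reduce energy consumption and carbon emissions by about 21% while improving task performance.
Insight: Combining self-supervised and transfer learning can significantly improve 3D point cloud tasks when annotated data is limited, making it suitable for operational forestry applications.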
Abstract: Detailed structural and species information on individual tree level is increasingly important to support precision forestry, biodiversity conservation, and provide reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advancements in deep learning improved segmenting and classifying individual trees and identifying semantic tree components. However, deep learning models typically require large amounts of annotated training data which limits further improvement. Producing dense, high-quality annotations for 3D point clouds, especially in complex forests, is labor-intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self-supervised and transfer learning architectures. Our objective is to improve performance across three tasks: instance segmentation, semantic segmentation, and tree classification using realistic and operational training sets. Our findings indicate that combining self-supervised learning with domain adaptation significantly enhances instance segmentation compared to training from scratch (AP50 +16.98%), self-supervised learning suffices for semantic segmentation (mIoU +1.79%), and hierarchical transfer learning enables accurate classification of unseen species (Jaccard +6.07%). To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open-source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.
[137] BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models
Shangfeng Huang,Ruisheng Wang,Xin Wang
Main category: cs.CV
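TL;DR: BuildingWorld is a structured 3D building dataset that addresses the limited architectural diversity of existing building datasets and provides globally representative data for urban foundation models.
Details
Motivation: Existing 3D urban modeling datasets cover a narrow range of architectural styles, which limits how well learned models generalize across heterogeneous urban environments. BuildingWorld aims to improve generalization by providing diverse building data. Contribution: The BuildingWorld dataset, containing about five million LOD2 building models from regions around the world together with LiDAR point clouds; Cyber City, a virtual city model for generating diverse training data; and standardized evaluation metrics.
Method: The dataset is built by integrating building data and LiDAR point clouds from multiple regions worldwide; Cyber City generates data with customized point cloud distributions; standardized evaluation metrics are designed.
Result: A comprehensive and diverse dataset that supports 3D building reconstruction, detection, and segmentation, and provides a standardized platform for training and evaluating urban foundation models.
Insight: Architectural diversity is crucial to the generalization of urban foundation models, and a virtual city model can flexibly generate training data to compensate for gaps in real-world data.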
Abstract: As digital twins become central to the transformation of modern cities, accurate and structured 3D building models emerge as a key enabler of high-fidelity, updatable urban representations. These models underpin diverse applications including energy modeling, urban planning, autonomous navigation, and real-time reasoning. Despite recent advances in 3D urban modeling, most learning-based models are trained on building datasets with limited architectural diversity, which significantly undermines their generalizability across heterogeneous urban environments. To address this limitation, we present BuildingWorld, a comprehensive and structured 3D building dataset designed to bridge the gap in stylistic diversity. It encompasses buildings from geographically and architecturally diverse regions – including North America, Europe, Asia, Africa, and Oceania – offering a globally representative dataset for urban-scale foundation modeling and analysis. Specifically, BuildingWorld provides about five million LOD2 building models collected from diverse sources, accompanied by real and simulated airborne LiDAR point clouds. This enables comprehensive research on 3D building reconstruction, detection and segmentation. Cyber City, a virtual city model, is introduced to enable the generation of unlimited training data with customized and structurally diverse point cloud distributions. Furthermore, we provide standardized evaluation metrics tailored for building reconstruction, aiming to facilitate the training, evaluation, and comparison of large-scale vision models and foundation models in structured 3D urban environments.
[138] GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding
Athul M. Mathew,Haithem Hermassi,Thariq Khalid,Arshad Ali Khan,Riad Souissi
Main category: cs.CV
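TL;DR: GazeVLM is a novel vision-language model (VLM) for multi-task gaze understanding in images, covering person detection, gaze target detection, and gaze object identification. Unlike other transformer-based methods, GazeVLM is, to the authors' knowledge, the first application of a VLM to this combination of tasks, and by fusing RGB with HHA-encoded depth maps it achieves state-of-the-art performance on the GazeFollow and VideoAttentionTarget datasets.
Details
Motivation: Gaze understanding plays an important role in estimating visual attention and intent, but existing methods lack a unified framework that combines the visual and language modalities. GazeVLM aims to fill this gap and provide more comprehensive gaze analysis. Contribution: 1. GazeVLM, presented as the first VLM combining vision and language for multi-task gaze understanding; 2. A fusion of RGB and HHA-encoded depth maps that improves performance; 3. A new object-level gaze detection metric, $AP_{ob}$.
Method: GazeVLM combines RGB images with HHA-encoded depth maps and uses text prompts to guide multi-task gaze understanding. The model is based on a vision-language transformer architecture and supports selective execution of each task.
Result: GazeVLM achieves state-of-the-art performance on the GazeFollow and VideoAttentionTarget datasets, with significant improvements over other methods.
Insight: Fusing vision and language significantly improves gaze understanding, and adding depth information further improves accuracy.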
Abstract: Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system is still needed for gaze understanding using both visual and language prompts. This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images, addressing person detection, gaze target detection, and gaze object identification. While other transformer-based methods exist for gaze analysis, GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task. Through the integration of visual (RGB and depth) and textual modalities, our ablation study on visual input combinations revealed that a fusion of RGB images with HHA-encoded depth maps, guided by text prompts, yields superior performance. We also introduce an object-level gaze detection metric for gaze object identification ($AP_{ob}$). Through experiments, GazeVLM demonstrates significant improvements, notably achieving state-of-the-art evaluation scores on GazeFollow and VideoAttentionTarget datasets.
[139] AesTest: Measuring Aesthetic Intelligence from Perception to Production
Guolong Wang,Heng Huang,Zhiqiang Zhang,Wentian Li,Feilong Ma,Xin Jin
Main category: cs.CV
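TL;DR: AesTest is a comprehensive benchmark for multimodal aesthetic perception and production, covering tasks in perception, appreciation, creation, and photography. It addresses the narrow scope and limited diversity of existing image aesthetic assessment benchmarks.
Details
Motivation: The ability of multimodal large language models to perceive and produce aesthetic judgments has not been adequately explored or evaluated, and existing benchmarks are narrow in scope or lack diversity. Contribution: The AesTest benchmark, which covers ten aesthetics-related tasks, draws on psychological theory and diverse data sources, supports multiple aesthetic query types, and fills a gap in aesthetic evaluation.
Method: AesTest uses multiple-choice question tasks that integrate professional editing workflows, photography tutorials, and crowdsourced preference data to evaluate MLLMs on aesthetic perception and production.
Result: The evaluation shows that existing MLLMs face significant challenges in aesthetic intelligence, indicating that AesTest effectively exposes model weaknesses.
Insight: Aesthetic evaluation needs to draw on both psychological theory and diverse data sources, and AesTest offers a useful evaluation tool for future research on aesthetic intelligence.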
Abstract: Perceiving and producing aesthetic judgments is a fundamental yet underexplored capability for multimodal large language models (MLLMs). However, existing benchmarks for image aesthetic assessment (IAA) are narrow in perception scope or lack the diversity needed to evaluate systematic aesthetic production. To address this gap, we introduce AesTest, a comprehensive benchmark for multimodal aesthetic perception and production, distinguished by the following features: 1) It consists of curated multiple-choice questions spanning ten tasks, covering perception, appreciation, creation, and photography. These tasks are grounded in psychological theories of generative learning. 2) It integrates data from diverse sources, including professional editing workflows, photographic composition tutorials, and crowdsourced preferences. It ensures coverage of both expert-level principles and real-world variation. 3) It supports various aesthetic query types, such as attribute-based analysis, emotional resonance, compositional choice, and stylistic reasoning. We evaluate both instruction-tuned IAA MLLMs and general MLLMs on AesTest, revealing significant challenges in building aesthetic intelligence. We will publicly release AesTest to support future research in this area.
[140] V-Shuffle: Zero-Shot Style Transfer via Value Shuffle
Haojun Tang,Qiwei Lin,Tongda Xu,Lida Huang,Yan Wang
Main category: cs.CV
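TL;DR: V-Shuffle is a zero-shot style transfer method that shuffles the value features in the self-attention layers of a diffusion model to preserve low-level style representations, effectively addressing content leakage.
Details
Motivation: Existing attention-injection-based style transfer methods suffer from content leakage, where semantic content from the style image mistakenly appears in the stylized output. The authors aim to balance content preservation and style fidelity by making effective use of multiple style images. Contribution: V-Shuffle, which addresses content leakage through value feature shuffling and Hybrid Style Regularization, and performs well with both multiple style images and a single style image.
Method: V-Shuffle shuffles the value features in the diffusion model's self-attention layers to implicitly disrupt the semantic content of the style images, and adds Hybrid Style Regularization to supply high-level style textures.
Result: Experiments show that V-Shuffle performs very well with multiple style images and, with a single style image, outperforms previous state-of-the-art methods.
Insight: Shuffling features, rather than directly suppressing semantic content, separates style from content more effectively and offers a new direction for style transfer.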
Abstract: Attention injection-based style transfer has achieved remarkable progress in recent years. However, existing methods often suffer from content leakage, where the undesired semantic content of the style image mistakenly appears in the stylized output. In this paper, we propose V-Shuffle, a zero-shot style transfer method that leverages multiple style images from the same style domain to effectively navigate the trade-off between content preservation and style fidelity. V-Shuffle implicitly disrupts the semantic content of the style images by shuffling the value features within the self-attention layers of the diffusion model, thereby preserving low-level style representations. We further introduce a Hybrid Style Regularization that complements these low-level representations with high-level style textures to enhance style fidelity. Empirical results demonstrate that V-Shuffle achieves excellent performance when utilizing multiple style images. Moreover, when applied to a single style image, V-Shuffle outperforms previous state-of-the-art methods.
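The value-shuffle idea in entry [140] can be illustrated with a toy attention computation. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the shuffle is a random permutation of the style value vectors across token positions while the keys stay in place, and the names `attention` and `shuffled_value_attention` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


def shuffled_value_attention(queries, style_keys, style_values, seed=0):
    """Attend to style keys, but read from a permuted copy of the style values.

    Assumption: permuting values across positions breaks the link between
    where a style token sits and what it contributes, while the set of
    value vectors (the low-level style statistics) is unchanged.
    """
    order = list(range(len(style_values)))
    random.Random(seed).shuffle(order)
    shuffled = [style_values[i] for i in order]
    return attention(queries, style_keys, shuffled)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# A zero query attends uniformly, so the result is the mean of the values
# no matter how they are permuted.
uniform_out = shuffled_value_attention([[0.0, 0.0]], keys, values)
```

With a non-zero query the output generally changes under the shuffle, because each attention weight is now paired with a different value vector.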
[141] InfoAffect: A Dataset for Affective Analysis of Infographics
Zihang Fu,Yunchao Wang,Chenyu Huang,Guodao Sun,Ronghua Liang
Main category: cs.CV
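TL;DR: The paper introduces InfoAffect, a dataset for studying the affective dimensions of infographics, filling a gap in data resources for this area. The dataset's quality is validated with multimodal language models and a user study.
Details
Motivation: Infographics are widely used to convey complex information, but research on their affective dimensions is held back by scarce data resources, so a high-quality dataset is needed. Contribution: The InfoAffect dataset, with 3.5k affect-annotated samples combining textual content with real-world infographics, validated through multimodal analysis and a user study.
Method: Dataset construction covers data collection, preprocessing, affect table construction, and quality control. Five multimodal large language models (MLLMs) analyze the text and infographic modalities, and their outputs are fused with the Reciprocal Rank Fusion (RRF) algorithm. Dataset quality is assessed through a user study and the Composite Affect Consistency Index (CACI).
Result: The user study gives InfoAffect a CACI score of 0.986, indicating high accuracy and consistency.
Insight: Affective analysis of infographics is an emerging and important research direction, and combining multimodal models with a high-quality dataset lays a solid foundation for future work.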
Abstract: Infographics are widely used to convey complex information, yet their affective dimensions remain underexplored due to the scarcity of data resources. We introduce a 3.5k-sample affect-annotated InfoAffect dataset, which combines textual content with real-world infographics. We first collect the raw data from six domains and align them via preprocessing, the accompanied-text-priority method, and three strategies to guarantee quality and compliance. We then construct an affect table and use it to constrain annotation. Five state-of-the-art multimodal large language models (MLLMs) then analyze both modalities, and their outputs are fused with the Reciprocal Rank Fusion (RRF) algorithm to yield robust affects and confidences. We conducted a user study with two experiments to validate usability and assess the InfoAffect dataset using the Composite Affect Consistency Index (CACI), achieving an overall score of 0.986, which indicates high accuracy.
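The fusion step in entry [141] names Reciprocal Rank Fusion, whose standard form scores each item as the sum of 1 / (k + rank) over all input rankings. The sketch below shows that standard form only; how the paper maps MLLM outputs to rankings and confidences is not stated in the abstract, so the example labels, the function name, and the constant k = 60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists into one scored ranking.

    rankings: list of lists of labels, each ordered best-first.
    k: smoothing constant; 60 is a common default and an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by label for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical affect rankings from three models for one infographic.
model_outputs = [
    ["joy", "trust", "surprise"],
    ["trust", "joy", "fear"],
    ["joy", "surprise", "trust"],
]
fused = reciprocal_rank_fusion(model_outputs)
```

Because only ranks enter the score, the fusion does not need the models' raw confidence values to be on a comparable scale.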
[142] On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective
Shuo Yang,Yinghui Xing,Shizhou Zhang,Zhilong Niu
Main category: cs.CV
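TL;DR: The paper studies the modality-incomplete problem in infrared-visible object detection (IVOD) and proposes Scarf Neck, a module for DETR variants whose modality-agnostic deformable attention lets the detector adapt flexibly to one or two modalities. It also introduces a pseudo modality dropout strategy and a new benchmark; Scarf-DETR performs well in both missing-modality and complete-modality settings.
Details
Motivation: Existing IVOD models degrade markedly when a modality is missing, especially the dominant one, so an architecture with strong compatibility is needed. Contribution: 1. The Scarf Neck module, which introduces a modality-agnostic deformable attention mechanism; 2. A pseudo modality dropout strategy that makes full use of multi-modality information; 3. A comprehensive benchmark for the modality-incomplete IVOD task.
Method: 1. Scarf Neck plugs into DETR variants and supports flexible single- or dual-modality training and inference; 2. Pseudo modality dropout improves compatibility; 3. The new benchmark evaluates missing-modality cases.
Result: Scarf-DETR performs well in both missing-modality and complete-modality settings and outperforms existing methods.
Insight: Modality-agnostic design is key, and pseudo modality dropout effectively improves robustness to missing modalities.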
Abstract: Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advancements, current IVOD models exhibit notable performance declines when confronted with incomplete modality data, particularly if the dominant modality is missing. In this paper, we thoroughly investigate the modality-incomplete IVOD problem from an architecture compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism to enable the IVOD detector to flexibly adapt to any single or double modalities during training and inference. When training Scarf-DETR, we design a pseudo modality dropout strategy to fully utilize the multi-modality information, making the detector compatible with and robust to both single- and double-modality working modes. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task aimed at thoroughly assessing situations where the absent modality is either dominant or secondary. Our proposed Scarf-DETR not only performs excellently in missing-modality scenarios but also achieves superior performance on the standard modality-complete IVOD benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.
[143] VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes
Zhengyu Zou,Jingfeng Li,Hao Li,Xiaolei Hou,Jinwen Hu,Jingkun Chen,Lechao Cheng,Dingwen Zhang
Main category: cs.CV
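TL;DR: VDNeRF is a vision-only dynamic neural radiance field method that tackles camera pose estimation and novel view synthesis in dynamic urban scenes without additional sensor data.
Details
Motivation: Existing NeRF methods struggle in dynamic urban scenes because accurate camera poses are hard to obtain and dynamic objects are handled poorly. VDNeRF aims to overcome these challenges and make NeRF more practical for autonomous driving and robotic perception. Contribution: 1. VDNeRF, which models the scene jointly with a static and a dynamic NeRF; 2. A training framework that resolves the ambiguity between camera motion and object motion; 3. Validation on mainstream datasets showing the method's advantages.
Method: 1. A static NeRF optimizes camera poses and the background, while a dynamic NeRF incorporates 3D scene flow; 2. A self-supervised framework decomposes static and dynamic elements.
Result: VDNeRF outperforms existing pose-free NeRF methods in both camera pose estimation and dynamic novel view synthesis.
Insight: By modeling static and dynamic parts of the scene separately, VDNeRF addresses the difficulty of applying NeRF in dynamic environments and shows the potential of self-supervised learning.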
Abstract: Neural Radiance Fields (NeRFs) implicitly model continuous three-dimensional scenes using a set of images with known camera poses, enabling the rendering of photorealistic novel views. However, existing NeRF-based methods encounter challenges in applications such as autonomous driving and robotic perception, primarily due to the difficulty of capturing accurate camera poses and limitations in handling large-scale dynamic environments. To address these issues, we propose Vision-only Dynamic NeRF (VDNeRF), a method that accurately recovers camera trajectories and learns spatiotemporal representations for dynamic urban scenes without requiring additional camera pose information or expensive sensor data. VDNeRF employs two separate NeRF models to jointly reconstruct the scene. The static NeRF model optimizes camera poses and static background, while the dynamic NeRF model incorporates the 3D scene flow to ensure accurate and consistent reconstruction of dynamic objects. To address the ambiguity between camera motion and independent object motion, we design an effective and powerful training framework to achieve robust camera pose estimation and self-supervised decomposition of static and dynamic elements in a scene. Extensive evaluations on mainstream urban driving datasets demonstrate that VDNeRF surpasses state-of-the-art NeRF-based pose-free methods in both camera pose estimation and dynamic novel view synthesis.
[144] DiffusionUavLoc: Visually Prompted Diffusion for Cross-View UAV Localization
Tao Liu,Kan Ren,Qian Chen
Main category: cs.CV
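TL;DR: The paper proposes DiffusionUavLoc, a diffusion-based cross-view UAV localization framework that uses training-free geometric rendering and a text-free conditional diffusion model to bridge the viewpoint gap between UAV and satellite imagery and improve localization.
Details
Motivation: In GNSS-denied environments, UAV localization relies on cross-view retrieval against satellite imagery, but there are substantial geometric and appearance domain gaps between oblique UAV views and nadir satellite orthophotos. Conventional methods depend on complex networks or large amounts of annotation, which limits generalization. Contribution: 1. The DiffusionUavLoc framework, which combines geometric rendering with a diffusion model for text-free, image-prompted cross-view localization; 2. A VAE for unified representation, improving feature robustness; 3. Competitive results on the University-1652 and SUES-200 datasets.
Method: 1. Training-free geometric rendering synthesizes pseudo-satellite images from UAV imagery as structural prompts; 2. A text-free conditional diffusion model fuses multimodal structural cues; 3. Descriptors are extracted at a fixed time step t and compared by cosine similarity.
Result: The method performs competitively on University-1652 and SUES-200, especially on the satellite-to-drone task in University-1652.
Insight: Diffusion models can be used to learn cross-view image similarity, and geometric rendering offers an efficient, training-free way to generate pseudo data.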
Abstract: With the rapid growth of the low-altitude economy, unmanned aerial vehicles (UAVs) have become key platforms for measurement and tracking in intelligent patrol systems. However, in GNSS-denied environments, localization schemes that rely solely on satellite signals are prone to failure. Cross-view image retrieval-based localization is a promising alternative, yet substantial geometric and appearance domain gaps exist between oblique UAV views and nadir satellite orthophotos. Moreover, conventional approaches often depend on complex network architectures, text prompts, or large amounts of annotation, which hinders generalization. To address these issues, we propose DiffusionUavLoc, a cross-view localization framework that is image-prompted, text-free, diffusion-centric, and employs a VAE for unified representation. We first use training-free geometric rendering to synthesize pseudo-satellite images from UAV imagery as structural prompts. We then design a text-free conditional diffusion model that fuses multimodal structural cues to learn features robust to viewpoint changes. At inference, descriptors are computed at a fixed time step t and compared using cosine similarity. On University-1652 and SUES-200, the method performs competitively for cross-view localization, especially for satellite-to-drone in University-1652. Our data and code will be published at the following URL: https://github.com/liutao23/DiffusionUavLoc.git.
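The inference step in entry [144] compares descriptors by cosine similarity. A minimal sketch of that retrieval step follows, assuming the descriptors have already been extracted at the fixed time step t; the descriptor values, tile names, and function names are hypothetical and only illustrate the comparison, not the diffusion model itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def retrieve(query_descriptor, gallery):
    """Return the gallery key whose descriptor is most similar to the query."""
    return max(gallery, key=lambda name: cosine_similarity(query_descriptor, gallery[name]))


# Hypothetical satellite-tile descriptors and one UAV query descriptor.
satellite_gallery = {
    "tile_a": [1.0, 0.0, 0.2],
    "tile_b": [0.0, 1.0, 0.1],
}
best_match = retrieve([0.9, 0.1, 0.2], satellite_gallery)
```

Cosine similarity ignores descriptor magnitude, so the match depends only on direction in feature space.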
[145] Countering Multi-modal Representation Collapse through Rank-targeted Fusion
Seulgi Kim,Kiran Kokilepersaud,Mohit Prabhushankar,Ghassan AlRegib
Main category: cs.CV
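TL;DR: The paper proposes a unified framework that addresses both feature collapse and modality collapse in multimodal fusion. By introducing effective rank and a rank-enhancing token fuser, it significantly improves performance on action anticipation.
Details
Motivation: Multimodal fusion methods often suffer from feature collapse and modality collapse, and existing methods do not address the two together. The paper uses effective rank to quantify and counter both forms of collapse. Contribution: 1. Effective rank as a unified measure of feature collapse and modality collapse;
2. The Rank-enhancing Token Fuser, which selectively blends complementary features to raise effective rank;
3. A demonstration that fusing depth with RGB avoids modality collapse;
4. Validation of the method on action anticipation.
Method: The Rank-enhancing Token Fuser selectively blends complementary features across modalities to raise the effective rank and keep the fused representation discriminative. Modality combinations are also evaluated to avoid modality collapse.
Result: On the NTURGBD, UTKinect, and DARai datasets, the R3D framework significantly outperforms existing methods, by up to 3.74%.
Insight: Effective rank is a single measure that captures both feature collapse and modality collapse, and adding depth information helps rebalance RGB-dominated fusion.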
Abstract: Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both the representation collapses. We propose Rank-enhancing Token Fuser, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each other’s effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present R3D, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74%. Our code is available at: https://github.com/olivesgatech/R3D.
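The abstract above does not spell out how effective rank is computed. One common entropy-based definition takes the exponential of the Shannon entropy of the normalized singular values; the sketch below assumes that definition, which may differ from the one used in the paper. The function name `effective_rank` is hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    Assumed definition: exp(H(p)), where p_i = s_i / sum(s) and H is the
    Shannon entropy. The paper may use a different formulation.
    """
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("need at least one positive singular value")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0.0:  # terms with p == 0 contribute nothing to the entropy
            entropy -= p * math.log(p)
    return math.exp(entropy)


# A flat spectrum uses all four dimensions; a collapsed one uses a single dimension.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])
```

Under this definition a representation whose spectrum is dominated by a few directions scores low, which is the sense in which a falling effective rank signals collapse.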
[146] NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
Kyuho Lee,Euntae Kim,Jinwoo Choi,Buru Chang
Main category: cs.CV
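TL;DR: NOAH is a new benchmark for evaluating hallucinations and omissions driven by narrative priors in video large language models (Video LLMs). Using composite videos and several tasks, it shows that models tend to rely on narrative consistency rather than visual evidence and characterizes the resulting error patterns.
Details
Motivation: Video LLMs perform well on tasks such as captioning and question answering, but over-reliance on narrative coherence can lead to hallucinations and omissions. The lack of a benchmark that systematically evaluates these errors hinders improvements in reliability.
Contribution: Proposes the NOAH benchmark, the first standardized evaluation of narrative prior-induced hallucination and omission; designs a composite-video construction method and multiple tasks (captioning and QA), yielding more than 60K samples.
Method: Composite videos are built by inserting clips with varying semantic similarity at varying positions. One captioning task and three QA tasks (Existence, Temporal, and Narrative) are used to analyze model behavior under different conditions.
Result: (1) Most Video LLMs exhibit hallucinations and omissions driven by narrative priors; (2) error patterns vary with architecture, event similarity, and insertion position; (3) reliance on narrative priors intensifies when fewer frames are sampled.
Insight: NOAH exposes the downside of relying on narrative consistency, points to directions for improving future models, and underlines the need to balance visual evidence against narrative coherence.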
Abstract: Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks - Existence, Temporal, and Narrative - yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies under sampling with fewer frames, amplifying errors when event continuity is weak. We establish NOAH as the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.
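The composite-video construction described in the NOAH abstract (inserting a clip from another source at a chosen position) can be sketched with plain lists. Clips are represented by placeholder strings, and the helper names `build_composite` and `all_insertions` are hypothetical; the benchmark's actual pipeline, including how semantic similarity is controlled, is not shown in the abstract.

```python
def build_composite(target_clips, inserted_clip, position):
    """Insert one foreign clip into a target video at the given clip index."""
    if not 0 <= position <= len(target_clips):
        raise ValueError("position must lie between 0 and len(target_clips)")
    return target_clips[:position] + [inserted_clip] + target_clips[position:]


def all_insertions(target_clips, inserted_clip):
    """Enumerate composites for every possible insertion position."""
    return [
        build_composite(target_clips, inserted_clip, p)
        for p in range(len(target_clips) + 1)
    ]


target = ["t0", "t1", "t2"]  # placeholder clip identifiers
composite = build_composite(target, "x", 1)
variants = all_insertions(target, "x")
```

Enumerating positions this way gives one controlled variable; a second one, the similarity between the inserted clip and the target, would require a similarity score that this sketch does not model.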
[147] Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models
Yule Chen,Yufan Ren,Sabine Süsstrunk
Main category: cs.CV
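TL;DR: The paper analyzes the challenges that comic understanding poses for vision-language models (VLMs), introduces AI4VA-FG, the first fine-grained benchmark for comic understanding, and proposes Region-Aware Reinforcement Learning (RARL), which significantly improves model performance on comic tasks.
Details
Motivation: Comics challenge VLMs with their distinctive visual style, including line art, onomatopoeia, and multi-panel layouts. Existing VLMs do well on natural images but still fall clearly short on comics.
Contribution: 1) Introduces AI4VA-FG, the first fine-grained benchmark for comic understanding; 2) Evaluates several state-of-the-art models, showing that comic understanding remains unsolved; 3) Proposes RARL, which significantly improves model performance.
Method: 1) Designs a multi-task benchmark covering recognition, detection, character reasoning, and narrative construction; 2) Studies post-training strategies (SFT-S, SFT-R, RL); 3) Proposes RARL, which trains the model to attend dynamically to relevant regions through zoom-in operations.
Result: Applied to Qwen2.5-VL, RL and RARL yield significant gains in low-level entity recognition and high-level storyline ordering.
Insight: Region-aware attention mechanisms such as RARL are an effective way to improve VLMs on complex visual narratives such as comics.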
Abstract: Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. Beyond that, we evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial performance deficits across core tasks of our benchmarks and underscoring that comic understanding remains an unsolved challenge. To enhance VLMs’ capabilities in this domain, we systematically investigate post-training strategies, including supervised fine-tuning on solutions (SFT-S), supervised fine-tuning on reasoning trajectories (SFT-R), and reinforcement learning (RL). Beyond that, inspired by the emerging “Thinking with Images” paradigm, we propose Region-Aware Reinforcement Learning (RARL) for VLMs, which trains models to dynamically attend to relevant regions through zoom-in operations. We observe that when applied to the Qwen2.5-VL model, RL and RARL yield significant gains in low-level entity recognition and high-level storyline ordering, paving the way for more accurate and efficient VLM applications in the comics domain.
[148] SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports
Haotian Xia,Haonan Ge,Junbo Zou,Hyun Woo Choi,Xuebin Zhang,Danny Suradja,Botao Rui,Ethan Tran,Wendy Jin,Zhen Ye,Xiyang Lin,Christopher Lai,Shengjie Zhang,Junwen Miao,Shichao Chen,Rhys Tracy,Vicente Ordonez,Weining Shen,Hanjie Chen
Main category: cs.CV
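TL;DR: SportR is a large-scale multimodal benchmark spanning multiple sports. It evaluates visual perception and rule-based reasoning and provides image and video data with chain-of-thought annotations.
Details
Motivation: Existing sports benchmarks either cover a single sport or lack the fine-grained visual annotations and detailed reasoning chains needed to evaluate multimodal reasoning in sports comprehensively.
Contribution: 1) SportR, the first multi-sport multimodal benchmark; 2) A dataset of 5,017 images and 2,101 videos; 3) A progressive hierarchy of question-answer pairs and 7,118 human-authored chain-of-thought annotations.
Method: 1) Builds an image and video dataset covering multiple sports; 2) Designs hierarchical QA tasks that progress from simple to complex; 3) Evaluates models with supervised fine-tuning and reinforcement learning.
Result: Baseline models perform poorly on the most challenging tasks. Training on the data improves scores, but they remain relatively low, which highlights the limits of current model reasoning.
Insight: SportR reveals a significant gap in how multimodal models combine visual detail with rule-based reasoning and offers a direction for future research.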
Abstract: Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.
[149] Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR)
Tobias Rueckert,Raphaela Maerkl,David Rauber,Leonard Klausmann,Max Gutbrod,Daniel Rueckert,Hubertus Feussner,Dirk Wilhelm,Christoph Palm
Main category: cs.CV
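TL;DR: PhaKIR is the first multi-center laparoscopic surgery video dataset with multi-task annotations. It covers surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation, supports temporal context modeling, and served as the basis for the PhaKIR Challenge within the EndoVis Challenge at MICCAI 2024.
Details
Motivation: Existing datasets usually address a single task or lack multi-center data, which limits progress in computer vision for robotic- and computer-assisted minimally invasive surgery (RAMIS). PhaKIR aims to fill this gap with comprehensive multi-task annotations.
Contribution: PhaKIR is the first dataset to jointly provide surgical phase labels, instrument pose information, and pixel-accurate segmentation annotations. It supports multi-task learning and temporal context modeling and served as the basis for a MICCAI 2024 challenge.
Method: The dataset comprises eight complete laparoscopic cholecystectomy videos from three medical centers, with frame-level annotations for three tasks: phase recognition, keypoint estimation, and instance segmentation.
Result: PhaKIR provides surgical phase annotations for 485,875 frames and instrument keypoint and segmentation annotations for 19,435 frames, making it the first multi-center, multi-task dataset for surgical scene understanding.
Insight: Joint multi-task annotation and multi-center data support better model generalization and provide an important resource for further research on surgical scene understanding.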
Abstract: Robotic- and computer-assisted minimally invasive surgery (RAMIS) is increasingly relying on computer vision methods for reliable instrument recognition and surgical workflow understanding. Developing such systems often requires large, well-annotated datasets, but existing resources often address isolated tasks, neglect temporal dependencies, or lack multi-center variability. We present the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) dataset, comprising eight complete laparoscopic cholecystectomy videos recorded at three medical centers. The dataset provides frame-level annotations for three interconnected tasks: surgical phase recognition (485,875 frames), instrument keypoint estimation (19,435 frames), and instrument instance segmentation (19,435 frames). PhaKIR is, to our knowledge, the first multi-institutional dataset to jointly provide phase labels, instrument pose information, and pixel-accurate instrument segmentations, while also enabling the exploitation of temporal context since full surgical procedure sequences are available. It served as the basis for the PhaKIR Challenge as part of the Endoscopic Vision (EndoVis) Challenge at MICCAI 2024 to benchmark methods in surgical scene understanding, thereby further validating the dataset’s quality and relevance. The dataset is publicly available upon request via the Zenodo platform.
[150] Spatial-Frequency Enhanced Mamba for Multi-Modal Image Fusion
Hui Sun,Long Lv,Pingping Zhang,Tongdan Tang,Feng Tian,Weibing Sun,Huchuan Lu
Main category: cs.CV
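TL;DR: This paper proposes SFMFusion, a new framework for multi-modal image fusion (MMIF) that combines a Spatial-Frequency Enhanced Mamba block with dynamic fusion and significantly improves fusion quality.
Details
Motivation: Existing CNN- and Transformer-based methods perform unsatisfactorily on MMIF because CNNs have a limited receptive field and Transformers are computationally expensive. Mamba can model long-range dependencies but lacks full spatial and frequency perception.
Contribution: 1) Proposes a three-branch structure that couples MMIF with image reconstruction; 2) Designs the Spatial-Frequency Enhanced Mamba Block (SFMB); 3) Proposes the Dynamic Fusion Mamba Block (DFMB), which improves feature extraction and fusion.
Method: SFMB strengthens the spatial and frequency perception of Mamba, DFMB dynamically fuses features across branches, and the auxiliary image reconstruction task helps optimize MMIF.
Result: On six MMIF datasets, SFMFusion outperforms most state-of-the-art methods.
Insight: With spatial-frequency enhancement and dynamic fusion, Mamba can be used more effectively for MMIF, and the image reconstruction task noticeably improves fusion quality.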
Abstract: Multi-Modal Image Fusion (MMIF) aims to integrate complementary image information from different modalities to produce informative images. Previous deep learning-based MMIF methods generally adopt Convolutional Neural Networks (CNNs) or Transformers for feature extraction. However, these methods deliver unsatisfactory performances due to the limited receptive field of CNNs and the high computational cost of Transformers. Recently, Mamba has demonstrated a powerful potential for modeling long-range dependencies with linear complexity, providing a promising solution to MMIF. Unfortunately, Mamba lacks full spatial and frequency perceptions, which are very important for MMIF. Moreover, employing Image Reconstruction (IR) as an auxiliary task has been proven beneficial for MMIF. However, a primary challenge is how to leverage IR efficiently and effectively. To address the above issues, we propose a novel framework named Spatial-Frequency Enhanced Mamba Fusion (SFMFusion) for MMIF. More specifically, we first propose a three-branch structure to couple MMIF and IR, which can retain complete contents from source images. Then, we propose the Spatial-Frequency Enhanced Mamba Block (SFMB), which can enhance Mamba in both spatial and frequency domains for comprehensive feature extraction. Finally, we propose the Dynamic Fusion Mamba Block (DFMB), which can be deployed across different branches for dynamic feature fusion. Extensive experiments show that our method achieves better results than most state-of-the-art methods on six MMIF datasets. The source code is available at https://github.com/SunHui1216/SFMFusion.
[151] Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from LDCT
Yifei Zhang,Jiashuo Zhang,Xiaofeng Yang,Liang Zhao
Main category: cs.CV
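TL;DR: This paper proposes an explainable cross-disease reasoning framework that assesses cardiopulmonary risk jointly from a single LDCT scan. Its core is an agentic reasoning process that emulates clinical diagnostic thinking, combining pulmonary perception, knowledge-guided reasoning, and cardiac representation modules to predict cardiovascular risk accurately and interpretably.
Details
Motivation: LDCT captures both pulmonary and cardiac structures, but most existing methods treat the two as independent tasks and overlook their physiological interplay and shared biomarkers. A joint assessment method is needed for more accurate cardiovascular risk assessment.
Contribution: 1. Proposes the first explainable cross-disease reasoning framework that combines pulmonary and cardiac information from LDCT; 2. Emulates clinical diagnostic thinking through agentic reasoning to give physiologically meaningful explanations; 3. Outperforms single-task and purely image-based baselines on the NLST dataset.
Method: 1. A pulmonary perception module summarizes lung abnormalities; 2. A knowledge-guided reasoning module infers their cardiovascular implications; 3. A cardiac representation module encodes structural biomarkers. The outputs of the three modules are fused to produce the cardiovascular risk prediction.
Result: On the NLST cohort, the framework achieves state-of-the-art performance for CVD screening and mortality prediction, and its explanations are consistent with cardiological understanding.
Insight: The study shows the potential of LDCT for joint cardiopulmonary assessment and underlines the importance of mechanism-based explanation for medical AI, offering more trustworthy support for clinical decisions.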
Abstract: Low-dose chest computed tomography (LDCT) inherently captures both pulmonary and cardiac structures, offering a unique opportunity for joint assessment of lung and cardiovascular health. However, most existing approaches treat these domains as independent tasks, overlooking their physiological interplay and shared imaging biomarkers. We propose an Explainable Cross-Disease Reasoning Framework that enables interpretable cardiopulmonary risk assessment from a single LDCT scan. The framework introduces an agentic reasoning process that emulates clinical diagnostic thinking: first perceiving pulmonary findings, then reasoning through established medical knowledge, and finally deriving a cardiovascular judgment with explanatory rationale. It integrates three synergistic components: a pulmonary perception module that summarizes lung abnormalities, a knowledge-guided reasoning module that infers their cardiovascular implications, and a cardiac representation module that encodes structural biomarkers. Their outputs are fused to produce a holistic cardiovascular risk prediction that is both accurate and physiologically grounded. Experiments on the NLST cohort demonstrate that the proposed framework achieves state-of-the-art performance for CVD screening and mortality prediction, outperforming single-disease and purely image-based baselines. Beyond quantitative gains, the framework provides human-verifiable reasoning that aligns with cardiological understanding, revealing coherent links between pulmonary abnormalities and cardiac stress mechanisms. Overall, this work establishes a unified and explainable paradigm for cardiovascular analysis from LDCT, bridging the gap between image-based prediction and mechanism-based medical interpretation.
[152] DIAL-GS: Dynamic Instance Aware Reconstruction for Label-free Street Scenes with 4D Gaussian Splatting
Chenpeng Su,Wenhua Wu,Chensheng Peng,Tianchen Deng,Zhe Liu,Hesheng Wang
Main category: cs.CV
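TL;DR: DIAL-GS is a dynamic instance-aware reconstruction method based on 4D Gaussian Splatting for label-free street scenes. It identifies dynamic instances through appearance-position inconsistency and achieves dynamic-adaptive reconstruction.
Details
Motivation: Urban scene reconstruction is critical for autonomous driving, but supervised methods rely on costly annotations, while self-supervised methods struggle to separate static from dynamic elements, which limits fine-grained editing.
Contribution: Proposes DIAL-GS, which achieves dynamic instance-aware 4D Gaussian Splatting reconstruction and improves reconstruction quality through a reciprocal mechanism in which identity and dynamics reinforce each other.
Method: Dynamic instances are identified from appearance-position inconsistency, instance-aware 4D Gaussians serve as the unified volumetric representation, and the reciprocal identity-dynamics mechanism refines the result.
Result: In urban driving experiments, DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing.
Insight: Combining dynamic instance awareness with 4D Gaussian Splatting gives an efficient and editable solution for label-free street scene reconstruction.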
Abstract: Urban scene reconstruction is critical for autonomous driving, enabling structured 3D representations for data synthesis and closed-loop testing. Supervised approaches rely on costly human annotations and lack scalability, while current self-supervised methods often confuse static and dynamic elements and fail to distinguish individual dynamic objects, limiting fine-grained editing. We propose DIAL-GS, a novel dynamic instance-aware reconstruction method for label-free street scenes with 4D Gaussian Splatting. We first accurately identify dynamic instances by exploiting appearance-position inconsistency between warped rendering and actual observation. Guided by instance-level dynamic perception, we employ instance-aware 4D Gaussians as the unified volumetric representation, realizing dynamic-adaptive and instance-aware reconstruction. Furthermore, we introduce a reciprocal mechanism through which identity and dynamics reinforce each other, enhancing both integrity and consistency. Experiments on urban driving scenarios show that DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing, offering a concise yet powerful solution for urban scene modeling.
[153] UniADC: A Unified Framework for Anomaly Detection and Classification
Ximiao Zhang,Min Xu,Zheng Zhang,Junlin Hu,Xiuzhuang Zhou
Main category: cs.CV
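TL;DR: UniADC proposes a unified framework for anomaly detection and classification. With a training-free controllable inpainting network and a multi-task discriminator, it overcomes the limitation of treating the two tasks separately and significantly improves performance.
Details
Motivation: Existing methods usually treat anomaly detection and classification as independent tasks, ignoring their correlation, which limits information sharing and leads to suboptimal performance. UniADC aims to unify the two tasks.
Contribution: Proposes the UniADC framework, which combines a controllable inpainting network with a multi-task discriminator to perform anomaly detection and classification with few or no anomaly samples.
Method: 1. A training-free controllable inpainting network synthesizes anomaly images of specific categories by repainting normal regions; 2. A multi-task discriminator is trained on the synthesized samples and aligns fine-grained image features with anomaly-category embeddings.
Result: Experiments on MVTec-FS, MTD, and WFDD show that UniADC outperforms existing methods in anomaly detection, localization, and classification.
Insight: By unifying anomaly detection and classification, UniADC shows how generative models and data augmentation can address few-shot anomaly learning.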
Abstract: In this paper, we introduce the task of unified anomaly detection and classification, which aims to simultaneously detect anomalous regions in images and identify their specific categories. Existing methods typically treat anomaly detection and classification as separate tasks, thereby neglecting their inherent correlation, limiting information sharing, and resulting in suboptimal performance. To address this, we propose UniADC, a unified anomaly detection and classification model that can effectively perform both tasks with only a few or even no anomaly images. Specifically, UniADC consists of two key components: a training-free controllable inpainting network and a multi-task discriminator. The inpainting network can synthesize anomaly images of specific categories by repainting normal regions guided by anomaly priors, and can also repaint few-shot anomaly samples to augment the available anomaly data. The multi-task discriminator is then trained on these synthesized samples, enabling precise anomaly detection and classification by aligning fine-grained image features with anomaly-category embeddings. We conduct extensive experiments on three anomaly detection and classification datasets, including MVTec-FS, MTD, and WFDD, and the results demonstrate that UniADC consistently outperforms existing methods in anomaly detection, localization, and classification. The code is available at https://github.com/cnulab/UniADC.
[154] FreqGRL: Suppressing Low-Frequency Bias and Mining High-Frequency Knowledge for Cross-Domain Few-Shot Learning
Siqi Hui,Sanping Zhou,Ye deng,Wenli Huang,Jinjun Wang
Main category: cs.CV
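TL;DR: The paper proposes FreqGRL, a frequency-space framework for cross-domain few-shot learning (CD-FSL). Its low-frequency replacement and high-frequency enhancement modules address data imbalance and significantly improve CD-FSL performance.
Details
Motivation: In CD-FSL, target-domain data are scarce, so models are easily biased toward the low-frequency components of the source domain and generalize poorly. The paper analyzes the problem from a frequency-space perspective and proposes a solution.
Contribution: 1. Gives the first frequency-space analysis of data imbalance in CD-FSL; 2. Proposes the Low-Frequency Replacement (LFR) and High-Frequency Enhancement (HFE) modules; 3. Introduces a Global Frequency Filter (GFF) to reduce the influence of noisy frequencies.
Method: The LFR module replaces the low-frequency components of source tasks with those from the target domain, the HFE module learns directly on high-frequency features, and the GFF suppresses noisy frequencies, together improving cross-domain generalization.
Result: FreqGRL achieves state-of-the-art performance on five standard CD-FSL benchmarks.
Insight: Frequency-space analysis offers a new view of CD-FSL: low-frequency bias and the difficulty of learning high-frequency features from sparse target data are key obstacles to cross-domain generalization, and frequency-level operations can mitigate them.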
Abstract: Cross-domain few-shot learning (CD-FSL) aims to recognize novel classes with only a few labeled examples under significant domain shifts. While recent approaches leverage a limited amount of labeled target-domain data to improve performance, the severe imbalance between abundant source data and scarce target data remains a critical challenge for effective representation learning. We present the first frequency-space perspective to analyze this issue and identify two key challenges: (1) models are easily biased toward source-specific knowledge encoded in the low-frequency components of source data, and (2) the sparsity of target data hinders the learning of high-frequency, domain-generalizable features. To address these challenges, we propose FreqGRL, a novel CD-FSL framework that mitigates the impact of data imbalance in the frequency space. Specifically, we introduce a Low-Frequency Replacement (LFR) module that substitutes the low-frequency components of source tasks with those from the target domain to create new source tasks that better align with target characteristics, thus reducing source-specific biases and promoting generalizable representation learning. We further design a High-Frequency Enhancement (HFE) module that filters out low-frequency components and performs learning directly on high-frequency features in the frequency space to improve cross-domain generalization. Additionally, a Global Frequency Filter (GFF) is incorporated to suppress noisy or irrelevant frequencies and emphasize informative ones, mitigating overfitting risks under limited target supervision. Extensive experiments on five standard CD-FSL benchmarks demonstrate that our frequency-guided framework achieves state-of-the-art performance.
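The idea behind Low-Frequency Replacement in the FreqGRL abstract, swapping the low-frequency components of a source sample for those of a target sample, can be illustrated on a 1D toy signal. The sketch uses a naive discrete Fourier transform and a hard cutoff; both are assumptions for illustration. The paper operates on images and may use a different transform, mask shape, or mixing rule, and the function names here are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def replace_low_frequencies(source, target, cutoff):
    """Use target bins where the frequency index is <= cutoff, source bins elsewhere."""
    n = len(source)
    if len(target) != n:
        raise ValueError("source and target must have the same length")
    src_spec, tgt_spec = dft(source), dft(target)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # distance from DC; keeps conjugate symmetry intact
        mixed.append(tgt_spec[k] if freq <= cutoff else src_spec[k])
    return idft(mixed)


source = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]  # mean 1.5 plus fast oscillation
target = [5.0] * 8                                  # constant signal, mean 5
swapped = replace_low_frequencies(source, target, cutoff=0)
```

With `cutoff=0` only the DC bin is swapped, so the result keeps the fast oscillation of the source while taking on the mean level of the target.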
[155] NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation
Kyung-Yoon Yoon,Yeong-Jun Cho
Main category: cs.CV
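TL;DR: NOVO is a new framework that connects vision-language models (VLMs) and segmentation models through visual-only prompts (no text), exploiting the pretrained capabilities of SAM for high-quality segmentation.
Details
Motivation: Existing methods usually feed text-derived SEG token embeddings into the segmentation model, which limits how well its pretrained capabilities are used. NOVO addresses this with visual-only prompts.
Contribution: 1. Proposes the NOVO framework, which connects VLMs and SAM through visual prompts. 2. Introduces a training-free refinement module that improves segmentation quality. 3. Presents the RISeg benchmark dataset.
Method: NOVO generates a coarse mask and point prompts from the VLM output and passes them to SAM as input. A refinement module reduces visual artifacts and improves the segmentation masks.
Result: NOVO achieves state-of-the-art performance across multiple metrics and model sizes, showing its effectiveness and scalability in reasoning segmentation.
Insight: Visual-only prompts make better use of the pretrained capabilities of SAM than text-derived embeddings, giving a more flexible and efficient route to reasoning segmentation.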
Abstract: In this study, we propose NOVO (NO text, Visual-Only prompts), a novel framework that bridges vision-language models (VLMs) and segmentation models through visual-only prompts. Unlike prior approaches that feed text-derived SEG token embeddings into segmentation models, NOVO instead generates a coarse mask and point prompts from the VLM output. These visual prompts are compatible with the Segment Anything Model (SAM), preserving alignment with its pretrained capabilities. To further enhance boundary quality and enable instance-level segmentation, we introduce a training-free refinement module that reduces visual artifacts and improves the quality of segmentation masks. We also present RISeg, a new benchmark comprising 918 images, 2,533 instance-level masks, and diverse reasoning queries to evaluate this task. Experiments demonstrate that NOVO achieves state-of-the-art performance across multiple metrics and model sizes, demonstrating its effectiveness and scalability in reasoning segmentation.
[156] HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment
Ruijia Wu,Ping Chen,Fei Shen,Shaoan Zhao,Qiang Hui,Huanlin Gao,Ting Lu,Zhaoxiang Liu,Fang Zhao,Kai Wang,Shiguo Lian
Main category: cs.CV
TL;DR: HiMo-CLIP enhances CLIP-like models by modeling semantic hierarchy and monotonicity in vision-language alignment, improving performance on complex text descriptions.
Details
Motivation: Current CLIP-style models treat text as flat sequences, failing to capture semantic hierarchy and monotonicity, which limits their effectiveness with compositional or long-form descriptions.
Contribution: Proposes HiMo-CLIP, which introduces hierarchical decomposition (HiDe) and monotonicity-aware contrastive loss (MoLo) to better align vision-language representations without changing encoder architectures.
Method: HiDe uses in-batch PCA to extract latent semantic components from text, while MoLo jointly aligns global and component-level representations to enforce semantic ordering.
Result: HiMo-CLIP outperforms baselines on image-text retrieval benchmarks, especially for long or compositional descriptions.
Insight: Modeling semantic hierarchy and monotonicity explicitly improves vision-language alignment, highlighting the importance of structured text understanding.
Abstract: Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness. These components work in concert to produce structured, cognitively-aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. The code is available at https://github.com/UnicomAI/HiMo-CLIP.
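The HiMo-CLIP abstract mentions extracting latent components from text embeddings via in-batch PCA. As a rough illustration of what PCA over a batch involves, the sketch below finds the top principal direction of a small batch by power iteration on the covariance matrix and projects an embedding onto it. This is generic PCA on toy 2D vectors, not the HiDe module; the function names and the choice of power iteration are assumptions.

```python
import math


def top_principal_component(batch, iterations=100):
    """Return (mean, unit direction) of the leading principal component of a batch."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in batch]
    # Covariance matrix of the centered batch (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    # Power iteration; the uniform start is an assumption and would fail only
    # if it were exactly orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate batch with no variance along v
        v = [x / norm for x in w]
    return mean, v


def project(embedding, mean, component):
    """Scalar coordinate of an embedding along a principal direction."""
    return sum((e - m) * c for e, m, c in zip(embedding, mean, component))


# Toy batch of 2D "text embeddings" that vary along the direction (1, 2).
batch = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, component = top_principal_component(batch)
coordinate = project([4.0, 8.0], mean, component)
```

Because the directions are estimated from whichever samples share a batch, the extracted components change with batch composition, which is one reading of the "batch-aware" wording in the abstract.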
[157] Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks
Lingran Song,Yucheng Zhou,Jianbing Shen
Main category: cs.CV
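TL;DR: This paper proposes Sim4Seg, a framework built around a Region-Aware Vision-Language Similarity to Mask (RVLS2M) module for the Medical Diagnosis Segmentation (MDS) task, and presents the M3DS dataset. Experiments show that it outperforms the baselines.
Details
Motivation: Existing medical image segmentation models rarely handle segmentation and diagnosis jointly, even though explainable diagnoses alongside segmentation results are crucial for patients.
Contribution: 1) Introduces the MDS task; 2) Presents the M3DS dataset; 3) Develops the Sim4Seg framework, which improves segmentation and diagnosis performance through the RVLS2M module.
Method: The RVLS2M module captures region-aware vision-language similarity, and a test-time scaling strategy further improves performance on MDS.
Result: Experiments show that Sim4Seg outperforms the baselines in both segmentation and diagnosis.
Insight: Combining visual and language information helps make medical image analysis more explainable, and region-aware similarity is key to the performance gains.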
Abstract: Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.
[158] Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View
Jianyu Qi,Ding Zou,Wenrui Yan,Rui Ma,Jiaxu Li,Zhijie Zheng,Zhiguo Yang,Rongchang Zhao
Main category: cs.CV
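TL;DR: This paper proposes difficulty-aware data sampling to improve post-training of multimodal large language models (MLLMs). It introduces two new difficulty-aware sampling strategies and a hierarchical training framework and validates them on multiple benchmark datasets.
Details
Motivation: Existing post-training paradigms lack quantifiable metrics of sample difficulty and fail to optimize perception and reasoning jointly, which hurts multimodal model performance. The paper aims to address both issues.
Contribution: 1. Proposes two new difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) and Cross-Modality Attention Balance (CMAB). 2. Designs a hierarchical training framework that covers both GRPO-only and SFT+GRPO training paradigms.
Method: 1. PISM quantifies sample difficulty through systematic image degradation. 2. CMAB assesses the complexity of cross-modal interaction by analyzing attention distributions. 3. A hierarchical training framework built on these metrics is used to optimize model performance.
Result: GRPO applied to difficulty-stratified samples outperforms the conventional SFT+GRPO pipeline, indicating that strategic data sampling can remove the need for supervised fine-tuning while improving accuracy.
Insight: Difficulty-aware sampling can make post-training of multimodal models more effective and further improve perception and reasoning.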
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) The lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization. (2) Suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.
[159] REOcc: Camera-Radar Fusion with Radar Feature Enrichment for 3D Occupancy Prediction
Chaehee Song,Sanmin Kim,Hyeonjun Jeong,Juyeb Shin,Joonhee Lim,Dongsuk Kum
Main category: cs.CV
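TL;DR: REOcc proposes a new camera-radar fusion network that uses radar feature enrichment to improve 3D occupancy prediction, significantly outperforming the camera-only baseline, particularly on dynamic object classes.
Details
Motivation: Camera-only 3D occupancy prediction struggles in challenging environments, while the sparsity and noise of radar data limit how well radar and camera can be fused. A method is needed to enrich radar feature representations for more effective sensor fusion.
Contribution: 1. Proposes the REOcc network, whose two components, Radar Densifier and Radar Amplifier, enhance the spatial density and quality of radar features. 2. Achieves significant gains on the Occ3D-nuScenes benchmark, particularly on dynamic object classes.
Method: 1. Radar Densifier: raises the density of radar features by integrating spatial information. 2. Radar Amplifier: improves the quality of radar features using contextual information. Together the two components strengthen the radar feature representation.
Result: On the Occ3D-nuScenes benchmark, REOcc significantly outperforms the camera-only baseline, particularly on dynamic object classes, showing that it mitigates the sparsity and noise of radar data.
Insight: REOcc shows the importance of radar feature enrichment for 3D occupancy prediction and points to a promising direction for multimodal sensor fusion.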
Abstract: Vision-based 3D occupancy prediction has made significant advancements, but its reliance on cameras alone struggles in challenging environments. This limitation has driven the adoption of sensor fusion, among which camera-radar fusion stands out as a promising solution due to their complementary strengths. However, the sparsity and noise of the radar data limit its effectiveness, leading to suboptimal fusion performance. In this paper, we propose REOcc, a novel camera-radar fusion network designed to enrich radar feature representations for 3D occupancy prediction. Our approach introduces two main components, a Radar Densifier and a Radar Amplifier, which refine radar features by integrating spatial and contextual information, effectively enhancing spatial density and quality. Extensive experiments on the Occ3D-nuScenes benchmark demonstrate that REOcc achieves significant performance gains over the camera-only baseline model, particularly in dynamic object classes. These results underscore REOcc's capability to mitigate the sparsity and noise of the radar data. Consequently, radar complements camera data more effectively, unlocking the full potential of camera-radar fusion for robust and reliable 3D occupancy prediction.
[160] Flexible Concept Bottleneck Model
Xingbo Du,Qiantong Dou,Lei Fan,Rui Zhang
Main category: cs.CV
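TL;DR: The paper proposes the Flexible Concept Bottleneck Model (FCBM), which uses a hypernetwork and a sparsemax module with a learnable temperature to support dynamic concept adaptation without retraining the entire model, improving flexibility and adaptability in real-world scenarios.
Details
Motivation: Existing concept bottleneck models (CBMs) built on vision-language models (VLMs) require full retraining when new concepts are introduced, which limits their practicality and flexibility. FCBM aims to remove this limitation and keep pace with rapidly evolving vision-language foundation models.
Contribution: 1. Proposes FCBM, which supports dynamic concept adaptation, including complete replacement of the original concept set. 2. Designs a hypernetwork driven by concept embeddings that integrates new concepts seamlessly. 3. Introduces a sparsemax module with a learnable temperature that dynamically selects the most relevant concepts.
Method: 1. A hypernetwork generates prediction weights from concept embeddings, avoiding model retraining. 2. A modified sparsemax module dynamically filters the most relevant concepts. 3. Generalization to new concepts is evaluated with a single epoch of fine-tuning.
Result: On five public benchmarks, FCBM reaches accuracy comparable to state-of-the-art baselines and adapts to unseen concepts with only a single epoch of fine-tuning.
Insight: Dynamic concept adaptation reduces the dependence on model retraining, improving practicality and efficiency, and suggests a path toward more flexible use of vision-language models.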
Abstract: Concept bottleneck models (CBMs) improve neural network interpretability by introducing an intermediate layer that maps human-understandable concepts to predictions. Recent work has explored the use of vision-language models (VLMs) to automate concept selection and annotation. However, existing VLM-based CBMs typically require full model retraining when new concepts are involved, which limits their adaptability and flexibility in real-world scenarios, especially considering the rapid evolution of vision-language foundation models. To address these issues, we propose Flexible Concept Bottleneck Model (FCBM), which supports dynamic concept adaptation, including complete replacement of the original concept set. Specifically, we design a hypernetwork that generates prediction weights based on concept embeddings, allowing seamless integration of new concepts without retraining the entire model. In addition, we introduce a modified sparsemax module with a learnable temperature parameter that dynamically selects the most relevant concepts, enabling the model to focus on the most informative features. Extensive experiments on five public benchmarks demonstrate that our method achieves accuracy comparable to state-of-the-art baselines with a similar number of effective concepts. Moreover, the model generalizes well to unseen concepts with just a single epoch of fine-tuning, demonstrating its strong adaptability and flexibility.
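The concept-selection step described above can be illustrated with standard sparsemax applied to temperature-scaled scores. This is a minimal sketch, not the authors' modified module: the function name `sparsemax`, the fixed (rather than learned) `temperature` argument, and the example scores are assumptions for illustration.

```python
def sparsemax(scores, temperature=1.0):
    """Euclidean projection of scores / temperature onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so low-scoring
    concepts are dropped entirely. The temperature is a plain argument here;
    in a trainable module it would be a learnable parameter (assumption).
    """
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # The support is the largest k for which this inequality holds.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(zi - tau, 0.0) for zi in z]


concept_scores = [2.0, 1.0, 0.1]  # hypothetical concept relevance scores
sharp = sparsemax(concept_scores, temperature=1.0)
smooth = sparsemax(concept_scores, temperature=2.0)
```

A lower temperature concentrates the weight on fewer concepts, while a higher temperature keeps more concepts active, which is one way the number of effective concepts could be controlled.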
[161] MirrorMamba: Towards Scalable and Robust Mirror Detection in Videos
Rui Song,Jiaying Lin,Rynson W. H. Lau
Main category: cs.CV
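TL;DR: MirrorMamba is a video mirror detection method that combines multiple cues (perceived depth, optical flow, and correspondence) with the Mamba architecture, addressing the limited performance and robustness of existing methods while remaining efficient and scalable.
Details
Motivation: Existing video mirror detection methods rely on a single dynamic feature or on Transformer architectures with high computational complexity, which limits their performance and robustness. A more efficient solution is needed.
Contribution: 1. Proposes MirrorMamba, the first Mamba-based video mirror detection method. 2. Fuses multiple cues (depth, optical flow, correspondence) to adapt to diverse conditions. 3. Designs a Mamba-based Multidirection Correspondence Extractor and a boundary enforcement decoder to improve feature extraction and resolve unclear boundaries.
Method: 1. Combines depth, optical flow, and correspondence cues. 2. Uses a Mamba-based feature extractor with a global receptive field and linear complexity. 3. Uses a layer-wise boundary enforcement decoder to resolve the unclear boundaries caused by blurred depth maps.
Result: Achieves state-of-the-art performance on both video and image benchmark datasets, demonstrating robustness and generalizability.
Insight: This is the first successful application of the Mamba architecture to mirror detection and shows its potential for efficient global feature extraction.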
Abstract: Video mirror detection has received significant research attention, yet existing methods suffer from limited performance and robustness. These approaches often over-rely on single, unreliable dynamic features, and are typically built on CNNs with limited receptive fields or Transformers with quadratic computational complexity. To address these limitations, we propose a new, effective, and scalable video mirror detection method, called MirrorMamba. Our approach leverages multiple cues to adapt to diverse conditions, incorporating perceived depth, correspondence, and optical flow. We also introduce an innovative Mamba-based Multidirection Correspondence Extractor, which benefits from the global receptive field and linear complexity of the emerging Mamba state space model to effectively capture correspondence properties. Additionally, we design a Mamba-based layer-wise boundary enforcement decoder to resolve the unclear boundary caused by the blurred depth map. Notably, this work marks the first successful application of the Mamba-based architecture in the field of mirror detection. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches for video mirror detection on the benchmark datasets. Furthermore, on the most challenging and representative image-based mirror detection dataset, our approach achieves state-of-the-art performance, proving its robustness and generalizability.
[162] SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Hunar Batra,Haoqin Tu,Hardy Chen,Yuanze Lin,Cihang Xie,Ronald Clark
Main category: cs.CV
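TL;DR: SpatialThinker is a multimodal large language model whose 3D reasoning is reinforced with spatial rewards. It combines structured spatial grounding with dense spatial rewards and significantly improves spatial understanding.
Details
Motivation: Existing multimodal large language models perform poorly on spatial understanding. They typically rely on explicit 3D inputs or architectural modifications and are constrained by large-scale datasets or sparse supervision.
Contribution: 1) A data synthesis pipeline that generates the STVQA-7K dataset. 2) Online reinforcement learning with a multi-objective dense spatial reward.
Method: The model builds a scene graph of task-relevant objects and spatial relations and performs multi-step reasoning guided by dense spatial rewards.
Result: SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, and even surpasses GPT-4o.
Insight: Combining spatial supervision with reward-aligned reasoning enables robust 3D spatial understanding with limited data and moves MLLMs closer to human-level visual reasoning.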
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
[163] Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System
Shubham Agarwal,Subrata Mitra,Saud Iqbal
Main category: cs.CV
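TL;DR: Argus is a quality-aware, high-throughput text-to-image (T2I) inference serving system. It dynamically selects an appropriate approximated model and setting for each prompt to balance quality against throughput, yielding significant performance gains.
Details
Motivation: Existing T2I models such as diffusion models are compute-intensive and slow, so conventional approaches struggle to meet high-throughput demands. The authors find that many prompts can be served by faster, approximated models, provided quality degradation is avoided.
Contribution: Proposes Argus, which dynamically selects the best approximation strategy, achieving higher throughput and quality while reducing latency SLO violations.
Method: Argus switches between approximation strategies and picks the most suitable model and setting for each prompt, balancing quality against throughput.
Result: On real-world workloads, Argus achieves 10x fewer SLO violations, 10% higher average quality, and 40% higher throughput than the baselines.
Insight: Dynamically adjusting the approximation strategy can improve T2I inference efficiency while maintaining quality.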
Abstract: Text-to-image (T2I) models have gained significant popularity. Most of these are diffusion models with unique computational characteristics, distinct from both traditional small-scale ML models and large language models. They are highly compute-bound and use an iterative denoising process to generate images, leading to very high inference time. This creates significant challenges in designing a high-throughput system. We discovered that a large fraction of prompts can be served using faster, approximated models. However, the approximation setting must be carefully calibrated for each prompt to avoid quality degradation. Designing a high-throughput system that assigns each prompt to the appropriate model and compatible approximation setting remains a challenging problem. We present Argus, a high-throughput T2I inference system that selects the right level of approximation for each prompt to maintain quality while meeting throughput targets on a fixed-size cluster. Argus intelligently switches between different approximation strategies to satisfy both throughput and quality requirements. Overall, Argus achieves 10x fewer latency service-level objective (SLO) violations, 10% higher average quality, and 40% higher throughput compared to baselines on two real-world workload traces.
[164] SinSEMI: A One-Shot Image Generation Model and Data-Efficient Evaluation Framework for Semiconductor Inspection Equipment
ChunLiang Wu,Xiaochun Li
Main category: cs.CV
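TL;DR: SinSEMI is a one-shot image generation model designed for the data scarcity of early-stage semiconductor equipment development. It generates realistic and diverse images with a multi-scale flow-based model guided by LPIPS energy, and comes with a dedicated evaluation framework.
Details
Motivation: Large quantities of raw optical images are hard to obtain in the early stages of semiconductor equipment development, which holds back AI solutions. SinSEMI addresses this by generating diverse, highly realistic images from a single sample.
Contribution: 1. Proposes SinSEMI, a one-shot image generation method based on a multi-scale flow-based model with LPIPS energy guidance. 2. Designs a dedicated evaluation framework that assesses generated results thoroughly using only two reference images.
Method: A multi-scale flow-based model is combined with an LPIPS energy-guided sampling strategy to generate realistic and diverse images. The evaluation framework performs a multi-dimensional analysis using only two reference images.
Result: Experiments show that SinSEMI outperforms other one-shot generation techniques in visual quality, quantitative measures, and downstream tasks, and that its images combine high fidelity with diversity.
Insight: In data-scarce settings, combining a perceptual loss with multi-scale modeling can improve one-shot image generation and provides a practical data augmentation option for semiconductor AI applications.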
Abstract: In the early stages of semiconductor equipment development, obtaining large quantities of raw optical images poses a significant challenge. This data scarcity hinders the advancement of AI-powered solutions in semiconductor manufacturing. To address this challenge, we introduce SinSEMI, a novel one-shot learning approach that generates diverse and highly realistic images from a single optical image. SinSEMI employs a multi-scale flow-based model enhanced with LPIPS (Learned Perceptual Image Patch Similarity) energy guidance during sampling, ensuring both perceptual realism and output variety. We also introduce a comprehensive evaluation framework tailored for this application, which enables a thorough assessment using just two reference images. Through the evaluation against multiple one-shot generation techniques, we demonstrate SinSEMI's superior performance in visual quality, quantitative measures, and downstream tasks. Our experimental results demonstrate that SinSEMI-generated images achieve both high fidelity and meaningful diversity, making them suitable as training data for semiconductor AI applications.
[165] Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV
Wenbo Huang,Jinghui Zhang,Zhenghao Chen,Guang Li,Lei Zhang,Yang Cao,Fang Dong,Takahiro Ogawa,Miki Haseyama
Main category: cs.CV
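TL;DR: Otter enhances RWKV to mitigate background distractions in wide-angle few-shot action recognition (FSAR). A Compound Segmentation Module (CSM) and a Temporal Reconstruction Module (TRM) strengthen subject emphasis and temporal modeling.
Details
Motivation: Wide-angle videos express actions within specific scenarios effectively in FSAR, but background distractions and degraded temporal relations make recognition difficult. Applying RWKV directly for global modeling fails to highlight the subjects.
Contribution: Designs the CSM to segment key regions and highlight subjects, and the TRM to reconstruct temporal relations through bidirectional scanning, and combines a regular prototype with a temporal-enhanced prototype to improve performance.
Method: The CSM segments key regions in each frame, the TRM reconstructs temporal relations through bidirectional scanning, and regular and temporal-enhanced prototypes are combined to improve subject emphasis and temporal modeling.
Result: Achieves state-of-the-art performance on benchmarks including SSv2, Kinetics, UCF101, and HMDB51. The VideoBadminton dataset further confirms its advantage in wide-angle FSAR.
Insight: Combining the CSM and TRM addresses background distraction and temporal degradation in wide-angle FSAR, and the enhanced RWKV shows potential for global modeling.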
Abstract: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing temporal relations to be better reconstructed. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
[166] PointCubeNet: 3D Part-level Reasoning with 3x3x3 Point Cloud Blocks
Da-Yeong Kim,Yeong-Jun Cho
Main category: cs.CV
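TL;DR: PointCubeNet is an unsupervised multimodal 3D understanding framework that achieves part-level reasoning through 3x3x3 local blocks, without requiring part annotations.
Details
Motivation: Current 3D understanding methods often lack fine-grained analysis of object parts and depend on annotated data. PointCubeNet aims to overcome this limitation in an unsupervised way.
Contribution: 1. Proposes the first unsupervised 3D part-level reasoning framework. 2. Introduces a 3x3x3 local block structure and a pseudo-labeling method. 3. Trains the model effectively with a local loss function.
Method: The framework has a global branch and a local branch, and the local branch is structured into 3x3x3 blocks. Training is unsupervised, using pseudo-labels and a local loss function.
Result: Experiments show that part-level understanding improves overall 3D object understanding and that the unsupervised results are reliable and meaningful.
Insight: Unsupervised part-level reasoning is feasible and improves the fine-grained quality of 3D understanding.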
Abstract: In this paper, we propose PointCubeNet, a novel multi-modal 3D understanding framework that achieves part-level reasoning without requiring any part annotations. PointCubeNet comprises global and local branches. The proposed local branch, structured into 3x3x3 local blocks, enables part-level analysis of point cloud sub-regions with the corresponding local text labels. Leveraging the proposed pseudo-labeling method and local loss function, PointCubeNet is effectively trained in an unsupervised manner. The experimental results demonstrate that understanding 3D object parts enhances the understanding of the overall 3D object. In addition, this is the first attempt to perform unsupervised 3D part-level reasoning and achieves reliable and meaningful results.
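To make the 3x3x3 local block structure concrete, the sketch below assigns each point of a cloud to one of 27 axis-aligned blocks of its bounding box. The abstract does not specify how the blocks are formed, so the uniform bounding-box grid and the names `assign_blocks` and `grid` are assumptions for illustration only.

```python
def assign_blocks(points, grid=3):
    """Group 3D points into grid x grid x grid blocks of their bounding box.

    Returns a dict mapping a block index (ix, iy, iz) to the list of points
    that fall inside it. Empty blocks are simply absent from the dict.
    """
    xs, ys, zs = zip(*points)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    blocks = {}
    for p in points:
        idx = []
        for value, lo, hi in zip(p, mins, maxs):
            span = hi - lo
            cell = int((value - lo) / span * grid) if span > 0 else 0
            # Points on the upper boundary belong to the last cell.
            idx.append(min(cell, grid - 1))
        blocks.setdefault(tuple(idx), []).append(p)
    return blocks


cloud = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]  # toy point cloud
blocks = assign_blocks(cloud)
```

Each non-empty block could then be encoded separately and matched against a local text label, which is the part-level analysis the abstract describes.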
[167] Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images
You-Kyoung Na,Yeong-Jun Cho
Main category: cs.CV
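TL;DR: Med-SORA is a symptom-to-organ reasoning framework for abdominal CT images. It uses RAG-based dataset construction, soft labels, and a 2D-3D cross-attention architecture to model the complex relationships between symptoms and organs in medical multimodal learning.
Details
Motivation: Existing medical multimodal models rely on simple one-to-one hard labels, ignoring the clinical reality that a symptom can relate to several organs, and they use only single-slice 2D features, which cannot capture the full anatomical context.
Contribution: Proposes Med-SORA, the first symptom-to-organ reasoning framework, introducing RAG-based dataset construction, soft labels with learnable organ anchors, and a 2D-3D cross-attention architecture.
Method: The dataset is built with RAG, soft labels and learnable organ anchors capture one-to-many symptom-organ relationships, and 2D-3D cross-attention fuses local and global features.
Result: Experiments show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.
Insight: Through multimodal learning and 3D feature fusion, Med-SORA offers a solution for medical image analysis that is closer to clinical reality.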
Abstract: Understanding symptom-image associations is crucial for clinical reasoning. However, existing medical multimodal models often rely on simple one-to-one hard labeling, oversimplifying clinical reality where symptoms relate to multiple organs. In addition, they mainly use single-slice 2D features without incorporating 3D information, limiting their ability to capture full anatomical context. In this study, we propose Med-SORA, a framework for symptom-to-organ reasoning in abdominal CT images. Med-SORA introduces RAG-based dataset construction, soft labeling with learnable organ anchors to capture one-to-many symptom-organ relationships, and a 2D-3D cross-attention architecture to fuse local and global image features. To our knowledge, this is the first work to address symptom-to-organ reasoning in medical multimodal learning. Experimental results show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.
[168] Robust and High-Fidelity 3D Gaussian Splatting: Fusing Pose Priors and Geometry Constraints for Texture-Deficient Outdoor Scenes
Meijun Guo,Yongliang Shi,Caiyun Liu,Yixiao Feng,Ming Ma,Tinghai Yan,Weining Lu,Bin Liang
Main category: cs.CV
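TL;DR: The paper proposes a 3D Gaussian Splatting method that fuses pose priors with geometry constraints. It addresses unstable pose estimation and distorted scene representation in texture-deficient outdoor scenes and significantly improves 3DGS results.
Details
Motivation: In large outdoor scenes with weak or repetitive textures, conventional 3D Gaussian Splatting (3DGS) suffers from unstable camera pose estimation and distorted scene representation, which hurts rendering quality and robustness.
Contribution: 1. Uses LiDAR-IMU odometry to provide camera pose priors that constrain COLMAP's triangulation and bundle adjustment. 2. Introduces normal vector constraints and effective rank regularization to enforce consistency in the direction and shape of Gaussian primitives.
Method: 1. Pose optimization: priors from LiDAR-IMU odometry constrain COLMAP's triangulation and bundle adjustment. 2. Scene representation: normal vector constraints and effective rank regularization are optimized jointly with the photometric loss.
Result: Experiments show that the method cuts pose optimization time by two-thirds while maintaining accuracy and robustness. In scene representation it significantly outperforms conventional 3DGS, especially on datasets with weak or repetitive textures.
Insight: Combining sensor priors with geometry constraints is effective for 3DGS in texture-deficient scenes and shows the potential of multimodal data fusion for 3D reconstruction.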
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a key rendering pipeline for digital asset creation due to its balance between efficiency and visual quality. To address the issues of unstable pose estimation and scene representation distortion caused by geometric texture inconsistency in large outdoor scenes with weak or repetitive textures, we approach the problem from two aspects: pose estimation and scene representation. For pose estimation, we leverage LiDAR-IMU Odometry to provide prior poses for cameras in large-scale environments. These prior pose constraints are incorporated into COLMAP’s triangulation process, with pose optimization performed via bundle adjustment. Ensuring consistency between pixel data association and prior poses helps maintain both robustness and accuracy. For scene representation, we introduce normal vector constraints and effective rank regularization to enforce consistency in the direction and shape of Gaussian primitives. These constraints are jointly optimized with the existing photometric loss to enhance the map quality. We evaluate our approach using both public and self-collected datasets. In terms of pose optimization, our method requires only one-third of the time while maintaining accuracy and robustness across both datasets. In terms of scene representation, the results show that our method significantly outperforms conventional 3DGS pipelines. Notably, on self-collected datasets characterized by weak or repetitive textures, our approach demonstrates enhanced visualization capabilities and achieves superior overall performance. Codes and data will be publicly available at https://github.com/justinyeah/normal_shape.git.
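The abstract does not define its effective rank regularization, so the following is only a sketch of the quantity such a term is usually built on: the effective rank, computed as the exponential of the Shannon entropy of the normalized spectrum. Applying it to the three squared axis scales of a Gaussian primitive, and the function name `effective_rank`, are assumptions for illustration.

```python
import math


def effective_rank(squared_scales):
    """Effective rank of a primitive from its squared axis scales.

    The scales are normalized to a probability distribution p, and the
    result is exp(entropy(p)). It ranges from 1 (needle-like, one dominant
    axis) to len(squared_scales) (isotropic).
    """
    total = sum(squared_scales)
    probs = [s / total for s in squared_scales]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


sphere = effective_rank([1.0, 1.0, 1.0])     # isotropic primitive
disk = effective_rank([1.0, 1.0, 1e-12])     # flat, surface-like primitive
needle = effective_rank([1.0, 1e-12, 1e-12]) # degenerate, needle-like primitive
```

A regularizer could, for example, penalize primitives whose effective rank falls toward 1, discouraging needle-like shapes. Whether the paper uses this exact form is not stated in the abstract.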
[169] TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning
Rui Wang,Ying Zhou,Hao Wang,Wenwei Zhang,Qiang Li,Zhiwei Wang
Main category: cs.CV
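TL;DR: TiS-TSL is a time-switchable teacher-student learning framework that supervises surgical video stereo matching with sparse image-level labels. It addresses the lack of spatio-temporal consistency in existing methods and significantly improves performance.
Details
Motivation: In minimally invasive surgery, dense disparity supervision is nearly impossible, so only sparse image-level labels are usually available. Existing teacher-student methods provide only spatial confidence and lack temporal consistency, which causes unstable predictions and flickering artifacts across video frames.
Contribution: 1. Proposes a unified model that supports three operating modes (IP, FVP, BVP) for flexible temporal modeling. 2. Designs a two-stage learning strategy (I2V and V2V) that uses bidirectional spatio-temporal consistency to filter noisy pseudo labels and enforce temporal coherence.
Method: 1. A unified model supports three modes for temporal modeling. 2. The I2V stage transfers sparse image-level knowledge to temporal modeling. The V2V stage compares forward and backward predictions to compute bidirectional spatio-temporal consistency and refine the disparity predictions.
Result: On two public datasets, TiS-TSL improves TEPE and EPE by at least 2.11% and 4.54%, respectively, over other image-based state-of-the-art methods.
Insight: Flexible temporal modeling combined with bidirectional consistency checking can significantly improve the accuracy and stability of video stereo matching under sparse supervision.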
Abstract: Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL exceeds other image-based state-of-the-art methods by improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
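The idea of comparing forward and backward predictions to filter pseudo labels can be sketched as a per-pixel agreement test. This is a simplified illustration, not the paper's consistency measure: the absolute-difference criterion, the `threshold` value, and the function names are all assumptions.

```python
def consistency_mask(disp_forward, disp_backward, threshold=1.0):
    """Mark pixels where forward and backward disparities agree.

    Both inputs are 2D lists of disparities for the same frame, one from a
    forward-in-time pass and one from a backward-in-time pass.
    """
    return [
        [abs(f - b) <= threshold for f, b in zip(row_f, row_b)]
        for row_f, row_b in zip(disp_forward, disp_backward)
    ]


def filter_pseudo_labels(disp_forward, disp_backward, threshold=1.0):
    """Keep the averaged disparity where the passes agree, None elsewhere."""
    mask = consistency_mask(disp_forward, disp_backward, threshold)
    return [
        [(f + b) / 2.0 if ok else None for f, b, ok in zip(row_f, row_b, row_m)]
        for row_f, row_b, row_m in zip(disp_forward, disp_backward, mask)
    ]


forward = [[10.0, 5.0]]   # toy forward-pass disparities
backward = [[10.4, 8.0]]  # toy backward-pass disparities
mask = consistency_mask(forward, backward)
labels = filter_pseudo_labels(forward, backward)
```

Pixels that fail the test are excluded from the pseudo-label set, so the student is not trained on regions where the two temporal directions disagree.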
[170] MUGSQA: Novel Multi-Uncertainty-Based Gaussian Splatting Quality Assessment Method, Dataset, and Benchmarks
Tianang Chen,Jian Jin,Shilv Cai,Zhuangzi Li,Weisi Lin
Main category: cs.CV
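TL;DR: MUGSQA proposes a multi-uncertainty-based quality assessment method for Gaussian Splatting, together with a dataset and benchmarks for evaluating the perceptual quality of GS reconstruction methods and the performance of existing quality assessment metrics.
Details
Motivation: Gaussian Splatting (GS) performs well in 3D object reconstruction, but effective perceptual quality assessment methods are lacking. MUGSQA fills this gap by mimicking human viewing behavior and accounting for multiple uncertainties in the input data.
Contribution: 1) Proposes a unified multi-distance subjective quality assessment method that mimics human viewing behavior. 2) Constructs the MUGSQA dataset, which accounts for multiple input-data uncertainties. 3) Establishes two benchmarks that evaluate the robustness of GS reconstruction methods and the performance of existing quality metrics.
Method: A multi-distance subjective quality assessment method is combined with input-data uncertainties (number of views, resolution, view distance, and accuracy of the initial point cloud) to construct the dataset.
Result: The proposed method is closer to human perception, and the dataset and benchmarks provide standardized tools for GS quality assessment.
Insight: MUGSQA highlights the effect of input-data uncertainty on perceptual quality and provides a reference for future optimization and evaluation of GS techniques.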
Abstract: Gaussian Splatting (GS) has recently emerged as a promising technique for 3D object reconstruction, delivering high-quality rendering results with significantly improved reconstruction speed. As variants continue to appear, assessing the perceptual quality of 3D objects reconstructed with different GS-based methods remains an open challenge. To address this issue, we first propose a unified multi-distance subjective quality assessment method that closely mimics human viewing behavior for objects reconstructed with GS-based methods in actual applications, thereby better collecting perceptual experiences. Based on it, we also construct a novel GS quality assessment dataset named MUGSQA, which is constructed considering multiple uncertainties of the input data. These uncertainties include the quantity and resolution of input views, the view distance, and the accuracy of the initial point cloud. Moreover, we construct two benchmarks: one to evaluate the robustness of various GS-based reconstruction methods under multiple uncertainties, and the other to evaluate the performance of existing quality assessment metrics. Our dataset and benchmark code will be released soon.
[171] ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search
Zhenjie Liu,Jianzhang Lu,Renjie Lu,Cong Liang,Shangfei Wang
Main category: cs.CV
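TL;DR: ConsistTalk is an intensity-controllable, temporally consistent talking head generation framework. An optical flow-guided temporal module, an audio-to-intensity model, and a diffusion noise initialization strategy address the flickering, identity drift, and audio-visual synchronization problems of existing methods.
Details
Motivation: Current diffusion-based audio-driven portrait animation methods suffer from flickering, identity drift, and poor audio-visual synchronization, mainly because of entangled appearance-motion representations and unstable inference strategies.
Contribution: 1. An optical flow-guided temporal module (OFT) that decouples motion from static appearance. 2. An Audio-to-Intensity (A2I) model that jointly models audio and visual motion. 3. A diffusion noise initialization strategy (IC-Init) that improves inference.
Method: 1. Facial optical flow is used to separate motion features. 2. The A2I model is trained through multimodal teacher-student knowledge distillation. 3. At inference time, a noise search enforces background coherence and motion continuity.
Result: Experiments show that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and generating temporally stable, high-fidelity videos.
Insight: Decoupling appearance and motion features improves temporal consistency, and the inference-time noise search further improves generation quality.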
Abstract: Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
[172] PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory
Qunchao Jin,Yilin Wu,Changhao Chen
Main category: cs.CV
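TL;DR: PanoNav is a mapless zero-shot object navigation framework that improves spatial understanding through panoramic scene parsing and a dynamic memory mechanism, significantly outperforming existing baselines.
Details
Motivation: Zero-shot object navigation (ZSON) in unseen environments is limited by its dependence on depth sensors or prebuilt maps, and by short-sighted decisions caused by a lack of historical context.
Contribution: Proposes the PanoNav framework, which combines a panoramic scene parsing module with a dynamic memory queue to achieve zero-shot navigation from RGB input only.
Method: 1. A Panoramic Scene Parsing module extracts spatial information from panoramic RGB input. 2. A Dynamic Bounded Memory Queue retains exploration history and helps avoid local deadlocks.
Result: On the public navigation benchmark, PanoNav significantly outperforms baseline methods in both SR (success rate) and SPL (success weighted by path length).
Insight: Combining panoramic parsing with multi-frame memory improves navigation performance, and RGB-only input is sufficient for complex spatial reasoning tasks.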
Abstract: Zero-shot object navigation (ZSON) in unseen environments remains a challenging problem for household robots, requiring strong perceptual understanding and decision-making capabilities. While recent methods leverage metric maps and Large Language Models (LLMs), they often depend on depth sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address this, but they typically make short-sighted decisions, leading to local deadlocks due to a lack of historical context. We propose PanoNav, a fully RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing module to unlock the spatial parsing potential of MLLMs from panoramic RGB inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic Bounded Memory Queue to incorporate exploration history and avoid local deadlocks. Experiments on the public navigation benchmark show that PanoNav significantly outperforms representative baselines in both SR and SPL metrics.
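A bounded memory of exploration history can be sketched with a fixed-capacity FIFO queue. This is a simplification: the paper's Dynamic Bounded Memory Queue is not specified in the abstract, and the class name `BoundedMemory`, the `capacity` parameter, and the text format of the entries are assumptions for illustration.

```python
from collections import deque


class BoundedMemory:
    """Keep only the most recent observations, oldest evicted first."""

    def __init__(self, capacity=3):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self._queue = deque(maxlen=capacity)

    def add(self, observation):
        self._queue.append(observation)

    def as_list(self):
        return list(self._queue)

    def as_prompt(self):
        """Render the history as text that could be prepended to an MLLM query."""
        return "\n".join(f"- {obs}" for obs in self._queue)


memory = BoundedMemory(capacity=2)
memory.add("hallway, no target visible")
memory.add("kitchen entrance on the left")
memory.add("kitchen, counter ahead")
history = memory.as_list()
```

Bounding the queue keeps the prompt length fixed while still giving the decision step enough recent context to recognize that it is revisiting the same place.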
[173] Aerial Image Stitching Using IMU Data from a UAV
Selim Ahmet Iz,Mustafa Unel
Main category: cs.CV
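TL;DR: The paper proposes a UAV aerial image stitching method that combines IMU data with computer vision techniques. It estimates the UAV's displacement and rotation and corrects perspective distortion to produce a high-resolution stitched image.
Details
Motivation: UAV aerial image stitching often suffers from errors and ambiguities in feature detection and matching, and existing approaches such as feature matching or direct image alignment have limitations. The paper uses IMU data to compensate for these weaknesses and improve stitching accuracy and robustness.
Contribution: Proposes a method combining IMU data and computer vision that addresses the challenges of UAV aerial image stitching through displacement estimation, distortion correction, and homography computation.
Method: IMU data are used to estimate the UAV's displacement and rotation, perspective distortion is corrected, a homography matrix is computed, and a standard image stitching algorithm then aligns and blends the images.
Result: Experiments show that the method outperforms existing feature-based methods in challenging scenarios such as large displacements, rotations, and variations in camera pose, improving stitching accuracy and robustness.
Insight: IMU data can noticeably improve the accuracy and stability of image stitching, especially in challenging scenarios, and the method integrates easily into existing UAV workflows.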
Abstract: Unmanned Aerial Vehicles (UAVs) are widely used for aerial photography and remote sensing applications. One of the main challenges is to stitch together multiple images into a single high-resolution image that covers a large area. Feature-based image stitching algorithms are commonly used but can suffer from errors and ambiguities in feature detection and matching. To address this, several approaches have been proposed, including using bundle adjustment techniques or direct image alignment. In this paper, we present a novel method that uses a combination of IMU data and computer vision techniques for stitching images captured by a UAV. Our method involves several steps such as estimating the displacement and rotation of the UAV between consecutive images, correcting for perspective distortion, and computing a homography matrix. We then use a standard image stitching algorithm to align and blend the images together. Our proposed method leverages the additional information provided by the IMU data, corrects for various sources of distortion, and can be easily integrated into existing UAV workflows. Our experiments demonstrate the effectiveness and robustness of our method, outperforming some of the existing feature-based image stitching algorithms in terms of accuracy and reliability, particularly in challenging scenarios such as large displacements, rotations, and variations in camera pose.
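The step of computing a homography from IMU readings can be illustrated with the textbook rotation-only case, where the homography between two views is H = K R K^-1. This sketch ignores displacement and assumes a pinhole camera with square pixels and a yaw rotation about the optical axis; the function names and the intrinsics used are hypothetical, and the paper's full pipeline is not reproduced here.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_homography(focal, cx, cy, yaw_rad):
    """Homography K R K^-1 for a camera rotated by yaw about its optical axis."""
    k = [[focal, 0.0, cx], [0.0, focal, cy], [0.0, 0.0, 1.0]]
    # Closed-form inverse of the intrinsic matrix above.
    k_inv = [[1.0 / focal, 0.0, -cx / focal],
             [0.0, 1.0 / focal, -cy / focal],
             [0.0, 0.0, 1.0]]
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    r = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(k, r), k_inv)


def warp_point(h, x, y):
    """Apply homography h to pixel (x, y) and dehomogenize."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w


# Hypothetical intrinsics: 500 px focal length, principal point (320, 240).
h = rotation_homography(500.0, 320.0, 240.0, math.pi / 2)
center = warp_point(h, 320.0, 240.0)
shifted = warp_point(h, 330.0, 240.0)
```

The principal point stays fixed under this rotation, while a pixel 10 px to its right maps to 10 px below it. An IMU-derived yaw can therefore pre-align consecutive frames before any feature matching is attempted.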
[174] Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders
Federico Vasile,Ri-Zhao Qiu,Lorenzo Natale,Xiaolong Wang
Main category: cs.CV
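TL;DR: AS-DiffMPM is a differentiable MPM framework that extends existing methods to support physical property estimation with arbitrarily shaped colliders while keeping end-to-end optimization.
Details
Motivation: Existing MPM-based system identification methods perform poorly in scenes with non-planar colliders, which limits the range of real-world applications.
Contribution: Proposes the AS-DiffMPM framework, which introduces a differentiable collision handling mechanism and supports physical property estimation with arbitrarily shaped colliders.
Method: Extends a differentiable MPM framework with a differentiable collision handling module and interfaces it with various novel view synthesis methods.
Result: The framework estimates physical properties accurately in complex collision scenarios.
Insight: Differentiable collision handling is key to system identification under complex object-environment interactions.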
Abstract: System identification involving the geometry, appearance, and physical properties from video observations is a challenging task with applications in robotics and graphics. Recent approaches have relied on fully differentiable Material Point Method (MPM) and rendering for simultaneous optimization of these properties. However, they are limited to simplified object-environment interactions with planar colliders and fail in more challenging scenarios where objects collide with non-planar surfaces. We propose AS-DiffMPM, a differentiable MPM framework that enables physical property estimation with arbitrarily shaped colliders. Our approach extends existing methods by incorporating a differentiable collision handling mechanism, allowing the target object to interact with complex rigid bodies while maintaining end-to-end optimization. We show AS-DiffMPM can be easily interfaced with various novel view synthesis methods as a framework for system identification from visual observations.
[175] Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers
Huiyuan Tian,Bonan Xu,Shijian Li
Main category: cs.CV
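TL;DR: Using a proposed analytical framework called "distillation dynamics", the paper explains why feature-based knowledge distillation fails in Vision Transformers (ViTs): a representational paradigm mismatch between teacher and student causes late-layer feature alignment to harm student performance.
Details
Motivation: Feature-based knowledge distillation is highly effective for compressing convolutional neural networks (CNNs) but unexpectedly fails on ViTs. The authors analyze the phenomenon to find its root cause.
Contribution: 1. Proposes the "distillation dynamics" analytical framework. 2. Reveals a U-shaped information processing pattern in ViTs. 3. Identifies the representational paradigm mismatch between teacher and student as the main cause of distillation failure.
Method: Combines frequency spectrum analysis, information entropy metrics, and activation magnitude tracking to analyze the distillation dynamics of ViTs.
Result: Late-layer feature alignment harms student performance because of the representational mismatch, so methods must go beyond naive feature mimicry.
Insight: Successful distillation in ViTs must respect representational constraints, calling for better-matched methods rather than plain feature alignment.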
Abstract: While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided in the supplementary material.
[176] VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling
Sicheng Yang,Xing Hu,Qiang Wu,Dawei Yang
Main category: cs.CV
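TL;DR: VAEVQ enhances discrete visual tokenization through variational modeling, addressing the non-smooth latent spaces and weak alignment between pre- and post-quantization features found in VQ frameworks.
Details
Motivation: VQ frameworks suffer from non-smooth latent spaces and weak alignment between features before and after quantization, which degrades performance on reconstruction and generation tasks.
Contribution: Proposes VAEVQ, which comprises three core components: Variational Latent Quantization (VLQ), Representation Coherence Strategy (RCS), and Distribution Consistency Regularization (DCR).
Method: 1) VLQ replaces the AE with a VAE to exploit its smooth latent space; 2) RCS adaptively modulates the alignment strength between pre- and post-quantization features; 3) DCR aligns the codebook distribution with the continuous latent distribution.
Result: Experiments on two benchmark datasets show that VAEVQ outperforms existing methods.
Insight: Combining variational modeling with coherence strategies improves discrete visual tokenization, raising codebook utilization and performance on generation tasks.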
Abstract: Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading the performance of both reconstruction and downstream generation tasks. To this end, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), replacing the AE with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), adaptively modulating the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), aligning the entire codebook distribution with the continuous latent distribution to improve utilization. Extensive experiments on two benchmark datasets demonstrate that VAEVQ outperforms state-of-the-art methods.
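To make the "underutilized codebook" problem concrete, the sketch below shows plain nearest-neighbour vector quantization and a simple utilization measure. It illustrates the baseline VQ step only, not VLQ, RCS, or DCR; the function names are hypothetical.

```python
def quantize(vector, codebook):
    """Index of the nearest codeword under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def codebook_utilization(vectors, codebook):
    """Fraction of codewords selected at least once for the given features."""
    used = {quantize(v, codebook) for v in vectors}
    return len(used) / len(codebook)
```

A utilization well below 1.0 over a large batch of features means many codewords never activate, which is the symptom the abstract attributes to non-smooth latents and weak pre/post-quantization alignment.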
[177] A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models
Jan-Hendrik Koch,Jonas Krumme,Konrad Gadzicki
Main category: cs.CV
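TL;DR: This paper proposes a two-stage system that uses a large language model (LLM) and a diffusion model for layout-controlled image generation, addressing the lack of precise control over object counts and spatial arrangement in standard diffusion models.
Details
Motivation: Text-to-image diffusion models have strong generative capabilities but lack precise control over object counts and spatial layout. The paper aims to overcome this limitation through task decomposition and layout conditioning.
Contribution: Proposes a two-stage system: 1) an LLM generates a structured layout; 2) a layout-conditioned diffusion model synthesizes the image. Simplifying the initial generation and completing the layout with rules substantially improves object recall.
Method: 1) The LLM generates the layout of the core objects; 2) Two conditional diffusion methods, ControlNet and GLIGEN, are compared, including fine-tuning experiments.
Result: Object recall in complex scenes reaches 99.9% (up from 57.2%). ControlNet preserves text-based stylistic control but is prone to object hallucination, whereas GLIGEN gives more precise layouts at the cost of weaker prompt-based controllability.
Insight: Task decomposition and layout constraints are key to precise control in diffusion models; the two-stage approach shows promise for balancing layout fidelity and controllability.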
Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
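The first stage, core-object generation followed by rule-based insertion, together with the object recall metric, can be sketched as below. The placement rules, the `plate` anchor, and the box format `(x, y, w, h)` are hypothetical choices made for this table-setting illustration; the paper's actual rules are not given in the abstract.

```python
from collections import Counter

# Hypothetical rules: offset (dx, dy) of each object relative to the anchor box.
RULES = {"fork": (-0.15, 0.0), "knife": (0.15, 0.0), "cup": (0.15, -0.15)}

def object_recall(layout, requested):
    """Share of requested object instances that appear in the layout."""
    placed = Counter(name for name, _ in layout)
    hits = sum(min(placed[name], n) for name, n in requested.items())
    return hits / sum(requested.values())

def complete_layout(core_layout, requested, anchor="plate"):
    """Add missing objects next to the anchor using the fixed rules above."""
    layout = list(core_layout)
    anchor_box = next((box for name, box in layout if name == anchor), None)
    if anchor_box is None:
        return layout  # no anchor to place objects against
    placed = Counter(name for name, _ in layout)
    x, y, w, h = anchor_box
    for name, needed in requested.items():
        if name not in RULES:
            continue
        dx, dy = RULES[name]
        for k in range(placed[name], needed):
            layout.append((name, (x + dx * (k + 1), y + dy * (k + 1), w / 2, h / 2)))
    return layout
```

The design point is that the LLM only has to place a few core objects correctly, while deterministic rules guarantee that every remaining requested object ends up in the layout.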
[178] Classification of Microplastic Particles in Water using Polarized Light Scattering and Machine Learning Methods
Leonard Saur,Marc von Pawlowski,Ulrich Gengenbach,Ingo Sieber,Hossein Shirali,Lorenz Wührl,Rainer Kiko,Christian Pylatiuk
Main category: cs.CV
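TL;DR: Proposes a method for classifying microplastics in water using polarized light scattering and deep learning. Using reflected signals avoids the interference that affects conventional transmission-based methods, and the approach reaches 80% classification accuracy. The AOLP signal is better at distinguishing the two polyethylene types, while the DOLP signal is better at identifying polypropylene.
Details
Motivation: Conventional microplastic monitoring methods suffer from interference in aquatic environments, so an in-situ, large-scale monitoring technique is needed.
Contribution: 1. Proposes a reflection-based monitoring method built on polarized light scattering; 2. Classifies microplastics with a CNN and reveals the different roles that the polarization signals (AOLP and DOLP) play in classification.
Method: Microplastic particles are illuminated with linearly polarized laser light, a polarization-sensitive camera captures the reflected signal, and a CNN classifies the resulting images.
Result: Achieves a peak mean classification accuracy of 80% on the test set; AOLP is better at distinguishing high-density from low-density polyethylene, while DOLP is better at identifying polypropylene.
Insight: A particle's internal polarization patterns influence classification more than its macroscopic shape, and AOLP is more robust to contextual noise than DOLP.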
Abstract: Facing the critical need for continuous, large-scale microplastic monitoring, which is hindered by the limitations of gold-standard methods in aquatic environments, this paper introduces and validates a novel, reflection-based approach for the in-situ classification and identification of microplastics directly in water bodies, which is based on polarized light scattering. In this experiment, we classify colorless microplastic particles (50-300 μm) by illuminating them with linearly polarized laser light and capturing their reflected signals using a polarization-sensitive camera. This reflection-based technique successfully circumvents the transmission-based interference issues that plague many conventional methods when applied in water. Using a deep convolutional neural network (CNN) for image-based classification, we successfully identified three common polymer types, high-density polyethylene, low-density polyethylene, and polypropylene, achieving a peak mean classification accuracy of 80% on the test dataset. A subsequent feature hierarchy analysis demonstrated that the CNN's decision-making process relies mainly on the microstructural integrity and internal texture (polarization patterns) of the particle rather than its macroshape. Critically, we found that the Angle of Linear Polarization (AOLP) signal is significantly more robust against contextual noise than the Degree of Linear Polarization (DOLP) signal. While the AOLP-based classification achieved superior overall performance, its strength lies in distinguishing between the two polyethylene plastics, showing a lower confusion rate between high-density and low-density polyethylene. Conversely, the DOLP signal demonstrated slightly worse overall classification results but excels at accurately identifying the polypropylene class, which it isolated with greater success than AOLP.
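For readers unfamiliar with the two signals, AOLP and DOLP are standard quantities derived from the linear Stokes parameters. The sketch below assumes a camera that records intensities behind polarizers at 0°, 45°, 90°, and 135°, which is a common sensor layout but is not stated in the abstract; the function names are illustrative.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical
    s2 = i45 - i135                     # +45 deg vs. -45 deg
    return s0, s1, s2

def dolp(s0, s1, s2):
    """Degree of linear polarization, in [0, 1] for physical inputs."""
    return math.hypot(s1, s2) / s0 if s0 > 0 else 0.0

def aolp(s1, s2):
    """Angle of linear polarization in radians, in (-pi/2, pi/2]."""
    return 0.5 * math.atan2(s2, s1)
```

Note that AOLP depends only on the ratio of S2 to S1, whereas DOLP is normalized by total intensity S0, so the two signals respond differently to intensity changes.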
[179] DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling
Zhicheng Li,Kunyang Sun,Rui Yao,Hancheng Zhu,Fuyuan Hu,Jiaqi Zhao,Zhiwen Shao,Yong Zhou
Main category: cs.CV
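TL;DR: This paper proposes DTTNet, which combines linguistic priors with a dark-aware module to resolve shadow-background confusion in video shadow detection, and uses a tokenized temporal module to model dynamic shadow deformation efficiently, achieving high accuracy with real-time inference.
Details
Motivation: Video shadow detection faces two core problems: shadows are misjudged against complex backgrounds, and shadows deform dynamically as illumination changes. Conventional methods struggle to handle both.
Contribution: Proposes the VMM and DSB modules, which use linguistic priors and dark-region semantics to explicitly separate shadows from dark objects; designs the TTB module, which captures dynamic shadow changes efficiently through tokenized temporal modeling.
Method: 1. Extracts text-guided features via VMM and DSB to resolve shadow-background confusion; 2. Introduces adaptive mask reweighting and edge masks to improve training; 3. Decouples spatiotemporal learning via TTB, modeling temporal information efficiently with tokens.
Result: Achieves state-of-the-art detection accuracy on multiple benchmark datasets with real-time inference.
Insight: Combining linguistic priors with dark-region semantics effectively resolves background confusion in shadow detection; tokenized temporal modeling is an efficient way to capture dynamic changes in video tasks.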
Abstract: Video shadow detection confronts two entwined difficulties: distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination. To address shadow-background ambiguity, we leverage linguistic priors through the proposed Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB), extracting text-guided features to explicitly differentiate shadows from dark objects. Furthermore, we introduce adaptive mask reweighting to downweight penumbra regions during training and apply edge masks at the final decoder stage for better supervision. For temporal modeling of variable shadow shapes, we propose a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning. TTB summarizes cross-frame shadow semantics into learnable temporal tokens, enabling efficient sequence encoding with minimal computation overhead. Comprehensive experiments on multiple benchmark datasets demonstrate state-of-the-art accuracy and real-time inference efficiency. Code is available at https://github.com/city-cheng/DTTNet.
[180] PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data
Ayushi Sharma,Johanna Trost,Daniel Lusk,Johannes Dollinger,Julian Schrader,Christian Rossi,Javier Lopatin,Etienne Laliberté,Simon Haberstroh,Jana Eichel,Daniel Mederer,Jose Miguel Cerda-Paredes,Shyam S. Phartyal,Lisa-Maricia Schwarz,Anja Linstädter,Maria Conceição Caldeira,Teja Kattenborn
Main category: cs.CV
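TL;DR: PlantTraitNet is a weakly supervised, multi-modal, multi-task deep learning framework that predicts plant traits from citizen science photos and produces global trait distribution maps through spatial aggregation. It is more accurate than existing trait maps, demonstrating the potential of citizen science data for ecological research.
Details
Motivation: Global plant trait maps are essential for understanding ecosystem processes, but existing approaches are limited by high cost and sparse geographic coverage. Citizen science data, such as large collections of plant photographs, offer a new way to address this.
Contribution: Proposes the PlantTraitNet framework, which predicts plant traits from citizen science photos and generates global maps with deep learning; shows that the approach is more accurate than existing trait maps.
Method: Combines a multi-modal (photos and geographic information), multi-task learning framework with an uncertainty-aware module, and trains the model with weak supervision to predict four key plant traits.
Result: The generated global trait maps outperform existing trait products when validated against the sPlotOpen dataset, showing higher accuracy and scalability.
Insight: Combining citizen science data with computer vision and geospatial AI offers a new, efficient pathway for global ecological research and Earth system modeling.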
Abstract: Global maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.
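The abstract says individual predictions are aggregated across space but does not specify how. As one plausible sketch under that caveat, the code below bins geotagged predictions into latitude/longitude cells and combines them with inverse-variance weights, so that uncertain predictions contribute less. The cell size, tuple layout, and function name are assumptions made for illustration.

```python
import math
from collections import defaultdict

def aggregate_to_grid(predictions, cell_deg=1.0):
    """predictions: iterable of (lat, lon, trait_value, variance) tuples.

    Returns {(lat_cell, lon_cell): inverse-variance weighted mean}.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum(w * value), sum(w)]
    for lat, lon, value, variance in predictions:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        weight = 1.0 / variance  # assumes variance > 0
        sums[cell][0] += weight * value
        sums[cell][1] += weight
    return {cell: wv / w for cell, (wv, w) in sums.items()}
```

With equal variances this reduces to a plain per-cell mean; the uncertainty estimates only change the result where confident and unconfident predictions disagree.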
[181] From Attribution to Action: Jointly ALIGNing Predictions and Explanations
Dongsheng Hong,Chao Chen,Yanhui Chen,Shanshan Lin,Zhihao Chen,Xiangwen Liao
Main category: cs.CV
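TL;DR: The paper proposes ALIGN, a framework that jointly trains a classifier and a mask generator to produce high-quality, task-relevant masks, improving both interpretability and generalization.
Details
Motivation: Existing explanation-guided learning methods rely on external annotations or heuristic segmentation, whose low-quality supervision signals can degrade model performance.
Contribution: Proposes the ALIGN framework, which needs no external annotations and jointly optimizes a classifier and a mask generator to produce high-quality masks and improve model performance.
Method: ALIGN trains the classifier and the mask generator iteratively: the mask generator produces task-relevant masks, and the classifier is optimized for both prediction accuracy and alignment with those masks.
Result: On the VLCS and Terra Incognita benchmarks, ALIGN outperforms six baselines in both in-distribution and out-of-distribution settings.
Insight: High-quality mask supervision improves not only interpretability but also generalization.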
Abstract: Explanation-guided learning (EGL) has shown promise in aligning model predictions with interpretable reasoning, particularly in computer vision tasks. However, most approaches rely on external annotations or heuristic-based segmentation to supervise model explanations, which can be noisy, imprecise and difficult to scale. In this work, we provide both empirical and theoretical evidence that low-quality supervision signals can degrade model performance rather than improve it. In response, we propose ALIGN, a novel framework that jointly trains a classifier and a masker in an iterative manner. The masker learns to produce soft, task-relevant masks that highlight informative regions, while the classifier is optimized for both prediction accuracy and alignment between its saliency maps and the learned masks. By leveraging high-quality masks as guidance, ALIGN improves both interpretability and generalizability, showing its superiority across various settings. Experiments on the two domain generalization benchmarks, VLCS and Terra Incognita, show that ALIGN consistently outperforms six strong baselines in both in-distribution and out-of-distribution settings. Besides, ALIGN also yields superior explanation quality concerning sufficiency and comprehensiveness, highlighting its effectiveness in producing accurate and interpretable models.
[182] FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection
Yulin Chen,Zeyuan Wang,Tianyuan Yu,Yingmei Wei,Liang Bai
Main category: cs.CV
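TL;DR: This paper proposes FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. It uses feature alignment, a score distribution balance module, and pixel-guard regularization to balance CLIPscore gains against image quality. Experiments show that optimized images raise CLIPscore while preserving visual fidelity. The authors also find that grayscale conversion effectively weakens the fooling effect, and build on this to propose a color channel sensitivity-driven tampering detection mechanism with 91% accuracy.
Details
Motivation: CLIP-based models excel at multimodal alignment, but this alignment can be exploited, which makes the image quality metric CLIPscore fragile. The paper studies how to fool CLIPscore through feature-space misalignment and explores a defense.
Contribution: 1. Proposes the FoCLIP framework, which constructs images that fool CLIPscore using feature alignment and regularization; 2. Finds that grayscale conversion substantially reduces the fooling effect; 3. Proposes a tampering detection mechanism based on color channel sensitivity.
Method: FoCLIP integrates three core modules: feature alignment to reduce the image-text modality gap, a score distribution balance module, and pixel-guard regularization. It optimizes the multimodal output balance with stochastic gradient descent to maximize the predicted CLIPscore.
Result: Optimized images achieve significantly higher CLIPscore while preserving visual fidelity. Grayscale conversion effectively reduces the fooling effect. The proposed tampering detection mechanism reaches 91% accuracy on standard benchmarks.
Insight: 1. The multimodal alignment behind CLIPscore is fragile and exploitable; 2. Grayscale conversion can serve as an effective defense against fooling; 3. Color channel sensitivity is a valuable cue for image tampering detection.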
Abstract: The well-aligned attribute of CLIP-based models enables effective applications such as CLIPscore, a widely adopted image quality assessment metric. However, such a CLIP-based metric is vulnerable because of its delicate multimodal alignment. In this work, we propose FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Based on the stochastic gradient descent technique, FoCLIP integrates three key components to construct fooling examples: feature alignment as the core module to reduce image-text modality gaps, the score distribution balance module, and pixel-guard regularization, which collectively optimize multimodal output equilibrium between CLIPscore performance and image quality. Such a design can be engineered to maximize the CLIPscore predictions across diverse input prompts, despite exhibiting either visual unrecognizability or semantic incongruence with the corresponding adversarial prompts from human perceptual perspectives. Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that optimized images can achieve significant improvement in CLIPscore while preserving high visual fidelity. In addition, we found that grayscale conversion induces significant feature degradation in fooling images, exhibiting noticeable CLIPscore reduction while preserving statistical consistency with original images. Inspired by this phenomenon, we propose a color channel sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks. In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems and the corresponding defense method.
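The grayscale observation suggests a simple detection rule, sketched below under explicit assumptions. The score follows the usual CLIPScore definition (a rescaled, clipped cosine similarity with weight 2.5), the grayscale conversion uses the standard BT.601 luma weights, and `looks_tampered` with its `max_drop` threshold is a hypothetical stand-in for the paper's detection mechanism, whose details the abstract does not give. Embeddings are passed in as plain lists; obtaining them from a CLIP model is out of scope here.

```python
import math

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style metric: w * max(cosine similarity, 0)."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / norm, 0.0) if norm > 0 else 0.0

def to_grayscale(pixels):
    """Replace each (r, g, b) pixel with its BT.601 luma on all channels."""
    out = []
    for r, g, b in pixels:
        y = 0.299 * r + 0.587 * g + 0.114 * b
        out.append((y, y, y))
    return out

def looks_tampered(score_color, score_gray, max_drop=0.2):
    """Flag an image whose score collapses after grayscale conversion."""
    return (score_color - score_gray) > max_drop
```

In use, one would score the original image and its grayscale version against the same prompt and flag the image when the score falls by more than the chosen threshold.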
[183] PADM: A Physics-aware Diffusion Model for Attenuation Correction
Trung Kien Pham,Hoang Minh Vu,Anh Duc Chu,Dac Thai Nguyen,Trung Thanh Nguyen,Thao Nguyen Truong,Mai Hong Son,Thanh Trung Nguyen,Phi Le Nguyen
Main category: cs.CV
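TL;DR: The paper proposes PADM, a physics-aware, diffusion-based attenuation correction method that improves the correction of attenuation artifacts in cardiac SPECT imaging without relying on costly and less accessible SPECT/CT systems.
Details
Motivation: Attenuation artifacts in cardiac SPECT imaging seriously affect diagnostic accuracy, and existing SPECT/CT systems are costly and add radiation exposure, so a low-cost, CT-free alternative is needed.
Contribution: 1. Proposes PADM, which performs attenuation correction by combining a diffusion model with physics priors; 2. Releases the CardiAC dataset, containing paired NAC and AC reconstructions from 424 patient studies.
Method: PADM builds on a diffusion model and introduces explicit physics priors through a teacher-student distillation mechanism. It needs only NAC input at inference while retaining physics-informed supervision during training.
Result: Experiments show that PADM outperforms state-of-the-art generative models on both quantitative metrics and visual assessment, with higher reconstruction fidelity.
Insight: Combining physics priors with generative models can improve performance on medical imaging tasks while reducing dependence on expensive hardware.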
Abstract: Attenuation artifacts remain a significant challenge in cardiac Myocardial Perfusion Imaging (MPI) using Single-Photon Emission Computed Tomography (SPECT), often compromising diagnostic accuracy and reducing clinical interpretability. While hybrid SPECT/CT systems mitigate these artifacts through CT-derived attenuation maps, their high cost, limited accessibility, and added radiation exposure hinder widespread clinical adoption. In this study, we propose a novel CT-free solution to attenuation correction in cardiac SPECT. Specifically, we introduce Physics-aware Attenuation Correction Diffusion Model (PADM), a diffusion-based generative method that incorporates explicit physics priors via a teacher–student distillation mechanism. This approach enables attenuation artifact correction using only Non-Attenuation-Corrected (NAC) input, while still benefiting from physics-informed supervision during training. To support this work, we also introduce CardiAC, a comprehensive dataset comprising 424 patient studies with paired NAC and Attenuation-Corrected (AC) reconstructions, alongside high-resolution CT-based attenuation maps. Extensive experiments demonstrate that PADM outperforms state-of-the-art generative models, delivering superior reconstruction fidelity across both quantitative metrics and visual assessment.
[184] GFix: Perceptually Enhanced Gaussian Splatting Video Compression
Siyue Teng,Ge Gao,Duolikun Danier,Yuxuan Jiang,Fan Zhang,Thomas Davis,Zoe Liu,David Bull
Main category: cs.CV
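TL;DR: GFix is a perceptually enhanced Gaussian Splatting video compression framework that combines a single-step diffusion model with a modulated LoRA scheme to improve compression efficiency and visual quality.
Details
Motivation: Existing video codecs based on 3D Gaussian Splatting (3DGS) show noticeable visual artifacts and low compression ratios. GFix addresses these through perceptual enhancement, building on the assumption that such artifacts resemble the noisy latents sampled during diffusion training.
Contribution: 1. Proposes GFix, a content-adaptive framework that uses a single-step diffusion model as a neural enhancer. 2. Proposes a modulated LoRA scheme that freezes the low-rank decompositions and modulates the hidden states, improving compression efficiency.
Method: GFix combines a single-step diffusion model with modulated LoRA: the former removes artifacts, and the latter improves compression performance through an efficient adaptation mechanism.
Result: GFix achieves BD-rate savings of up to 72.1% in LPIPS and 21.4% in FID over the existing method GSVC.
Insight: Applying the noisy-latent assumption from diffusion training to 3DGS compression demonstrates the effectiveness of perceptual enhancement; the modulated LoRA scheme shows that the diffusion backbone can be adapted efficiently.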
Abstract: 3D Gaussian Splatting (3DGS) enhances 3D scene reconstruction through explicit representation and fast rendering, demonstrating potential benefits for various low-level vision tasks, including video compression. However, existing 3DGS-based video codecs generally exhibit more noticeable visual artifacts and relatively low compression ratios. In this paper, we specifically target the perceptual enhancement of 3DGS-based video compression, based on the assumption that artifacts from 3DGS rendering and quantization resemble noisy latents sampled during diffusion training. Building on this premise, we propose a content-adaptive framework, GFix, comprising a streamlined, single-step diffusion model that serves as an off-the-shelf neural enhancer. Moreover, to increase compression efficiency, we propose a modulated LoRA scheme that freezes the low-rank decompositions and modulates the intermediate hidden states, thereby achieving efficient adaptation of the diffusion backbone with highly compressible updates. Experimental results show that GFix delivers strong perceptual quality enhancement, outperforming GSVC with up to 72.1% BD-rate savings in LPIPS and 21.4% in FID.
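One way to read "freezes the low-rank decompositions and modulates the intermediate hidden states" is sketched below. This is an interpretation, not the published formulation: the layer output is taken to be W x + B (m ⊙ (A x)), where W, A, and B are frozen and only the rank-sized modulation vector m changes per content. All names are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def modulated_lora_forward(W, A, B, modulation, x):
    """y = W x + B (m * (A x)) with W, A, B frozen and m the only update."""
    base = matvec(W, x)                                   # frozen backbone path
    hidden = matvec(A, x)                                 # rank-r hidden state
    hidden = [m * h for m, h in zip(modulation, hidden)]  # per-content modulation
    delta = matvec(B, hidden)                             # project back up
    return [b + d for b, d in zip(base, delta)]
```

Under this reading, adapting to a new video requires transmitting only r numbers per adapted layer rather than the full A and B matrices, which would explain why the updates are described as highly compressible.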
[185] Learning from the Right Patches: A Two-Stage Wavelet-Driven Masked Autoencoder for Histopathology Representation Learning
Raneen Younis,Louay Hamdi,Lukas Chavez,Zahra Ahmadi
Main category: cs.CV
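TL;DR: This paper proposes a wavelet-driven, two-stage masked autoencoder (WISE-MAE) that selects meaningful tissue regions under wavelet guidance, overcoming the limitations of conventional random sampling in digital pathology and improving the quality of learned representations.
Details
Motivation: Whole-slide images (WSIs) in digital pathology are extremely large and annotations are scarce. Random patch sampling in conventional masked autoencoders (MAEs) brings in noisy or irrelevant regions, limiting the model's ability to capture meaningful tissue patterns. Self-supervised learning therefore needs a way to incorporate structure and biological relevance.
Contribution: 1) Proposes WISE-MAE, a lightweight, domain-adapted framework that improves MAE pretraining with a wavelet-driven, two-stage patch selection strategy; 2) Mirrors the diagnostic workflow of pathologists by screening structurally rich regions from coarse to fine; 3) Validates the method on multiple cancer datasets.
Method: 1) In the first stage, a wavelet transform screens for structurally rich regions at low magnification; 2) In the second stage, high-resolution patches are extracted from the selected regions for detailed MAE modeling; 3) The overall process mirrors the pathological diagnostic workflow.
Result: On lung, renal, and colorectal cancer datasets, WISE-MAE achieves competitive representation quality and downstream classification performance while remaining efficient under weak supervision.
Insight: Bringing domain knowledge, such as wavelet analysis and the pathological diagnostic workflow, into self-supervised learning can improve a model's understanding of complex medical images, especially when annotations are scarce.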
Abstract: Whole-slide images are central to digital pathology, yet their extreme size and scarce annotations make self-supervised learning essential. Masked Autoencoders (MAEs) with Vision Transformer backbones have recently shown strong potential for histopathology representation learning. However, conventional random patch sampling during MAE pretraining often includes irrelevant or noisy regions, limiting the model’s ability to capture meaningful tissue patterns. In this paper, we present a lightweight and domain-adapted framework that brings structure and biological relevance into MAE-based learning through a wavelet-informed patch selection strategy. WISE-MAE applies a two-step coarse-to-fine process: wavelet-based screening at low magnification to locate structurally rich regions, followed by high-resolution extraction for detailed modeling. This approach mirrors the diagnostic workflow of pathologists and improves the quality of learned representations. Evaluations across multiple cancer datasets, including lung, renal, and colorectal tissues, show that WISE-MAE achieves competitive representation quality and downstream classification performance while maintaining efficiency under weak supervision.
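The coarse screening step can be illustrated with a one-level 2D Haar transform, used here as an assumed stand-in because the abstract does not name the wavelet or the scoring rule. Each low-magnification patch is scored by the energy of its detail coefficients, and the top-k patches are kept for high-resolution extraction; the function names are hypothetical.

```python
def haar_detail_energy(patch):
    """Sum of squared Haar detail coefficients over 2x2 blocks.

    patch: 2D list of intensities with even height and width.
    """
    energy = 0.0
    for i in range(0, len(patch), 2):
        for j in range(0, len(patch[0]), 2):
            a, b = patch[i][j], patch[i][j + 1]
            c, d = patch[i + 1][j], patch[i + 1][j + 1]
            lh = (a - b + c - d) / 2  # horizontal detail
            hl = (a + b - c - d) / 2  # vertical detail
            hh = (a - b - c + d) / 2  # diagonal detail
            energy += lh * lh + hl * hl + hh * hh
    return energy

def select_patches(patches, k):
    """Indices of the k patches with the highest detail energy."""
    ranked = sorted(range(len(patches)),
                    key=lambda i: haar_detail_energy(patches[i]),
                    reverse=True)
    return ranked[:k]
```

Flat background regions score zero under this measure, while regions with edges and texture score higher, which matches the goal of steering pretraining away from empty or uninformative areas of the slide.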
[186] Exploring the “Great Unseen” in Medieval Manuscripts: Instance-Level Labeling of Legacy Image Collections with Zero-Shot Models
Christofer Meinecke,Estelle Guéville,David Joseph Wrisley
Main category: cs.CV
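TL;DR: The paper explores how zero-shot models can be used for instance-level labeling of images in medieval manuscripts, with the aim of providing richer training data for computer vision techniques.
Details
Motivation: Labeling the contents of medieval manuscripts with traditional methods is slow and labor-intensive, so more efficient technical support is needed.
Contribution: Proposes an approach for labeling medieval manuscript images with zero-shot models, providing training data for instance segmentation and multimodal models.
Method: Applies state-of-the-art zero-shot models to segment and describe the entire manuscript folio.
Result: The approach is intended to label manuscript content efficiently and support the training of computer vision techniques; the abstract does not report quantitative results.
Insight: Zero-shot models show potential for the analysis of historical documents and could reduce the cost of manual annotation.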
Abstract: We aim to theorize the medieval manuscript page and its contents more holistically, using state-of-the-art techniques to segment and describe the entire manuscript folio, for the purpose of creating richer training data for computer vision techniques, namely instance segmentation, and multimodal models for medieval-specific visual content.
[187] TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding
Duc Nguyen,Yan-Ling Lai,Qilin Zhang,Prabin Gyawali,Benedikt Schwab,Olaf Wysocki,Thomas H. Kolbe
Main category: cs.CV
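TL;DR: TrueCity introduces the first urban 3D semantic segmentation benchmark that includes accurately annotated real-world point clouds, semantic 3D city models, and corresponding simulated point clouds, aiming to quantify the synthetic-to-real domain shift and improve 3D scene understanding.
Details
Motivation: Annotated real-world data is limited in existing datasets, while synthetic data lacks real-world complexity and sensor noise, which leads to a domain shift.
Contribution: TrueCity provides synchronized real and simulated point cloud data, supporting the analysis and quantification of the synthetic-to-real domain shift.
Method: Real and simulated point clouds of the same city are collected and annotated in sync, with semantic classes aligned with international 3D city modeling standards.
Result: Experiments show that TrueCity can quantify the domain shift and highlight strategies for using synthetic data to enhance real-world scene understanding.
Insight: TrueCity fills the gap left by the absence of a synchronized synthetic/real benchmark and provides an important resource for developing generalizable data-driven models.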
Abstract: 3D semantic scene understanding remains a long-standing challenge in the 3D computer vision community. One of the key issues pertains to limited real-world annotated data to facilitate generalizable models. The common practice to tackle this issue is to simulate new data. Although synthetic datasets offer scalability and perfect labels, their designer-crafted scenes fail to capture real-world complexity and sensor noise, resulting in a synthetic-to-real domain gap. Moreover, no benchmark provides synchronized real and simulated point clouds for segmentation-oriented domain shift analysis. We introduce TrueCity, the first urban semantic segmentation benchmark with cm-accurate annotated real-world point clouds, semantic 3D city models, and annotated simulated point clouds representing the same city. TrueCity proposes segmentation classes aligned with international 3D city modeling standards, enabling consistent evaluation of synthetic-to-real gap. Our extensive experiments on common baselines quantify domain shift and highlight strategies for exploiting synthetic data to enhance real-world 3D scene understanding. We are convinced that the TrueCity dataset will foster further development of sim-to-real gap quantification and enable generalizable data-driven models. The data, code, and 3D models are available online: https://tum-gis.github.io/TrueCity/
[188] Certified L2-Norm Robustness of 3D Point Cloud Recognition in the Frequency Domain
Liang Zhou,Qiming Wang,Tianze Chen
Main category: cs.CV
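TL;DR: FreqCert is a new certification framework that analyzes robustness in the frequency domain, providing structured certification against global L2-bounded perturbations for 3D point cloud recognition and achieving higher certified accuracy and robustness to perturbation.
Details
Motivation: Existing methods certify against point-wise perturbations and overlook distortions of the overall geometric structure, which can cause misclassification. Frequency-domain analysis can offer more stable robustness.
Contribution: Proposes the FreqCert framework, which generates multiple sub-point clouds via the graph Fourier transform and frequency-domain subsampling and classifies them by majority voting, enabling structured certification against global L2 perturbations.
Method: The point cloud is transformed to the frequency domain with the graph Fourier transform (GFT), frequency-domain subsampling generates multiple sub-point clouds, each is classified independently, and majority voting yields the final prediction.
Result: On the ModelNet40 and ScanObjectNN datasets, FreqCert shows higher certified accuracy and empirical accuracy under strong perturbations.
Insight: Spectral representations capture the intrinsic structure of point clouds more stably, offering a new direction for certifiable robustness in 3D point cloud recognition.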
Abstract: 3D point cloud classification is a fundamental task in safety-critical applications such as autonomous driving, robotics, and augmented reality. However, recent studies reveal that point cloud classifiers are vulnerable to structured adversarial perturbations and geometric corruptions, posing risks to their deployment in safety-critical scenarios. Existing certified defenses limit point-wise perturbations but overlook subtle geometric distortions that preserve individual points yet alter the overall structure, potentially leading to misclassification. In this work, we propose FreqCert, a novel certification framework that departs from conventional spatial domain defenses by shifting robustness analysis to the frequency domain, enabling structured certification against global L2-bounded perturbations. FreqCert first transforms the input point cloud via the graph Fourier transform (GFT), then applies structured frequency-aware subsampling to generate multiple sub-point clouds. Each sub-cloud is independently classified by a standard model, and the final prediction is obtained through majority voting, where sub-clouds are constructed based on spectral similarity rather than spatial proximity, making the partitioning more stable under L2 perturbations and better aligned with the object’s intrinsic structure. We derive a closed-form lower bound on the certified L2 robustness radius and prove its tightness under minimal and interpretable assumptions, establishing a theoretical foundation for frequency domain certification. Extensive experiments on the ModelNet40 and ScanObjectNN datasets demonstrate that FreqCert consistently achieves higher certified accuracy and empirical accuracy under strong perturbations. Our results suggest that spectral representations provide an effective pathway toward certifiable robustness in 3D point cloud recognition.
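The final aggregation step, majority voting over independently classified sub-clouds, can be sketched as follows. The graph Fourier transform, the subsampling, and the paper's closed-form certified radius are not reproduced here; the vote margin returned below is only the quantity that voting-based certificates typically reason about, and the function name is illustrative.

```python
from collections import Counter

def majority_vote(sub_predictions):
    """Return (winning label, margin over the runner-up) for sub-cloud votes.

    Ties are broken by label order so the result is deterministic.
    """
    counts = Counter(sub_predictions)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_label, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    return top_label, top_votes - runner_up_votes
```

Intuitively, a perturbation can change the final prediction only if it flips enough sub-cloud votes to close this margin, which is why a partitioning that stays stable under L2 perturbations matters.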
[189] From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge
Hui Lu,Yi Yu,Song Xia,Yiming Yang,Deepu Rajan,Boon Poh Ng,Alex Kot,Xudong Jiang
Main category: cs.CV
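TL;DR: This paper proposes TVA, a new adversarial attack that targets downstream models and multi-modal large language models (MLLMs) built on video foundation models (VFMs), and that can be launched without knowledge of the victim's task, training data, or model architecture.
Details
Motivation: VFMs have advanced video-related tasks, but their openness and accessibility also introduce security risks. The paper explores the adversarial vulnerability of VFMs, in particular how attacks can be launched without knowing the victim task.
Contribution: Proposes TVA, which uses temporal-aware adversarial perturbations and a bidirectional contrastive learning mechanism to expose the adversarial vulnerability of VFMs in the absence of task knowledge.
Method: TVA combines bidirectional contrastive learning with a temporal consistency loss and exploits the temporal representation dynamics of VFMs to generate effective perturbations, avoiding the need to train expensive surrogate models.
Result: TVA's effectiveness is validated on 24 video tasks, demonstrating attacks against both downstream models and MLLMs.
Insight: The paper exposes an underexplored security vulnerability in the deployment of VFMs and underlines the need to design robust video models.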
Abstract: Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.
[190] Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation
Yuxuan Zhou,Tao Yu,Wen Huang,Yuheng Zhang,Tao Dai,Shu-Tao Xia
Main category: cs.CV
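TL;DR: This paper proposes CRDA, a reinforcement learning-based adaptive data augmentation framework for improving the generalization of deepfake detectors. By dynamically selecting and adjusting augmentation strategies, CRDA copes with complex forgery features and outperforms existing methods on multi-domain datasets.
Details
Motivation: Existing deepfake detection methods typically use fixed data augmentation strategies, which struggle with the evolving and complex forgery features found in the real world (e.g., facial warping, expression manipulation). The paper addresses this with a method that adapts dynamically to the diversity of forgery features.
Contribution: 1. Proposes the CRDA framework, which combines reinforcement learning and causal inference to generate augmented samples suited to the detector's current learning state. 2. Designs a configurable pool of forgery operations and a dynamic action selection mechanism to increase the diversity of forgery features. 3. Uses causal inference to remove spurious correlations and focus on causally invariant features.
Method: 1. Uses reinforcement learning (RL) to dynamically select and adjust augmentation operations according to the detector's learning state. 2. Introduces a pool of forgery operations to generate diverse forged samples, with causal inference suppressing task-irrelevant biases. 3. Adopts a simple-to-complex curriculum learning strategy to build up the detector's capability step by step.
Result: Experiments on multiple cross-domain datasets show that CRDA significantly improves detector generalization and outperforms state-of-the-art (SOTA) methods.
Insight: 1. Dynamic augmentation strategies handle complex forgery features better than fixed ones. 2. Combining reinforcement learning with causal inference helps remove spurious correlations and improve generalization. 3. Curriculum learning plays an important role in learning forgery features progressively.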
Abstract: The generalization capability of deepfake detectors is critical for real-world use. Data augmentation via synthetic fake face generation effectively enhances generalization, yet current SOTA methods rely on fixed strategies, raising a key question: Is a single static augmentation sufficient, or does the diversity of forgery features demand dynamic approaches? We argue existing methods overlook the evolving complexity of real-world forgeries (e.g., facial warping, expression manipulation), which fixed policies cannot fully simulate. To address this, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework guiding detectors to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples via a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector's current learning state. Central to our approach is integrating reinforcement learning (RL) and causal inference. An RL agent dynamically selects augmentation actions based on detector performance to efficiently explore the vast augmentation space, adapting to increasingly challenging forgeries. Simultaneously, the agent introduces action space variations to generate heterogeneous forgery patterns, guided by causal inference to mitigate spurious correlations, suppressing task-irrelevant biases and focusing on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model's learned representations. Extensive experiments show our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.
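The idea of an agent choosing augmentation actions from detector feedback can be illustrated with a deliberately simplified stand-in: an epsilon-greedy bandit over a pool of forgery operations. This is not the CRDA agent, which the abstract describes only at a high level; the class name, the operation names in the usage example, and the choice of reward (for instance, the detector's loss on the augmented sample) are all assumptions.

```python
import random

class AugmentationBandit:
    """Epsilon-greedy selection over a pool of augmentation operations."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}  # running mean reward

    def select(self):
        """Explore with probability epsilon, otherwise pick the best action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        """Incrementally update the mean reward of the chosen action."""
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```

For example, with operations such as `"warp"` and `"blend"`, rewarding an operation when the detector struggles on its output steers sampling toward forgeries the detector has not yet mastered, which is the curriculum-like behaviour the abstract describes.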
[191] ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora
Nikolas Adaloglou,Diana Petrusheva,Mohamed Asker,Felix Michels,Markus Kollmann
Main category: cs.CV
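TL;DR: ClusterMine is an unsupervised visual OOD detection method that needs no pre-defined labels; it achieves SOTA performance by combining concept mining from text corpora with CLIP's zero-shot consistency.
Details
Motivation: Conventional methods depend on pre-defined positive labels, but in practice these labels can be unavailable, unreliable, or outdated as distributions shift over time, so a truly unsupervised OOD detection method is needed.
Contribution: ClusterMine is the first method to reach SOTA OOD detection performance without access to positive labels, using a concept mining framework that combines visual clustering with zero-shot consistency.
Method: Mines positive labels from text corpora, extracting positive concepts through visual clustering and CLIP's zero-shot image-text consistency, which removes the dependence on labels.
Result: ClusterMine scales across a range of CLIP models and is robust to covariate in-distribution shifts, outperforming existing methods.
Insight: Combining concept mining from text corpora with vision models offers a new direction for unsupervised OOD detection, with particular promise where labels are scarce or change quickly.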
Abstract: Large-scale visual out-of-distribution (OOD) detection has witnessed remarkable progress by leveraging vision-language models such as CLIP. However, a significant limitation of current methods is their reliance on a pre-defined set of in-distribution (ID) ground-truth label names (positives). These fixed label names can be unavailable, unreliable at scale, or become less relevant due to in-distribution shifts after deployment. Towards truly unsupervised OOD detection, we utilize widely available text corpora for positive label mining under a general concept mining paradigm, bypassing the need for positives. Within this framework, we propose ClusterMine, a novel positive label mining method. ClusterMine is the first method to achieve state-of-the-art OOD detection performance without access to positive labels. It extracts positive concepts from a large text corpus by combining visual-only sample consistency (via clustering) and zero-shot image-text consistency. Our experimental study reveals that ClusterMine is scalable across a plethora of CLIP models and achieves state-of-the-art robustness to covariate in-distribution shifts. The code is available at https://github.com/HHU-MMBS/clustermine_wacv_official.
[192] LeCoT: revisiting network architecture for two-view correspondence pruning
Luanyuan Dai,Xiaoyu Du,Jinhui Tang
Main category: cs.CV
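TL;DR: LeCoT is a new two-view correspondence pruning network. Its Spatial-Channel Fusion Transformer block and prediction block exploit global context information effectively, and it outperforms existing methods.
Details
Motivation: Existing methods usually adopt a multilayer perceptron (MLP) backbone and add extra modules to improve context handling, a design with inherent limitations. LeCoT aims to capture correspondence context in a more natural way.
Contribution: 1. A Spatial-Channel Fusion Transformer block that efficiently uses spatial and channel global context among sparse correspondences; 2. a prediction block that uses intermediate-stage features to generate a probability set guiding subsequent learning stages.
Method: The core of LeCoT is the Spatial-Channel Fusion Transformer block and the prediction block. The former fuses spatial and channel information; the latter progressively refines the probability set to mitigate information loss.
Result: LeCoT outperforms existing methods on two-view correspondence pruning, relative pose estimation, homography estimation, visual localization, and 3D reconstruction.
Insight: By building on the strengths of Transformers, LeCoT handles global context efficiently without extra modules, offering a new approach to processing sparse correspondences.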
Abstract: Two-view correspondence pruning aims to accurately remove incorrect correspondences (outliers) from initial ones and is widely applied to various computer vision tasks. Current popular strategies adopt multilayer perceptron (MLP) as the backbone, supplemented by additional modules to enhance the network's ability to handle context information, which is a known limitation of MLPs. In contrast, we introduce a novel perspective for capturing correspondence context information without extra design modules. To this end, we design a two-view correspondence pruning network called LeCoT, which can naturally leverage global context information at different stages. Specifically, the core design of LeCoT is the Spatial-Channel Fusion Transformer block, a newly proposed component that efficiently utilizes both spatial and channel global context information among sparse correspondences. In addition, we integrate the proposed prediction block that utilizes correspondence features from intermediate stages to generate a probability set, which acts as guiding information for subsequent learning phases, allowing the network to more effectively capture robust global context information. Notably, this prediction block progressively refines the probability set, thereby mitigating the issue of information loss that is common in the traditional one. Extensive experiments prove that the proposed LeCoT outperforms state-of-the-art methods in correspondence pruning, relative pose estimation, homography estimation, visual localization, and 3D reconstruction tasks. The code is available at https://github.com/Dailuanyuan2024/LeCoT-Revisiting-Network-Architecture-for-Two-View-Correspondence-Pruning.
[193] How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions
Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen
Main category: cs.CV
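TL;DR: This paper studies bias caused by semantic binding in text-to-image generative models. It proposes a way to quantify that bias and a training-free context-bias control framework, achieving over 10% debiasing improvement.
Details
Motivation: Prior work mostly examines bias in single-object prompts and ignores how semantic binding between objects and attributes affects bias, so existing debiasing methods perform poorly in these settings.
Contribution: A bias adherence score that quantifies the effect of semantic binding on bias, and a training-free context-bias control framework that markedly improves debiasing.
Method: The bias adherence score measures how strongly an object-attribute binding activates bias. The training-free context-bias control framework is built on token decoupling.
Result: On compositional generation tasks the method achieves over 10% debiasing improvement, showing its effectiveness in complex contexts.
Insight: Reducing bias without disrupting essential semantic relationships is a fundamental challenge, and current debiasing methods show clear limitations in semantically bound contexts.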
Abstract: Text-to-image generative models often exhibit bias related to sensitive attributes. However, current research tends to focus narrowly on single-object prompts with limited contextual diversity. In reality, each object or attribute within a prompt can contribute to bias. For example, the prompt “an assistant wearing a pink hat” may reflect female-inclined biases associated with a pink hat. The neglected joint effects of the semantic binding in the prompts cause significant failures in current debiasing approaches. This work initiates a preliminary investigation on how bias manifests under semantic binding, where contextual associations between objects and attributes influence generative outcomes. We demonstrate that the underlying bias distribution can be amplified based on these associations. Therefore, we introduce a bias adherence score that quantifies how specific object-attribute bindings activate bias. To delve deeper, we develop a training-free context-bias control framework to explore how token decoupling can facilitate the debiasing of semantic bindings. This framework achieves over 10% debiasing improvement in compositional generation tasks. Our analysis of bias scores across various attribute-object bindings and token decorrelation highlights a fundamental challenge: reducing bias without disrupting essential semantic relationships. These findings expose critical limitations in current debiasing approaches when applied to semantically bound contexts, underscoring the need to reassess prevailing bias mitigation strategies.
[194] HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving
Zhongyu Xia,Zhiwei Lin,Yongtao Wang,Ming-Hsuan Yang
Main category: cs.CV
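TL;DR: HENet++ is a hybrid encoding and multi-task learning framework for 3D perception and end-to-end autonomous driving. It uses a large image encoder for short-term frames and a small encoder for long-term frames, extracts both dense and sparse features, improves multi-task performance, and lowers the collision rate.
Details
Motivation: Large image encoders, high-resolution images, and long-term temporal inputs can substantially improve 3D perception, but computational constraints make them hard to combine. Different tasks also favor different feature representations, so a single model struggles to stay strong across tasks. HENet++ targets both problems.
Contribution: 1. A hybrid image encoding network that pairs a large encoder with a small encoder for frames at different temporal ranges. 2. Simultaneous extraction of dense and sparse features to give each task a better-suited representation. 3. State-of-the-art multi-task 3D perception and end-to-end autonomous driving results on the nuScenes benchmark.
Method: 1. A large image encoder processes short-term frames and a small encoder processes long-term frames. 2. Dense and sparse features are extracted jointly. 3. The framework supports multimodal inputs and various 3D feature extraction methods.
Result: On the nuScenes benchmark, HENet++ achieves state-of-the-art multi-task 3D perception results and the lowest collision rate on the end-to-end autonomous driving task.
Insight: Combining hybrid encoding with multi-task learning balances compute cost against performance while adapting to the needs of different tasks, which offers a new direction for autonomous driving system design.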
Abstract: Three-dimensional feature extraction is a critical component of autonomous driving systems, where perception tasks such as 3D object detection, bird’s-eye-view (BEV) semantic segmentation, and occupancy prediction serve as important constraints on 3D features. While large image encoders, high-resolution images, and long-term temporal inputs can significantly enhance feature quality and deliver remarkable performance gains, these techniques are often incompatible in both training and inference due to computational resource constraints. Moreover, different tasks favor distinct feature representations, making it difficult for a single model to perform end-to-end inference across multiple tasks while maintaining accuracy comparable to that of single-task models. To alleviate these issues, we present the HENet and HENet++ framework for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module. The proposed architecture maintains compatibility with various existing 3D feature extraction methods and supports multimodal inputs. HENet++ achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark, while also attaining the lowest collision rate on the nuScenes end-to-end autonomous driving benchmark.
[195] Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction
Changyue Shi,Chuxiao Yang,Xinyuan Hu,Minghao Chen,Wenwen Pan,Yan Yang,Jiajun Ding,Zhou Yu,Jun Yu
Main category: cs.CV
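TL;DR: Sparse4DGS is a new method for dynamic scene reconstruction from sparse frames. Texture-aware deformation regularization and canonical optimization substantially improve reconstruction quality under sparse-frame input.
Details
Motivation: Existing dynamic Gaussian Splatting methods rely on dense-frame video sequences, but in real scenarios often only sparse frames are available, which leads to poor reconstruction.
Contribution: 1) Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction; 2) texture-aware deformation regularization and canonical optimization that improve reconstruction; 3) validation on multiple datasets.
Method: 1) Texture-Aware Deformation Regularization: a texture-based depth alignment loss constrains Gaussian deformation; 2) Texture-Aware Canonical Optimization: texture-based noise is added to the gradient descent of the canonical Gaussian field.
Result: On NeRF-Synthetic, HyperNeRF, and other datasets, Sparse4DGS outperforms existing dynamic and few-shot techniques under sparse-frame input.
Insight: Texture-rich regions are critical for sparse-frame reconstruction, and optimizing specifically for them markedly improves dynamic scene reconstruction quality.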
Abstract: Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.
[196] ProcGen3D: Learning Neural Procedural Graph Representations for Image-to-3D Reconstruction
Xinyi Zhang,Daoyi Gao,Naiqi Li,Angela Dai
Main category: cs.CV
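TL;DR: ProcGen3D generates 3D content through neural procedural graphs, reconstructing complex 3D assets from an input image. Monte Carlo Tree Search (MCTS) guides generation, and the method outperforms existing generative methods and domain-specific techniques.
Details
Motivation: Existing 3D generation methods are limited in complexity and controllability, while domain-specific modeling techniques generalize poorly. Inspired by procedural generators used in production, ProcGen3D learns neural procedural graph representations to make image-to-3D reconstruction more flexible and accurate.
Contribution: 1. A graph-based procedural representation for 3D asset generation. 2. Edge-based tokenization and a transformer prior that map images to procedural graphs. 3. MCTS-guided sampling that improves alignment between the output and the input image.
Method: 1. Procedural graphs are encoded with edge-based tokenization, and a transformer is trained to predict the next token. 2. MCTS-guided sampling steers generation so that the output procedural graph better matches the input image. 3. The approach covers many object categories that can be synthesized by procedural generators.
Result: Experiments on cacti, trees, and bridges show that ProcGen3D outperforms existing generative methods and domain-specific techniques, and that it generalizes to real-world images despite training only on synthetic data.
Insight: Procedural representations provide a high-level way to model 3D content; combining deep learning with planning algorithms such as MCTS can markedly improve the quality and controllability of 3D reconstruction.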
Abstract: We introduce ProcGen3D, a new approach for 3D content creation by generating procedural graph abstractions of 3D objects, which can then be decoded into rich, complex 3D assets. Inspired by the prevalent use of procedural generators in production 3D applications, we propose a sequentialized, graph-based procedural graph representation for 3D assets. We use this to learn to approximate the landscape of a procedural generator for image-based 3D reconstruction. We employ edge-based tokenization to encode the procedural graphs, and train a transformer prior to predict the next token conditioned on an input RGB image. Crucially, to enable better alignment of our generated outputs to an input image, we incorporate Monte Carlo Tree Search (MCTS) guided sampling into our generation process, steering output procedural graphs towards more image-faithful reconstructions. Our approach is applicable across a variety of objects that can be synthesized with procedural generators. Extensive experiments on cacti, trees, and bridges show that our neural procedural graph generation outperforms both state-of-the-art generative 3D methods and domain-specific modeling techniques. Furthermore, this enables improved generalization on real-world input images, despite training only on synthetic data.
[197] Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use
Sébastien Thuau,Siba Haidar,Rachid Chelouah
Main category: cs.CV
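TL;DR: This paper compares three strategies for federated video violence detection (zero-shot inference, LoRA fine-tuning, and personalized federated learning), examines their trade-offs in accuracy, energy use, and multimodal reasoning, and proposes a hybrid deployment strategy.
Details
Motivation: Video surveillance increasingly demands privacy preservation and low energy use, yet deploying large vision-language models (VLMs) raises energy and sustainability challenges. Federated learning protects privacy well, but how it is implemented involves a trade-off between performance and energy.
Contribution: The first comparison of LoRA-tuned VLMs and personalized CNNs for federated violence detection in terms of performance and energy, plus a hierarchical classification scheme based on semantic similarity and category grouping that markedly improves VLM multiclass accuracy.
Method: Three approaches are evaluated on the RWF-2000 and RLVS datasets: zero-shot inference, LoRA fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 3D CNN. Hierarchical category grouping is used to improve VLM performance.
Result: All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves better calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh), while VLMs provide stronger multimodal reasoning.
Insight: A hybrid deployment strategy (efficient CNNs by default, VLMs invoked selectively for complex scenes) balances performance, privacy, and energy use, and is feasible for practical applications.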
Abstract: Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA, while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM multiclass accuracy from 65.31% to 81% on the UCF-Crime dataset. To our knowledge, this is the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection, with explicit energy and CO2e quantification. Our results inform hybrid deployment strategies that default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning.
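The hybrid deployment idea in the abstract's last sentence can be sketched as a confidence-gated router. This is only an illustration under stated assumptions: the abstract does not specify a routing rule, so the confidence threshold, the `Classifier` type, and the function names below are hypothetical. The helper `energy_ratio` simply divides the two energy figures reported in the abstract.

```python
from typing import Callable, Tuple

# Hypothetical interface: a classifier maps a clip identifier to (label, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def route_clip(clip_id: str, cnn: Classifier, vlm: Classifier,
               threshold: float = 0.8) -> Tuple[str, str]:
    """Return (label, model_used); the VLM is called only when the CNN is unsure.

    The threshold value is an assumption, not a figure from the paper.
    """
    label, confidence = cnn(clip_id)
    if confidence >= threshold:
        return label, "cnn"
    vlm_label, _ = vlm(clip_id)
    return vlm_label, "vlm"


def energy_ratio(cnn_wh: float = 240.0, lora_wh: float = 570.0) -> float:
    """Ratio of the reported energy figures: 3D CNN vs. federated LoRA."""
    return cnn_wh / lora_wh
```

With the reported figures the ratio is 240 / 570, about 0.42, which matches the abstract's "roughly half". In a real system the threshold would be tuned on validation data so that the VLM is engaged only for the fraction of clips that justifies its energy cost.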
[198] Automated Estimation of Anatomical Risk Metrics for Endoscopic Sinus Surgery Using Deep Learning
Konrad Reuter,Lennart Thaysen,Bilkay Doruk,Sarah Latus,Brigitte Holst,Benjamin Becker,Dennis Eggert,Christian Betz,Anna-Sophie Hoffmann,Alexander Schlaefer
Main category: cs.CV
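TL;DR: The paper proposes an automated deep learning pipeline that estimates anatomical risk scores for endoscopic sinus surgery by localizing key anatomical landmarks with heatmap regression.
Details
Motivation: Endoscopic sinus surgery requires preoperative assessment of the skull base anatomy to reduce risks such as cerebrospinal fluid leakage, but existing anatomical risk scores (Keros, Gera, and TMS) depend on time-consuming manual measurements.
Contribution: An automated deep learning method that estimates the Keros, Gera, and TMS scores, reducing the time spent on manual measurement.
Method: Anatomical landmarks are localized via heatmap regression, and the accuracy of a direct approach is compared with that of a global-to-local learning strategy.
Result: Mean absolute errors on the relevant anatomical measurements are 0.506 mm for Keros, 4.516° for Gera, and 0.802 mm / 0.777 mm for the TMS classification.
Insight: Deep learning can provide an efficient and accurate alternative to manual measurement and support clinical decision-making.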
Abstract: Endoscopic sinus surgery requires careful preoperative assessment of the skull base anatomy to minimize risks such as cerebrospinal fluid leakage. Anatomical risk scores like the Keros, Gera and Thailand-Malaysia-Singapore score offer a standardized approach but require time-consuming manual measurements on coronal CT or CBCT scans. We propose an automated deep learning pipeline that estimates these risk scores by localizing key anatomical landmarks via heatmap regression. We compare a direct approach to a specialized global-to-local learning strategy and find mean absolute errors on the relevant anatomical measurements of 0.506mm for the Keros, 4.516° for the Gera and 0.802mm / 0.777mm for the TMS classification.
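To make the pipeline concrete, the following minimal sketch shows how a landmark could be read from a heatmap and turned into a Keros type. It is not the authors' implementation. The argmax decoding, the choice of two landmarks, and the pixel spacing are assumptions. The Keros bins follow the commonly cited definition (type I: 1-3 mm, type II: 4-7 mm, type III: 8-16 mm); the continuous cut-offs at 4 mm and 8 mm are also an assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]


def heatmap_argmax(heatmap: List[List[float]]) -> Point:
    """Return the (row, col) of the strongest heatmap response."""
    best, best_val = (0, 0), -math.inf
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best = val, (r, c)
    return best


def distance_mm(p: Point, q: Point, spacing_mm: Tuple[float, float]) -> float:
    """Euclidean distance between two pixel landmarks, scaled by (row, col) spacing."""
    dr = (p[0] - q[0]) * spacing_mm[0]
    dc = (p[1] - q[1]) * spacing_mm[1]
    return math.hypot(dr, dc)


def keros_type(depth_mm: float) -> int:
    """Map olfactory fossa depth to Keros type 1, 2, or 3 (assumed cut-offs)."""
    if depth_mm < 4.0:
        return 1
    if depth_mm < 8.0:
        return 2
    return 3
```

Because the reported mean absolute error for the Keros measurement (0.506 mm) is small relative to the width of each bin, most errors of that size would change the class only for depths near a cut-off.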
[199] Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization
Binyan Xu,Fan Yang,Di Tang,Xilin Dai,Kehuan Zhang
Main category: cs.CV
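TL;DR: The paper proposes GCB, a new clean-image backdoor attack that optimizes the trigger to reduce the attack's negative effect on clean accuracy and thereby improve stealth.
Details
Motivation: The high poison rate required by existing clean-image backdoor attacks causes a noticeable drop in clean accuracy, which undermines stealth and limits the potential threat in security-critical applications.
Contribution: The GCB framework, which uses a conditional InfoGAN to identify natural image features that serve as triggers, achieving an attack that is both potent and stealthy.
Method: A conditional InfoGAN optimizes the trigger, and the victim model learns the backdoor from a small number of poisoned samples with markedly less impact on clean accuracy.
Result: GCB is validated on six datasets, five architectures, and four tasks; clean accuracy drops by less than 1%, and the attack is resilient against most existing defenses.
Insight: Optimizing the trigger with a generative adversarial network is an effective backdoor strategy that balances attack potency and stealth.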
Abstract: Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB’s remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.
[200] Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
JiaKui Hu,Shanshan Zhao,Qing-Guo Chen,Xuerui Qiu,Jialun Liu,Zhao Xu,Weihua Luo,Kaifu Zhang,Yanye Lu
Main category: cs.CV
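TL;DR: This paper presents Omni-View, which jointly models 3D scene understanding, novel view synthesis, and geometry estimation to explore the principle that "generation facilitates understanding", and it surpasses existing specialized models in performance and versatility.
Details
Motivation: The paper explores how generation tasks (such as novel view synthesis and 3D scene generation) can assist 3D scene understanding, with the goal of unified multimodal modeling.
Contribution: The Omni-View model, which jointly models understanding and generation tasks to unify multimodal understanding and generation for 3D scenes.
Method: Omni-View consists of an understanding model, a texture module, and a geometry module, trained with a two-stage strategy. The texture module handles appearance synthesis, and the geometry module provides explicit geometric constraints.
Result: It reaches a SOTA score of 55.4 on the VSI-Bench benchmark while performing strongly on both understanding and generation tasks.
Insight: Generation tasks such as novel view synthesis can strengthen a model's 3D scene understanding, which points to the potential of joint multi-task modeling.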
Abstract: This paper presents Omni-View, which extends the unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that “generation facilitates understanding”. Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model’s holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.
[201] Noise & pattern: identity-anchored Tikhonov regularization for robust structural anomaly detection
Alexander Bauer,Klaus-Robert Müller
Main category: cs.CV
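TL;DR: This paper proposes a self-supervised autoencoder approach to structural anomaly detection in industrial images. Structured perturbations combined with Gaussian noise as regularization markedly improve detection and segmentation, with the best results on the MVTec AD benchmark.
Details
Motivation: Collecting examples of every possible anomaly is infeasible in industrial inspection, and conventional methods are of limited use for structural defects. The paper proposes a new self-supervised learning framework to address this.
Contribution: 1. A structured corruption model that mimics structural defects; 2. Gaussian noise added on top of the corruption as a Tikhonov regularizer, which stabilizes reconstruction and improves detection accuracy; 3. state-of-the-art performance on MVTec AD.
Method: A self-supervised autoencoder is trained with structured perturbations and Gaussian-noise Tikhonov regularization; the task is a hybrid of segmentation and inpainting.
Result: On MVTec AD the method achieves the best reported I/P-AUROC of 99.9/99.4.
Insight: Gaussian noise used as regularization stabilizes the model, and structured perturbations that mimic real defects improve generalization.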
Abstract: Anomaly detection plays a pivotal role in automated industrial inspection, aiming to identify subtle or rare defects in otherwise uniform visual patterns. As collecting representative examples of all possible anomalies is infeasible, we tackle structural anomaly detection using a self-supervised autoencoder that learns to repair corrupted inputs. To this end, we introduce a corruption model that injects artificial disruptions into training images to mimic structural defects. While reminiscent of denoising autoencoders, our approach differs in two key aspects. First, instead of unstructured i.i.d. noise, we apply structured, spatially coherent perturbations that make the task a hybrid of segmentation and inpainting. Second, and counterintuitively, we add and preserve Gaussian noise on top of the occlusions, which acts as a Tikhonov regularizer anchoring the Jacobian of the reconstruction function toward identity. This identity-anchored regularization stabilizes reconstruction and further improves both detection and segmentation accuracy. On the MVTec AD benchmark, our method achieves state-of-the-art results (I/P-AUROC: 99.9/99.4), supporting our theoretical framework and demonstrating its practical relevance for automatic inspection.
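The claim that preserved Gaussian noise anchors the Jacobian toward identity can be illustrated with a first-order sketch. The notation is ours, not the paper's: $f$ is the reconstruction function, $J_f$ its Jacobian, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. We assume a squared-error loss, ignore the structured occlusions, and read "add and preserve" as meaning the reconstruction target is the noisy input $x + \varepsilon$.

```latex
\begin{aligned}
f(x+\varepsilon) &\approx f(x) + J_f(x)\,\varepsilon, \\
\mathbb{E}_{\varepsilon}\,\bigl\| f(x+\varepsilon) - (x+\varepsilon) \bigr\|_2^2
&\approx \bigl\| f(x) - x \bigr\|_2^2
 + 2\,\bigl(f(x)-x\bigr)^{\top}\bigl(J_f(x)-I\bigr)\,\mathbb{E}[\varepsilon]
 + \mathbb{E}\bigl\| \bigl(J_f(x)-I\bigr)\varepsilon \bigr\|_2^2 \\
&= \bigl\| f(x) - x \bigr\|_2^2 + \sigma^2 \bigl\| J_f(x) - I \bigr\|_F^2 .
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[\varepsilon] = 0$, and the last term follows from $\mathbb{E}[\varepsilon\varepsilon^{\top}] = \sigma^2 I$. Under these assumptions the noise contributes a penalty of the Tikhonov type on $J_f(x) - I$, whereas a standard denoising target $x$ would penalize $\| J_f(x) \|_F^2$ instead.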
[202] Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation
Seungheon Song,Jaekoo Lee
Main category: cs.CV
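TL;DR: The paper proposes a text-driven OOD segmentation method that combines a vision-language model with distance-based prompts to improve segmentation of anomalous objects in autonomous driving scenes.
Details
Motivation: Segmenting anomalous objects is critical for safety in autonomous driving, yet existing methods rarely exploit the rich semantic information of the vision-language space. The authors argue that linguistic cues can improve segmentation in complex scenes.
Contribution: 1. A text-driven OOD segmentation method that uses a semantically diverse vision-language space; 2. Distance-Based OOD prompts and an OOD Semantic Augmentation strategy; 3. SOTA performance on several public datasets.
Method: A vision-language model's encoder is combined with a transformer decoder, and visual and textual information are aligned through distance-based prompts and OOD semantic augmentation.
Result: On Fishyscapes and other datasets, the method achieves the best performance in both pixel-level and object-level evaluations.
Insight: Vision-language models supply rich semantic support for OOD segmentation, and distance-based prompts with semantic augmentation improve generalization to unseen objects.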
Abstract: In autonomous driving and robotics, ensuring road safety and reliable decision-making critically depends on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space, which provides rich linguistic knowledge, remains an underexplored field. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model’s encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation for OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets such as Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly datasets, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations. This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.
[203] 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation
Mengmeng Liu,Jiuming Liu,Yunpeng Zhang,Jiangtao Li,Michael Ying Yang,Francesco Nex,Hao Cheng
Main category: cs.CV
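TL;DR: 4DSTR is a new method for dynamic 4D content generation. Spatial-temporal rectification improves generation quality and consistency, addressing the weak spatial-temporal consistency and poor adaptation to rapid temporal variation of existing methods.
Details
Motivation: Existing 4D generation methods struggle with spatial-temporal consistency and with adapting to rapid temporal variations because they lack effective spatial-temporal modeling. This paper aims to solve these problems.
Contribution: The 4DSTR framework, which modulates generative 4D Gaussians with spatial-temporal rectification to produce high-quality, spatially and temporally consistent dynamic content.
Method: A rectification mechanism based on temporal correlation across the generated sequence, together with an adaptive spatial densification and pruning strategy that dynamically adds or deletes Gaussian points to follow temporal variation.
Result: Experiments show that 4DSTR achieves state-of-the-art performance in video-to-4D generation in terms of quality, consistency, and adaptability.
Insight: Spatial-temporal rectification and dynamic density adjustment are key techniques for improving 4D generation quality.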
Abstract: Remarkable advances in recent 2D image and 3D shape generation have induced a significant focus on dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations, due to the lack of effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across generated 4D sequences is designed to rectify deformable scales and rotations and guarantee temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy is proposed to address significant temporal variations by dynamically adding or deleting Gaussian points with the awareness of their pre-frame movements. Extensive experiments demonstrate that our 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.
[204] MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
Tianhao Peng,Haochen Wang,Yuanxing Zhang,Zekun Wang,Zili Wang,Ge Zhang,Jian Yang,Shihao Li,Yanghai Wang,Xintao Wang,Houyi Li,Wei Ji,Pengfei Wan,Wenhao Huang,Zhaoxiang Zhang,Jiaheng Liu
Main category: cs.CV
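TL;DR: MVU-Eval is the first multi-video understanding benchmark for Multimodal Large Language Models (MLLMs). It fills the gap left by evaluations limited to single-video understanding, covering 1,824 question-answer pairs and 4,959 videos across eight core competencies.
Details
Motivation: Existing benchmarks are limited to single-video understanding, whereas real applications such as sports analytics and autonomous driving require multi-video understanding. MVU-Eval addresses this evaluation gap.
Contribution: MVU-Eval is the first comprehensive multi-video understanding benchmark; it spans diverse domains and application scenarios and reveals significant performance gaps in current MLLMs.
Method: MVU-Eval assesses eight core competencies of MLLMs through 1,824 question-answer pairs over 4,959 videos, covering both fundamental perception tasks and high-order reasoning tasks.
Result: Evaluation of open-source and closed-source MLLMs shows significant limitations in current models' multi-video understanding.
Insight: Multi-video understanding is a key direction for future MLLM development, and the public release of MVU-Eval should foster related research.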
Abstract: The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs’ ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.
[205] StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Yilong Chen,Xiang Bai,Zhibin Wang,Chengyu Bai,Yuhan Dai,Ming Lu,Shanghang Zhang
Main category: cs.CV
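TL;DR: StreamKV is a training-free video question-answering framework. It dynamically partitions video into semantic segments and optimizes KV cache retrieval and compression, substantially improving the efficiency and accuracy of long-video question answering.
Details
Motivation: Current Video-LLMs are inefficient and lose information on long videos, so an efficient method for KV cache retrieval and compression is needed.
Contribution: StreamKV, which dynamically partitions video into semantic segments and uses summary vectors and a guidance prompt to optimize KV cache retrieval and compression, improving both accuracy and efficiency.
Method: The video stream is partitioned dynamically into semantic segments; a summary vector is computed per segment for retrieval; a guidance prompt drives KV cache compression; retrieval and compression are unified in a layer-adaptive manner.
Result: On StreamingVQA benchmarks, StreamKV significantly outperforms existing online methods, improving accuracy, memory efficiency, and computational latency.
Insight: Dynamic semantic segmentation and layer-adaptive KV cache handling are key to better long-video question answering.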
Abstract: Video Large Language Models (Video-LLMs) have demonstrated significant potential in the areas of video captioning, search, and summarization. However, current Video-LLMs still face challenges with long real-world videos. Recent methods have introduced a retrieval mechanism that retrieves query-relevant KV caches for question answering, enhancing the efficiency and accuracy of long real-world videos. However, the compression and retrieval of KV caches are still not fully explored. In this paper, we propose StreamKV, a training-free framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Compared to previous methods that used uniform partitioning, StreamKV dynamically partitions video streams into semantic segments, which better preserves semantic information. For KV cache retrieval, StreamKV calculates a summary vector for each segment to retain segment-level information essential for retrieval. For KV cache compression, StreamKV introduces a guidance prompt designed to capture the key semantic elements within each segment, ensuring only the most informative KV caches are retained for answering questions. Moreover, StreamKV unifies KV cache retrieval and compression within a single module, performing both in a layer-adaptive manner, thereby further improving the effectiveness of streaming video question answering. Extensive experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs, achieving superior accuracy while substantially improving both memory efficiency and computational latency. The code has been released at https://github.com/sou1p0wer/StreamKV.
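The segment-then-retrieve idea described in the abstract can be sketched as follows. This is a toy illustration, not the released StreamKV code: the cosine-similarity boundary rule, the mean-pooled summary vector, the threshold, and all function names are assumptions, and plain feature vectors stand in for real KV caches.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_stream(frames: List[Vector], threshold: float = 0.8) -> List[List[int]]:
    """Start a new segment whenever consecutive frame features diverge (assumed rule)."""
    segments: List[List[int]] = []
    for i, frame in enumerate(frames):
        if i == 0 or cosine(frames[i - 1], frame) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def summary_vector(frames: List[Vector], indices: List[int]) -> Vector:
    """Mean-pool the frame features of one segment (assumed summary)."""
    dim = len(frames[indices[0]])
    return [sum(frames[i][d] for i in indices) / len(indices) for d in range(dim)]


def retrieve(frames: List[Vector], segments: List[List[int]],
             query: Vector, top_k: int = 1) -> List[int]:
    """Return the indices of the top_k segments whose summaries best match the query."""
    scored = [(cosine(summary_vector(frames, seg), query), idx)
              for idx, seg in enumerate(segments)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:top_k]]
```

Compared with uniform partitioning, a content-dependent boundary keeps frames from the same scene in one segment, so a single summary vector is less likely to mix unrelated content.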
[206] Glioma C6: A Novel Dataset for Training and Benchmarking Cell Segmentation
Roman Malashin,Svetlana Pashkevich,Daniil Ilyukhin,Arseniy Volkov,Valeria Yachnaya,Andrey Denisov,Maria Mikhalkova
Main category: cs.CV
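TL;DR: Glioma C6 is a new open dataset for instance segmentation of glioma C6 cells. It contains 75 high-resolution phase-contrast microscopy images with over 12,000 annotated cells and serves as a testbed for biomedical image analysis. The dataset has two parts, one for benchmarking and one for generalization testing.
Details
Motivation: Existing cell segmentation datasets fall short in diversity and annotation quality, especially for cancer cell research. Glioma C6 aims to fill this gap with a high-quality resource for training and benchmarking deep learning models.
Contribution: 1. The Glioma C6 dataset, with high-quality cell annotations and morphological categorization; 2. an evaluation of several generalist segmentation models on the dataset that exposes their limitations; 3. evidence that training on Glioma C6 significantly improves segmentation performance.
Method: Images are acquired with phase-contrast microscopy, and biologists provide cell annotations and morphological categorization. The dataset has two parts: one for benchmarking (controlled parameters) and one for generalization testing (varying conditions). Several generalist segmentation models are evaluated and compared.
Result: Models trained on Glioma C6 perform significantly better on segmentation than models not trained on it, and existing generalist segmentation models show limitations on this dataset.
Insight: Glioma C6 is a valuable resource for developing stronger cell segmentation models, especially for cancer research, and the morphological categorization should help make fuller use of the image data.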
Abstract: We present Glioma C6, a new open dataset for instance segmentation of glioma C6 cells, designed as both a benchmark and a training resource for deep learning models. The dataset comprises 75 high-resolution phase-contrast microscopy images with over 12,000 annotated cells, providing a realistic testbed for biomedical image analysis. It includes soma annotations and morphological cell categorization provided by biologists. Additional categorization of cells, based on morphology, aims to enhance the utilization of image data for cancer cell research. Glioma C6 consists of two parts: the first is curated with controlled parameters for benchmarking, while the second supports generalization testing under varying conditions. We evaluate the performance of several generalist segmentation models, highlighting their limitations on our dataset. Our experiments demonstrate that training on Glioma C6 significantly enhances segmentation performance, reinforcing its value for developing robust and generalizable models. The dataset is publicly available for researchers.
[207] VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
Ying Cheng,Yu-Ho Lin,Min-Hung Chen,Fu-En Yang,Shang-Hong Lai
Main category: cs.CV
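TL;DR: VADER is an LLM-based framework for video anomaly understanding that combines keyframe object relation features with visual cues to strengthen causal understanding of anomalous behavior in video.
Details
Motivation: Traditional video anomaly detection focuses only on detecting and localizing anomalies and ignores the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behavior.
Contribution: The VADER framework, which uses an Anomaly Scorer, the CAES strategy, a Relation Feature Extractor, and the CORE module to model dynamic object interactions, produce compact relational representations, and combine them with an LLM for detailed causal description and question answering.
Method: 1. An Anomaly Scorer assigns per-frame anomaly scores; 2. the CAES strategy captures the causal context of each anomalous event; 3. the Relation Feature Extractor and CORE model object relations; 4. an LLM generates descriptions and answers questions.
Result: On multiple real-world VAU benchmarks, VADER achieves strong results on anomaly description, explanation, and causal reasoning.
Insight: Modeling relations between objects and capturing causal context are critical to understanding video anomalies, and integrating an LLM further improves semantic understanding and explanation.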
Abstract: Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
[208] Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection
Huizai Yao,Sicheng Zhao,Pengteng Li,Yi Cui,Shuo Lu,Weiyu Guo,Yunfan Lu,Yijie Xu,Hui Xiong
Main category: cs.CV
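TL;DR: The paper proposes a Source-Free Object Detection (SFOD) framework that uses Vision Foundation Models (VFMs) as an external knowledge source, with three modules that strengthen feature alignment and label quality and substantially improve detection performance.
Details
Motivation: Existing SFOD methods rely mainly on the source model's internal knowledge, which limits generalization and biases pseudo-labels. VFMs offer strong perception and broad generalization, but their potential in SFOD is largely untapped.
Contribution: The first framework to bring VFMs into SFOD, with three modules (PGFA, PIFA, DEPF) that improve global feature alignment, instance feature alignment, and pseudo-label quality respectively, markedly improving transferability and discriminability.
Method: Three modules, Patch-weighted Global Feature Alignment (PGFA), Prototype-based Instance Feature Alignment (PIFA), and Dual-source Enhanced Pseudo-label Fusion (DEPF), combine the feature strengths of VFMs with a pseudo-label fusion strategy.
Result: The method achieves SOTA performance on six benchmarks, validating that VFMs improve transferability and discriminability in SFOD.
Insight: VFMs as an external knowledge source can ease the domain adaptation problem in SFOD, and their large-scale pretrained features provide rich priors for cross-domain detection.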
Abstract: Source-Free Object Detection (SFOD) aims to adapt a source-pretrained object detector to a target domain without access to source data. However, existing SFOD methods predominantly rely on internal knowledge from the source model, which limits their capacity to generalize across domains and often results in biased pseudo-labels, thereby hindering both transferability and discriminability. In contrast, Vision Foundation Models (VFMs), pretrained on massive and diverse data, exhibit strong perception capabilities and broad generalization, yet their potential remains largely untapped in the SFOD setting. In this paper, we propose a novel SFOD framework that leverages VFMs as external knowledge sources to jointly enhance feature alignment and label quality. Specifically, we design three VFM-based modules: (1) Patch-weighted Global Feature Alignment (PGFA) distills global features from VFMs using patch-similarity-based weighting to enhance global feature transferability; (2) Prototype-based Instance Feature Alignment (PIFA) performs instance-level contrastive learning guided by momentum-updated VFM prototypes; and (3) Dual-source Enhanced Pseudo-label Fusion (DEPF) fuses predictions from detection VFMs and teacher models via an entropy-aware strategy to yield more reliable supervision. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art SFOD performance, validating the effectiveness of integrating VFMs to simultaneously improve transferability and discriminability.
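The entropy-aware fusion in DEPF can be illustrated with a minimal sketch. The abstract does not give the fusion rule, so the weighting below (each source weighted by exp(-entropy)) and the names `entropy` and `entropy_aware_fuse` are assumptions for illustration, not the authors' formulation. It only shows the general idea of trusting the more confident (lower-entropy) source more when two class distributions for the same box disagree.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_aware_fuse(p_vfm, p_teacher):
    """Fuse two class distributions for one matched box.

    Hypothetical rule: each source gets weight exp(-H), so the
    lower-entropy (more confident) source dominates the mixture.
    """
    w_vfm = math.exp(-entropy(p_vfm))
    w_teacher = math.exp(-entropy(p_teacher))
    total = w_vfm + w_teacher
    return [(w_vfm * a + w_teacher * b) / total
            for a, b in zip(p_vfm, p_teacher)]


# A confident VFM prediction fused with an uninformative teacher prediction.
fused = entropy_aware_fuse([0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3])
```

Because the result is a convex combination of two distributions, it is itself a valid distribution; here the fused probability of the first class lies above the plain average of the two sources.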
[209] Garbage Vulnerable Point Monitoring using IoT and Computer Vision
R. Kumar,A. Lall,S. Chaudhari,M. Kale,A. Vattem
Main category: cs.CV
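TL;DR: The paper proposes a smart system that combines the Internet of Things (IoT) and computer vision (CV) to monitor illegal dumping at garbage vulnerable points (GVPs) in urban areas, and shows experimentally that YOLO11m is the most effective of the evaluated detectors.
Details
Motivation: Municipal solid waste management is inefficient and illegal dumping is a serious problem. To address this, the authors build an IoT- and CV-based system that monitors GVPs in real time.
Contribution: 1) An IoT- and CV-based waste monitoring system; 2) a dataset collected in the Sangareddy district of Telangana, India; 3) an evaluation of several object detection models, in which YOLO11m attains the highest accuracy.
Method: The system uses a street-level camera and object detection algorithms (YOLOv8, YOLOv10, YOLO11m, and RT-DETR) to detect dumping in real time, and the models are evaluated on the collected dataset.
Result: YOLO11m reaches 92.39% accuracy and an mAP@50 of 0.91 in waste detection, the best among the evaluated models. The system also captures temporal dumping patterns (hourly, daily, and weekly).
Insight: A smart system combining IoT and CV can make waste management more efficient, and YOLO11m's high accuracy in complex scenes suggests its potential for environmental monitoring.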
Abstract: This paper proposes a smart way to manage municipal solid waste by using the Internet of Things (IoT) and computer vision (CV) to monitor illegal waste dumping at garbage vulnerable points (GVPs) in urban areas. The system can quickly detect and monitor dumped waste using a street-level camera and object detection algorithm. Data was collected from the Sangareddy district in Telangana, India. A series of comprehensive experiments was carried out using the proposed dataset to assess the accuracy and overall performance of various object detection models. Specifically, we performed an in-depth evaluation of YOLOv8, YOLOv10, YOLO11m, and RT-DETR on our dataset. Among these models, YOLO11m achieved the highest accuracy of 92.39% in waste detection, demonstrating its effectiveness in detecting waste. Additionally, it attains an mAP@50 of 0.91, highlighting its high precision. These findings confirm that the object detection model is well-suited for monitoring and tracking waste dumping events at GVP locations. Furthermore, the system effectively captures waste disposal patterns, including hourly, daily, and weekly dumping trends, ensuring comprehensive daily and nightly monitoring.
[210] Inference-Time Scaling of Diffusion Models for Infrared Data Generation
Kai A. Horstmann,Maxim Clouser,Kia Khezeli
Main category: cs.CV
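TL;DR: The paper proposes using a domain-adapted CLIP-based verifier at inference time to raise the quality of generated infrared images, improving diffusion-model output when infrared data is scarce.
Details
Motivation: Infrared imagery aids scene understanding under low visibility, but the scarcity of high-quality annotated data hinders the development of downstream vision models. Limited datasets have also made it hard to train foundation-level diffusion models for the infrared domain.
Contribution: An inference-time scaling approach in which a domain-adapted CLIP-based verifier guides the diffusion sampling process, consistently improving the quality of generated infrared images.
Method: Fine-tunes the FLUX.1-dev diffusion model on a small sample of infrared images with parameter-efficient techniques, then uses the trained CLIP-based verifier at inference time to guide generation toward outputs that better align with the input text prompt.
Result: Experiments show a 10% reduction in FID on the KAIST Multispectral Pedestrian Detection Benchmark dataset relative to unguided baseline samples.
Insight: Inference-time guidance is a promising way to bridge the domain gap in low-data settings.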
Abstract: Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples. Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.
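One simple form of verifier-based inference-time scaling is best-of-N selection: draw several candidates and keep the one the verifier scores highest. The abstract does not state which search strategy the authors use, so the sketch below is an assumption; `generate` and `score` are hypothetical stand-ins for a diffusion sampler and a CLIP-style verifier, and the toy callables exist only to make the example runnable.

```python
import random


def best_of_n(generate, score, prompt, n, seed=0):
    """Draw n candidates and return (best_candidate, best_score).

    generate(prompt, rng) -> candidate   (stand-in for a diffusion sampler)
    score(candidate, prompt) -> float    (stand-in for a verifier score)
    """
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate(prompt, rng)
        candidate_score = score(candidate, prompt)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: a "sample" is a number, and the verifier prefers values near 0.5.
def toy_generate(prompt, rng):
    return rng.random()


def toy_score(candidate, prompt):
    return -abs(candidate - 0.5)


_, score_1 = best_of_n(toy_generate, toy_score, "infrared street scene", n=1)
_, score_16 = best_of_n(toy_generate, toy_score, "infrared street scene", n=16)
```

With a fixed seed the first candidate is identical in both runs, so spending more samples can only keep or improve the selected verifier score; that monotonic trade of compute for quality is the point of inference-time scaling.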
[211] StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
Tianrui Feng,Zhi Li,Shuo Yang,Haocheng Xi,Muyang Li,Xiuyu Li,Lvmin Zhang,Keting Yang,Kelly Peng,Song Han,Maneesh Agrawala,Kurt Keutzer,Akio Kodaira,Chenfeng Xu
Main category: cs.CV
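TL;DR: StreamDiffusionV2 is a streaming system for real-time video generation that addresses the limits of earlier image-based diffusion models in temporal consistency and in low-latency real-time interaction.
Details
Motivation: Existing video diffusion models are optimized mainly for offline generation, whereas live streaming must meet strict service-level objectives (SLOs), such as a minimal time-to-first-frame and a per-frame deadline with low jitter.
Contribution: 1. An SLO-aware batching scheduler and a block scheduler; 2. a sink-token-guided rolling KV cache and a motion-aware noise controller; 3. scalable pipeline orchestration that achieves near-linear FPS scaling.
Method: Combines system-level optimizations (schedulers, caching) with parallelization across denoising steps and network layers, and supports heterogeneous GPU environments.
Result: Without TensorRT or quantization, the first frame renders within 0.5 s, and the system reaches 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs.
Insight: Through system-level optimization and a parallel design, StreamDiffusionV2 offers an efficient solution for generative live streaming, suited to users ranging from individual creators to enterprise-scale platforms.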
Abstract: Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live streaming products but have hit limits on temporal consistency due to the foundation of image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Besides, scalable multi-GPU serving for real-time streams remains largely unresolved so far. To address this, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token–guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1–4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible–from individual creators to enterprise-scale platforms.
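The idea behind a rolling KV cache with sink tokens can be sketched in a few lines. This is a generic illustration under stated assumptions, not the StreamDiffusionV2 implementation: the class name `SinkRollingCache`, the fixed sink count, and the eviction policy (keep the first few entries forever, keep a sliding window of the most recent ones) are all hypothetical, and real entries would be key/value tensors rather than integers.

```python
from collections import deque


class SinkRollingCache:
    """Keeps the first `num_sink` entries permanently plus a rolling window."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                       # never evicted
        self.recent = deque(maxlen=window)   # oldest entry dropped automatically

    def append(self, kv):
        # Fill the sink slots first, then roll over the recent window.
        if len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        """Entries visible to attention: sinks followed by the recent window."""
        return self.sink + list(self.recent)


cache = SinkRollingCache(num_sink=2, window=3)
for frame_index in range(10):
    cache.append(frame_index)
print(cache.entries())  # [0, 1, 7, 8, 9]
```

The cache size stays bounded at `num_sink + window` no matter how long the stream runs, which is what keeps per-frame latency predictable.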
[212] DIMO: Diverse 3D Motion Generation for Arbitrary Objects
Linzhan Mou,Jiahui Lei,Chen Wang,Lingjie Liu,Kostas Daniilidis
Main category: cs.CV
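TL;DR: DIMO is a generative approach that produces diverse 3D motions for arbitrary objects from a single image. Its core idea is to use the rich priors of well-trained video models to extract common motion patterns and embed them in a shared low-dimensional latent space.
Details
Motivation: Existing generative methods usually focus on specific objects or simple motions and cannot generate diverse 3D motions for arbitrary objects. DIMO aims to fill this gap.
Contribution: 1. DIMO, a generative method that produces diverse 3D motions for arbitrary objects from a single image; 2. a shared low-dimensional latent space and motion decoder that model the distribution of diverse motions; 3. support for applications such as 3D motion interpolation and language-guided motion generation.
Method: 1. Generate multiple videos of the same object with diverse motions and embed each motion into a latent vector; 2. train a shared motion decoder to learn the motion distribution, represented as neural key point trajectories; 3. drive canonical 3D Gaussians with the key points to model geometry and appearance.
Result: DIMO samples diverse 3D motions in a single forward pass and supports tasks such as 3D motion interpolation and language-guided generation.
Insight: With a shared latent space and motion decoder, DIMO models diverse motions of arbitrary objects effectively, showing the potential of generative models for 3D motion synthesis.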
Abstract: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. During inference time with learned latent space, we can instantly sample diverse 3D motions in a single-forward pass and support several interesting applications including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
[213] TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research
Han Zhang,Yiqing Shen,Roger D. Soberanis-Mukul,Ankita Ghosh,Hao Ding,Lalithkumar Seenivasan,Jose L. Porras,Zhekai Mao,Chenjia Li,Wenjie Xiao,Lonny Yarmus,Angela Christine Argento,Masaru Ishii,Mathias Unberath
Main category: cs.CV
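TL;DR: TwinOR is a framework for building high-fidelity, dynamic digital twins of operating rooms (ORs) to support embodied AI research and testing. It reconstructs static geometry from pre-scan videos and dynamic behavior through multi-view perception, and provides a controllable simulation environment. Experiments support its sensor-level realism and practical utility.
Details
Motivation: Safety and operational constraints keep embodied agents from freely perceiving and interacting in real ORs. Digital twins offer a safe, controllable environment, but building twins that are both high-fidelity and dynamic remains a challenge.
Contribution: Proposes TwinOR, which reconstructs the static geometry and dynamic behavior of an OR and fuses them into an immersive 3D environment for embodied AI research and testing.
Method: 1. Reconstruct static geometry from pre-scan videos; 2. model human and equipment motion through multi-view perception; 3. fuse the static and dynamic components into a controllable simulation environment.
Result: On geometry understanding and visual localization tasks, models such as FoundationStereo and ORB-SLAM3 run on TwinOR-synthesized data achieve performance within their reported accuracy on real indoor datasets.
Insight: Through a real-to-sim pipeline, TwinOR builds dynamic, high-fidelity digital twins of ORs, giving embodied AI an efficient and realistic environment for safe development and evaluation.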
Abstract: Developing embodied AI for intelligent surgical systems requires safe, controllable environments for continual learning and evaluation. However, safety regulations and operational constraints in operating rooms (ORs) limit embodied agents from freely perceiving and interacting in realistic settings. Digital twins provide high-fidelity, risk-free environments for exploration and training. How we may create photorealistic and dynamic digital representations of ORs that capture relevant spatial, visual, and behavioral complexity remains unclear. We introduce TwinOR, a framework for constructing photorealistic, dynamic digital twins of ORs for embodied AI research. The system reconstructs static geometry from pre-scan videos and continuously models human and equipment motion through multi-view perception of OR activities. The static and dynamic components are fused into an immersive 3D environment that supports controllable simulation and embodied exploration. The proposed framework reconstructs complete OR geometry with centimeter level accuracy while preserving dynamic interaction across surgical workflows, enabling realistic renderings and a virtual playground for embodied AI systems. In our experiments, TwinOR simulates stereo and monocular sensor streams for geometry understanding and visual localization tasks. Models such as FoundationStereo and ORB-SLAM3 on TwinOR-synthesized data achieve performance within their reported accuracy on real indoor datasets, demonstrating that TwinOR provides sensor-level realism sufficient for perception and localization challenges. By establishing a real-to-sim pipeline for constructing dynamic, photorealistic digital twins of OR environments, TwinOR enables the safe, scalable, and data-efficient development and benchmarking of embodied AI, ultimately accelerating the deployment of embodied AI from sim-to-real.
cs.NI [Back]
[214] Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Hanlin Cai,Houtianfu Wang,Haofan Dong,Kai Li,Ozgur B. Akan
Main category: cs.NI
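TL;DR: The paper proposes a graph representation-based model poisoning (GRMP) attack against the federated learning (FL)-enabled heterogeneous Internet of Agents (IoA). It builds a parameter correlation graph and uses an adversarial variational graph autoencoder to generate malicious models that evade existing defenses.
Details
Motivation: In the heterogeneous IoA, FL is a key enabler of collaboration among distributed agents, yet it is vulnerable to model poisoning. Existing defenses become fragile at large scale and under heterogeneous data, which motivates studying new attacks to expose system vulnerabilities.
Contribution: Proposes the GRMP attack, which applies graph representation techniques to model poisoning and uses a variational graph autoencoder to capture higher-order dependencies, producing malicious models that are hard to detect.
Method: 1. Build a parameter correlation graph from observed benign local models; 2. use an adversarial variational graph autoencoder (VGAE) to capture and reshape higher-order dependencies; 3. synthesize malicious local models that preserve benign-like statistics while embedding adversarial objectives.
Result: Experiments show a gradual drop in system accuracy under GRMP and that the prevailing defense mechanism fails to detect the attack, underscoring a severe threat to the IoA paradigm.
Insight: Graph representation techniques open a new avenue for model poisoning and expose the limits of existing FL defenses under higher-order dependencies and heterogeneous data.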
Abstract: Internet of Agents (IoA) envisions a unified, agent-centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale. Within this paradigm, federated learning (FL) serves as a key enabler that allows distributed LLM agents to co-train global models without centralizing data. However, the FL-enabled IoA system remains vulnerable to model poisoning attacks, and the prevailing distance and similarity-based defenses become fragile at billion-parameter scale and under heterogeneous data distributions. This paper proposes a graph representation-based model poisoning (GRMP) attack, which passively exploits observed benign local models to construct a parameter correlation graph and extends an adversarial variational graph autoencoder to capture and reshape higher-order dependencies. The GRMP attack synthesizes malicious local models that preserve benign-like statistics while embedding adversarial objectives, remaining elusive to detection at the server. Experiments demonstrate a gradual drop in system accuracy under the proposed attack and the ineffectiveness of the prevailing defense mechanism in detecting the attack, underscoring a severe threat to the ambitious IoA paradigm.
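The first step, building a parameter correlation graph from observed benign models, can be sketched as follows. The abstract does not define the graph precisely, so everything here is an assumption for illustration: nodes are parameter indices, an edge joins two parameters whose values are strongly Pearson-correlated across clients, and the names `pearson`, `correlation_graph`, and the 0.8 threshold are hypothetical. The VGAE stage is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_graph(models, threshold=0.8):
    """Return the edge set {(i, j)} over parameter indices.

    models: list of flat parameter vectors, one per observed benign client.
    An edge (i, j) with i < j is added when |corr(param_i, param_j)| >= threshold.
    """
    dim = len(models[0])
    columns = [[model[j] for model in models] for j in range(dim)]
    edges = set()
    for i in range(dim):
        for j in range(i + 1, dim):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                edges.add((i, j))
    return edges


# Three toy "local models" with three parameters each.
benign_models = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0]]
graph_edges = correlation_graph(benign_models)  # parameters 0 and 1 move together
```

The pairwise loop is quadratic in the number of parameters, so at the billion-parameter scale mentioned in the abstract a real attack would need some form of sampling or blocking; the sketch ignores that.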
cs.AI [Back]
[215] DiagnoLLM: A Hybrid Bayesian Neural Language Framework for Interpretable Disease Diagnosis
Bowen Xu,Xinyue Zeng,Jiazhen Hu,Tuo Wang,Adithya Kulkarni
Main category: cs.AI
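TL;DR: DiagnoLLM is a hybrid framework that combines Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. It reaches 88.0% accuracy in Alzheimer's disease detection and uses an LLM to generate reports tailored to physicians and patients.
Details
Motivation: Building trustworthy clinical AI systems requires not only high accuracy but also transparent, biologically grounded explanations.
Contribution: Proposes the DiagnoLLM framework, which integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation to achieve both high accuracy and interpretability.
Method: 1. GP-unmix: a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression; 2. a neural classifier that incorporates eQTL priors; 3. an LLM post-hoc module that generates audience-specific reports.
Result: Alzheimer's disease detection accuracy reaches 88.0%, and human evaluation rates the generated reports as accurate and actionable.
Insight: Deployed as a post-hoc reasoner rather than an end-to-end predictor, an LLM can support communication effectively within a hybrid diagnostic pipeline.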
Abstract: Building trustworthy clinical AI systems requires not only accurate predictions but also transparent, biologically grounded explanations. We present DiagnoLLM, a hybrid framework that integrates Bayesian deconvolution, eQTL-guided deep learning, and LLM-based narrative generation for interpretable disease diagnosis. DiagnoLLM begins with GP-unmix, a Gaussian Process-based hierarchical model that infers cell-type-specific gene expression profiles from bulk and single-cell RNA-seq data while modeling biological uncertainty. These features, combined with regulatory priors from eQTL analysis, power a neural classifier that achieves high predictive performance in Alzheimer’s Disease (AD) detection (88.0% accuracy). To support human understanding and trust, we introduce an LLM-based reasoning module that translates model outputs into audience-specific diagnostic reports, grounded in clinical features, attribution signals, and domain knowledge. Human evaluations confirm that these reports are accurate, actionable, and appropriately tailored for both physicians and patients. Our findings show that LLMs, when deployed as post-hoc reasoners rather than end-to-end predictors, can serve as effective communicators within hybrid diagnostic pipelines.
[216] ScRPO: From Errors to Insights
Lianrui Li,Dakuan Lu,Jiawei Shao,Chi Zhang,Xuelong Li
Main category: cs.AI
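TL;DR: ScRPO is a reinforcement learning framework that improves large language models on mathematical problems through self-reflection and error correction. Experiments show that it outperforms other post-training methods on multiple math reasoning benchmarks.
Details
Motivation: Large language models struggle with hard mathematical problems and need a way to self-improve using only limited external feedback.
Contribution: 1. The ScRPO framework, which combines trial-and-error learning with self-correction learning; 2. validation of its effectiveness on multiple math reasoning benchmarks.
Method: 1. Trial-and-error learning stage: train the model with GRPO and collect incorrect answers, with their questions, in an error pool; 2. Self-correction learning stage: guide the model to reflect on why its previous answers were wrong.
Result: ScRPO consistently outperforms several post-training methods on benchmarks including AIME, AMC, and GSM8k.
Insight: Self-reflection and error correction are effective ways to improve language models, especially when external feedback is limited.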
Abstract: We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
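The trial-and-error stage can be illustrated with a minimal sketch of two pieces: the group-relative advantage that gives GRPO its name (each sampled answer's reward standardized against the other answers to the same question) and an error pool that stores wrong answers for later reflection. The function names, the dictionary layout of the pool, the use of the population standard deviation, and the binary 0/1 reward are assumptions for illustration; the abstract does not specify them, and the policy update itself is not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of answers to the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(reward - mean) / (std + eps) for reward in rewards]


def collect_errors(question, answers, rewards, error_pool):
    """Append every incorrect answer (reward <= 0) with its question to the pool."""
    for answer, reward in zip(answers, rewards):
        if reward <= 0.0:
            error_pool.append({"question": question, "wrong_answer": answer})
    return error_pool


# One question, four sampled answers, binary correctness rewards.
answers = ["12", "15", "13", "12"]
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
pool = collect_errors("What is 5 + 7?", answers, rewards, [])
```

When every answer in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason collecting the failures separately for a second stage is appealing.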
[217] Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles
Fatima Jahara,Mark Dredze,Sharon Levy
Main category: cs.AI
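TL;DR: The paper proposes PRIME, a framework that uses logic grid puzzles to evaluate implicit social bias in language models on complex reasoning tasks, and finds that models are more accurate when the answer matches a stereotype.
Details
Motivation: Existing safety guardrails suppress overtly biased outputs, but implicit bias in complex logical reasoning tasks lacks evaluation benchmarks. PRIME aims to fill this gap.
Contribution: 1. The PRIME framework, which uses logic grid puzzles to systematically evaluate implicit social bias in LLMs; 2. support for automatic generation and verification, with adjustable complexity and bias settings.
Method: Generates stereotypical, anti-stereotypical, and neutral variants from a shared puzzle structure for controlled comparison of model performance, and tests prompt-based mitigation strategies.
Result: Models reason more accurately when solutions align with gender stereotypes, which highlights PRIME's value for diagnosing social bias in LLM deductive reasoning.
Insight: Implicit bias is present in logical reasoning tasks, so evaluation should extend to complex tasks to measure fairness more fully.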
Abstract: While recent safety guardrails effectively suppress overtly biased outputs, subtler forms of social bias emerge during complex logical reasoning tasks that evade current evaluation benchmarks. To fill this gap, we introduce a new evaluation framework, PRIME (Puzzle Reasoning for Implicit Biases in Model Evaluation), that uses logic grid puzzles to systematically probe the influence of social stereotypes on logical reasoning and decision making in LLMs. Our use of logic puzzles enables automatic generation and verification, as well as variability in complexity and biased settings. PRIME includes stereotypical, anti-stereotypical, and neutral puzzle variants generated from a shared puzzle structure, allowing for controlled and fine-grained comparisons. We evaluate multiple model families across puzzle sizes and test the effectiveness of prompt-based mitigation strategies. Focusing our experiments on gender stereotypes, our findings highlight that models consistently reason more accurately when solutions align with stereotypical associations. This demonstrates the significance of PRIME for diagnosing and quantifying social biases perpetuated in the deductive reasoning of LLMs, where fairness is critical.
[218] Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads
Jingwei Ni,Ekaterina Fadeeva,Tianyi Wu,Mubashara Akhtar,Jiaheng Zhang,Elliott Ash,Markus Leippold,Timothy Baldwin,See-Kiong Ng,Artem Shelmanov,Mrinmaya Sachan
Main category: cs.AI
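TL;DR: The paper proposes a lightweight way to verify LLM reasoning: uncertainty quantification heads (UHeads) trained to estimate the uncertainty of reasoning steps from the LLM's internal states. Compared with existing approaches such as PRMs, UHeads are cheaper to compute, need no large-scale annotation, and perform well across multiple domains.
Details
Motivation: Existing verification approaches such as Process Reward Models (PRMs) are computationally expensive, limited to specific domains, or dependent on large-scale annotation. To address this, the authors propose a lightweight verifier based on data-driven uncertainty scores.
Contribution: 1. Lightweight uncertainty quantification heads (UHeads) that estimate the uncertainty of reasoning steps from the LLM's internal states;
2. Fully automatic verification, with target labels generated by a larger LLM or in a self-supervised manner;
3. Experiments showing that UHeads match or surpass PRMs that are up to 810x larger.
Method: 1. Train transformer-based uncertainty quantification heads (UHeads);
2. UHeads use the internal states of a frozen LLM to estimate the uncertainty of its reasoning steps;
3. Target labels are generated by a larger LLM or by self-supervision.
Result: Across domains including mathematics, planning, and general knowledge question answering, UHeads match or surpass the performance of PRMs.
Insight: The internal states of LLMs encode uncertainty and can serve as reliable signals for reasoning verification, pointing toward scalable and generalizable introspective LLMs.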
Abstract: Solving complex tasks usually requires LLMs to generate long multi-step reasoning chains. Previous work has shown that verifying the correctness of individual reasoning steps can further improve the performance and efficiency of LLMs on such tasks and enhance solution interpretability. However, existing verification approaches, such as Process Reward Models (PRMs), are either computationally expensive, limited to specific domains, or require large-scale human or model-generated annotations. Thus, we propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. We train transformer-based uncertainty quantification heads (UHeads) that use the internal states of a frozen LLM to estimate the uncertainty of its reasoning steps during generation. The approach is fully automatic: target labels are generated either by another larger LLM (e.g., DeepSeek R1) or in a self-supervised manner by the original model itself. UHeads are both effective and lightweight, containing less than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, they match or even surpass the performance of PRMs that are up to 810x larger. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification, offering a promising direction toward scalable and generalizable introspective LLMs.
[219] Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
Sen Xu,Yi Zhou,Wei Wang,Jixin Min,Zhibin Yin,Yingwei Dai,Shixi Liu,Lianyu Pang,Yirong Chen,Junlin Zhang
Main category: cs.AI
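TL;DR: The paper introduces VibeThinker-1.5B, a small 1.5B-parameter model trained with a diversity-driven optimization method (the SSP framework) at a cost of only $7,800. It reaches reasoning performance comparable or superior to that of much larger models such as DeepSeek R1 and Claude Opus 4.
Details
Motivation: To challenge the prevailing consensus that small models cannot reason robustly, while avoiding the high cost of improving capability by scaling up parameters.
Contribution: 1. The Spectrum-to-Signal Principle (SSP) framework, consisting of Two-Stage Diversity-Exploring Distillation and MaxEnt-Guided Policy Optimization; 2. a demonstration that a small model (1.5B parameters) trained at low cost can reach the reasoning ability of large models.
Method: 1. Two-Stage Diversity-Exploring Distillation (SFT) generates a broad spectrum of solutions; 2. MaxEnt-Guided Policy Optimization (RL) amplifies the correct signal.
Result: VibeThinker-1.5B surpasses the 400x larger DeepSeek R1 on three math benchmarks (AIME24, AIME25, and HMMT25) and outperforms Magistral Medium on LiveCodeBench V6.
Insight: With diversity-driven optimization, small models can reach reasoning ability comparable to large models at low cost, which could help democratize AI research.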
Abstract: Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). This challenges the prevailing approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs a Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium’s 50.3 and its base model’s 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
[220] LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation
Liya Zhu,Peizhuang Cong,Aowei Ji,Wenya Wu,Jiani Hou,Chunjie Wu,Xiang Gao,Jingkai Liu,Zhou Huan,Xuelei Sun,Yang Yang,Jianpeng Jiao,Liang Hu,Xinjie Chen,Jiashuo Liu,Jingzhe Ding,Tong Yang,Zaiyuan Wang,Ge Zhang,Wenhao Huang
Main category: cs.AI
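TL;DR: The paper proposes LPFQA, a benchmark built on long-tail professional knowledge for evaluating the real capabilities of large language models (LLMs). It addresses the weak coverage of complex real-world applications and long-tail knowledge in existing benchmarks.
Details
Motivation: Existing benchmarks tend to focus on simplified tasks or artificial scenarios and overlook long-tail knowledge and real-world complexity, so they cannot accurately assess the real capabilities of LLMs.
Contribution: 1) The LPFQA benchmark, covering 502 tasks across 20 academic and industrial fields; 2) fine-grained evaluation dimensions targeting knowledge depth, reasoning, terminology comprehension, and contextual analysis; 3) a hierarchical difficulty structure and modeling of realistic user personas; 4) evidence of significant performance differences among 12 mainstream LLMs on LPFQA.
Method: LPFQA is built from authentic professional-forum data around four innovations: fine-grained evaluation dimensions, a hierarchical difficulty design, authentic scenario modeling, and interdisciplinary knowledge integration.
Result: An evaluation of 12 mainstream LLMs on LPFQA shows significant performance disparities, especially on specialized reasoning tasks, demonstrating the benchmark's discriminative power.
Insight: LPFQA offers a more authentic, robust, and discriminative tool for evaluating LLMs and guiding their development, and is particularly suited to testing models in complex, long-tail scenarios.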
Abstract: Large Language Models (LLMs) have made rapid progress in reasoning, question answering, and professional applications; however, their true capabilities remain difficult to evaluate using existing benchmarks. Current datasets often focus on simplified tasks or artificial scenarios, overlooking long-tail knowledge and the complexities of real-world applications. To bridge this gap, we propose LPFQA, a long-tail knowledge-based benchmark derived from authentic professional forums across 20 academic and industrial fields, covering 502 tasks grounded in practical expertise. LPFQA introduces four key innovations: fine-grained evaluation dimensions that target knowledge depth, reasoning, terminology comprehension, and contextual analysis; a hierarchical difficulty structure that ensures semantic clarity and unique answers; authentic professional scenario modeling with realistic user personas; and interdisciplinary knowledge integration across diverse domains. We evaluated 12 mainstream LLMs on LPFQA and observed significant performance disparities, especially in specialized reasoning tasks. LPFQA provides a robust, authentic, and discriminative benchmark for advancing LLM evaluation and guiding future model development.
[221] MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models
Jingyu Hu,Shu Yang,Xilin Gong,Hongming Wang,Weiru Liu,Di Wang
Main category: cs.AI
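TL;DR: MONICA is a framework for real-time monitoring and calibration of sycophancy in chain-of-thought reasoning models. By dynamically monitoring and suppressing sycophantic behavior during reasoning, it improves model reliability.
Details
Motivation: Large Reasoning Models (LRMs) are prone to sycophancy: they tend to agree with users' incorrect beliefs or misinformation rather than reason independently. This behavior undermines model trustworthiness and may pose societal risks. Existing methods focus mainly on correcting final answers and lack an understanding of how sycophancy develops during reasoning.
Contribution: 1. The MONICA framework, which monitors and dynamically calibrates sycophancy at the level of reasoning steps. 2. A sycophancy monitor that computes sycophantic drift scores in real time during reasoning. 3. A calibrator that dynamically suppresses sycophantic behavior when scores exceed a threshold.
Method: MONICA combines a sycophancy monitor with a calibrator. The monitor tracks sycophantic drift scores in real time during response generation, and the calibrator adjusts the reasoning behavior according to those scores. MONICA can intervene without requiring the model to finish generating its complete answer.
Result: Experiments across 12 datasets and 3 LRMs show that MONICA reduces sycophantic behavior in both intermediate reasoning steps and final answers, and improves model robustness.
Insight: 1. Sycophancy appears not only in final answers but also within the reasoning process. 2. Real-time monitoring and calibration can reduce sycophancy while preserving the model's reasoning ability. 3. The framework may extend to other tasks that require independent reasoning.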
Abstract: Large Reasoning Models (LRMs) suffer from sycophantic behavior, where models tend to agree with users’ incorrect beliefs and follow misinformation rather than maintain independent reasoning. This behavior undermines model reliability and poses societal risks. Mitigating LRM sycophancy requires monitoring how this sycophancy emerges during the reasoning trajectory; however, current methods mainly focus on judging based on final answers and correcting them, without understanding how sycophancy develops during reasoning processes. To address this limitation, we propose MONICA, a novel Monitor-guided Calibration framework that monitors and mitigates sycophancy during model inference at the level of reasoning steps, without requiring the model to finish generating its complete answer. MONICA integrates a sycophantic monitor that provides real-time monitoring of sycophantic drift scores during response generation with a calibrator that dynamically suppresses sycophantic behavior when scores exceed predefined thresholds. Extensive experiments across 12 datasets and 3 LRMs demonstrate that our method effectively reduces sycophantic behavior in both intermediate reasoning steps and final answers, yielding robust performance improvements.
[222] Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis
Abhishek More,Anthony Zhang,Nicole Bonilla,Ashvik Vivekan,Kevin Zhu,Parham Sharafoleslami,Maheep Chaudhary
Main category: cs.AI
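TL;DR: The paper proposes EDTR, a decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to improve the confidence calibration of LLMs on complex reasoning tasks.
Details
Motivation: Existing methods estimate confidence poorly, with miscalibration and overconfidence that limit the safe deployment of LLMs in critical tasks.
Contribution: Proposes EDTR, which uses topological and Dirichlet risk analysis to produce more reliable confidence estimates and markedly better calibration.
Method: EDTR treats each CoT as a vector in a high-dimensional space and extracts eight topological risk features that capture the geometric structure of the reasoning distribution, enabling uncertainty quantification.
Result: Across four reasoning benchmarks, EDTR achieves 41% better calibration than competing methods, with an average ECE of 0.287, and attains perfect accuracy on AIME.
Insight: A geometric framework offers a new perspective for understanding and quantifying uncertainty in multi-step LLM reasoning and makes confidence estimates more reliable.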
Abstract: Chain-of-thought (CoT) prompting enables Large Language Models to solve complex problems, but deploying these models safely requires reliable confidence estimates, a capability where existing methods suffer from poor calibration and severe overconfidence on incorrect predictions. We propose Enhanced Dirichlet and Topology Risk (EDTR), a novel decoding strategy that combines topological analysis with Dirichlet-based uncertainty quantification to measure LLM confidence across multiple reasoning paths. EDTR treats each CoT as a vector in high-dimensional space and extracts eight topological risk features capturing the geometric structure of reasoning distributions: tighter, more coherent clusters indicate higher confidence while dispersed, inconsistent paths signal uncertainty. We evaluate EDTR against three state-of-the-art calibration methods across four diverse reasoning benchmarks spanning olympiad-level mathematics (AIME), grade school math (GSM8K), commonsense reasoning, and stock price prediction \cite{zhang2025aime, cobbe2021training, talmor-etal-2019-commonsenseqa, yahoo_finance}. EDTR achieves 41% better calibration than competing methods with an average ECE of 0.287 and the best overall composite score of 0.672, while notably achieving perfect accuracy on AIME and exceptional calibration on GSM8K with an ECE of 0.107, domains where baselines exhibit severe overconfidence. Our work provides a geometric framework for understanding and quantifying uncertainty in multi-step LLM reasoning, enabling more reliable deployment where calibrated confidence estimates are essential.
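The ECE figures quoted above refer to expected calibration error. A minimal sketch of the standard equal-width-bin estimator follows, in case the metric is unfamiliar; the number of bins and any other details of the authors' evaluation are not given in the abstract, so `n_bins=10` and the function name are assumptions. ECE is the bin-size-weighted average gap between mean confidence and accuracy within each bin.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (|bin| / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Four predictions at 90% confidence, three of them correct:
# accuracy is 0.75, so the calibration gap is about 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means stated confidence tracks observed accuracy more closely; an overconfident model shows mean confidence well above accuracy in its high-confidence bins.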
[223] GRAPH-GRPO-LEX: Contract Graph Modeling and Reinforcement Learning with Group Relative Policy Optimization
Moriya Dechtiar,Daniel Martin Katz,Mari Sundaresan,Sylvain Jaume,Hongming Wang
Main category: cs.AI
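TL;DR: The paper proposes GRAPH-GRPO-LEX, a framework that converts legal contracts into structured semantic graphs and combines LLMs with reinforcement learning (using GRPO) to automate contract analysis and review.
Details
Motivation: Legal contracts are structurally complex, and reviewing and analyzing them is tedious and error-prone, so an automated method is needed to simplify the process.
Contribution: 1. A detailed ontology that maps legal contracts to semantic graphs; 2. an entity and relationship extraction framework that combines LLMs with GRPO-based reinforcement learning; 3. a gated GRPO approach that strengthens the learning signal and enables dynamic, visual contract analysis.
Method: 1. Use the ontology to map contract elements to graph nodes and edges; 2. combine LLMs with GRPO-based reinforcement learning, with a reward function built from graph metrics; 3. introduce gated GRPO to strengthen learning.
Result: The method automatically identifies direct relationships between contract clauses as well as hidden dependencies, and moves analysis from linear manual reading to visual graph analysis.
Insight: Graph modeling and reinforcement learning can bring to the legal domain something akin to static analysis (linting) in software engineering, improving automated contract processing.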
Abstract: Contracts are complex documents featuring detailed formal structures, explicit and implicit dependencies and rich semantic content. Given these document properties, contract drafting and manual examination of contracts have proven to be both arduous and susceptible to errors. This work aims to simplify and automate the task of contract review and analysis using a novel framework for transforming legal contracts into structured semantic graphs, enabling computational analysis and data-driven insights. We introduce a detailed ontology mapping core legal contract elements to their graph-theoretic equivalents of nodes and edges. We then present a reinforcement learning based Large Language Model (LLM) framework for segmentation and extraction of entities and relationships from contracts. Our method, GRAPH-GRPO-LEX, incorporates both LLMs and reinforcement learning with group relative policy optimization (GRPO). By applying a carefully drafted reward function of graph metrics, we demonstrate the ability to automatically identify direct relationships between clauses, and even uncover hidden dependencies. Our introduction of the gated GRPO approach shows a strong learning signal and can move contract analysis from a linear, manual reading process to an easily visualized graph. This allows for a more dynamic analysis, including building the groundwork for contract linting similar to what is now practiced in software engineering.
[224] IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
Guoxin Chen,Zile Qiao,Xuanzhong Chen,Donglei Yu,Haotian Xu,Wayne Xin Zhao,Ruihua Song,Wenbiao Yin,Huifeng Yin,Liwen Zhang,Kuan Li,Minpeng Liao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
Main category: cs.AI
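TL;DR: The paper proposes IterResearch, a new paradigm that tackles long-horizon tasks through Markovian state reconstruction and, combined with the EAPO reinforcement learning framework, substantially improves long-horizon reasoning performance and efficiency.
Details
Motivation: Existing methods accumulate everything in a single, ever-expanding context window, which causes context suffocation and noise contamination and limits reasoning on long-horizon tasks. IterResearch aims to address this through iterative state reconstruction and efficient exploration.
Contribution: 1) The IterResearch paradigm, which models long-horizon research as a Markov Decision Process; 2) The EAPO framework, which optimizes exploration efficiency; 3) Substantial performance gains on benchmark tasks.
Method: Uses Markovian state reconstruction with an evolving report as memory, combined with EAPO (geometric reward discounting and adaptive downsampling) to optimize the exploration policy.
Result: An average gain of 14.5 percentage points across six benchmarks; interaction scaling up to 2048 interactions, with performance rising from 3.5% to 42.5%; and, as a prompting strategy, gains of up to 19.2 percentage points over ReAct for frontier models.
Insight: On long-horizon tasks, dynamically reconstructing state and exploring efficiently can markedly improve reasoning and performance; the approach is broadly applicable, both as a trained agent and as a prompting paradigm.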
Abstract: Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
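To see why geometric reward discounting rewards efficient exploration, consider the following minimal sketch. It assumes one plausible form, in which a single terminal reward is propagated back to each round with a factor of gamma per remaining round. The function name, the gamma value, and the exact formula are assumptions for illustration, not details taken from the abstract.

```python
def discounted_round_rewards(final_reward, num_rounds, gamma=0.995):
    """Spread a terminal reward over rounds 1..num_rounds.

    Assumed form: round t receives final_reward * gamma ** (num_rounds - t),
    so earlier rounds of longer trajectories are discounted more heavily.
    """
    return [final_reward * gamma ** (num_rounds - t)
            for t in range(1, num_rounds + 1)]


# Same terminal reward, different trajectory lengths (hypothetical values).
short_run = discounted_round_rewards(1.0, num_rounds=4)
long_run = discounted_round_rewards(1.0, num_rounds=16)
```

Under this assumed form, the first round of the 4-round trajectory earns more than the first round of the 16-round one, so reaching the same answer in fewer interactions is preferred.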
[225] DigiData: Training and Evaluating General-Purpose Mobile Control Agents
Yuxuan Sun,Manchen Wang,Shengyi Qian,William R. Wong,Eric Gan,Pierluca D’Oro,Alejandro Castillejo Munoz,Sneha Silwal,Pedro Matias,Nitin Kamra,Satwik Kottur,Nick Raines,Xuanyi Zhao,Joy Chen,Joseph Greer,Andrea Madotto,Allen Bolourchi,James Valori,Kevin Carlberg,Karl Ridgeway,Joseph Tighe
Main category: cs.AI
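TL;DR: DigiData is a large-scale, high-quality, multi-modal dataset built for training mobile control agents. The paper also presents the DigiData-Bench benchmark, which replaces the conventional step-accuracy metric with dynamic evaluation protocols and AI-powered evaluation.
Details
Motivation: Goals in existing datasets are usually derived from unstructured interactions, which limits diversity and goal complexity. Conventional evaluation (such as step accuracy) also fails to assess agents reliably, so better datasets and evaluation methods are needed.
Contribution: 1. DigiData, a high-quality, diverse, multi-modal dataset; 2. DigiData-Bench, a benchmark that introduces dynamic evaluation protocols and AI-powered evaluation; 3. Evidence of the limitations of the conventional step-accuracy metric.
Method: 1. Construct the DigiData dataset through comprehensive exploration of app features; 2. Propose dynamic evaluation protocols and AI-powered evaluation as alternatives to step accuracy.
Result: DigiData increases dataset diversity and goal complexity, and DigiData-Bench provides a more reliable means of evaluation.
Insight: High-quality datasets and dynamic evaluation are key to advancing mobile control agents; conventional metrics may not reflect real task complexity.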
Abstract: AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment. Our contributions aim to significantly advance the development of mobile control agents, paving the way for more intuitive and effective human-device interactions.
cs.SD
[226] Factual and Musical Evaluation Metrics for Music Language Models
Daniel Chenyu Lin,Michael Freeman,John Thickstun
Main category: cs.SD
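TL;DR: The paper shows that existing evaluation metrics for music language models (BLEU, METEOR, and others) measure only linguistic fluency and do not reflect whether answers are correct. The authors propose an improved metric adapted to the music domain and a factual evaluation framework.
Details
Motivation: Current evaluation metrics for music language models fail to capture answer correctness, so model performance is overestimated. More accurate evaluation methods are needed.
Contribution: 1. A better general-purpose evaluation metric adapted to the music domain. 2. A factual evaluation framework that quantifies the correctness of a music language model's responses.
Method: 1. Adapt existing evaluation metrics to the music domain. 2. Build a factual evaluation framework and validate it experimentally on open datasets.
Result: The authors validate the new evaluation methods on open datasets and will release all code.
Insight: The framework is agnostic to the modality of the question-answering model, so it may generalize to other open-ended question-answering domains beyond music.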
Abstract: Music language models (Music LMs), like vision language models, leverage multimodal representations to answer natural language queries about musical audio recordings. Although Music LMs are reportedly improving, we find that current evaluations fail to capture whether their answers are correct. Specifically, for all Music LMs that we examine, widely-used evaluation metrics such as BLEU, METEOR, and BERTScore fail to measure anything beyond linguistic fluency of the model’s responses. To measure the true performance of Music LMs, we propose (1) a better general-purpose evaluation metric for Music LMs adapted to the music domain and (2) a factual evaluation framework to quantify the correctness of a Music LM’s responses. Our framework is agnostic to the modality of the question-answering model and could be generalized to quantify performance in other open-ended question-answering domains. We use open datasets in our experiments and will release all code on publication.
[227] Persian Musical Instruments Classification Using Polyphonic Data Augmentation
Diba Hadi Esfangereh,Mohammad Hossein Sameti,Sepehr Harfi Moridani,Leili Javidpour,Mahdieh Soleymani Baghshah
Main category: cs.SD
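TL;DR: The paper builds a new dataset for Persian musical instrument classification, proposes a culturally informed data augmentation strategy, and evaluates it with the MERT model, achieving the best performance on real-world polyphonic Persian music.
Details
Motivation: Within music information retrieval and generative music research, instrument classification for non-Western traditions, and Persian music in particular, has received little attention. The paper aims to fill this gap.
Contribution: (1) A new dataset of isolated recordings covering seven traditional Persian instruments, two originally non-Persian instruments (violin and piano), and vocals; (2) A culturally informed data augmentation strategy that generates realistic polyphonic mixtures from monophonic samples; (3) A classification evaluation with the MERT model that demonstrates the method's effectiveness.
Method: The paper generates polyphonic mixtures from monophonic samples with a culturally informed augmentation strategy and evaluates with the MERT model plus a classification head. The test data are out-of-distribution segments of traditional songs, labeled manually.
Result: On real-world polyphonic Persian music, the proposed method achieves the best ROC-AUC (0.795), highlighting the complementary benefits of tonal and temporal coherence.
Insight: Culturally informed data augmentation makes Persian instrument recognition more robust and lays a foundation for culturally inclusive music information retrieval and more diverse music generation systems.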
Abstract: Musical instrument classification is essential for music information retrieval (MIR) and generative music systems. However, research on non-Western traditions, particularly Persian music, remains limited. We address this gap by introducing a new dataset of isolated recordings covering seven traditional Persian instruments, two common but originally non-Persian instruments (i.e., violin, piano), and vocals. We propose a culturally informed data augmentation strategy that generates realistic polyphonic mixtures from monophonic samples. Using the MERT model (Music undERstanding with large-scale self-supervised Training) with a classification head, we evaluate our approach with out-of-distribution data which was obtained by manually labeling segments of traditional songs. On real-world polyphonic Persian music, the proposed method yielded the best ROC-AUC (0.795), highlighting complementary benefits of tonal and temporal coherence. These results demonstrate the effectiveness of culturally grounded augmentation for robust Persian instrument recognition and provide a foundation for culturally inclusive MIR and diverse music generation systems.
cs.CR
[228] MCP-RiskCue: Can LLM infer risk information from MCP server System Logs?
Jiayi Fu,Qiyao Sun
Main category: cs.CR
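TL;DR: The paper presents MCP-RiskCue, the first synthetic benchmark for evaluating how well large language models (LLMs) identify security risks from MCP server system logs, and experimentally compares supervised fine-tuning with reinforcement learning.
Details
Motivation: MCP servers are a standard interface between LLMs and external tools, but they introduce security concerns, particularly when the server is compromised or untrustworthy. Prior work mostly focuses on prompt injection attacks or vulnerabilities in LLM interaction trajectories and overlooks the risks carried in system logs.
Contribution: 1. Defines nine categories of MCP server risk; 2. Generates 1,800 synthetic system logs and embeds them in the return values of 243 MCP servers to form a complete dataset; 3. Shows experimentally that supervised fine-tuning (SFT) tends to produce high false positives, while reinforcement learning (RLVR with GRPO) strikes a better precision-recall balance, with Llama3.1-8B-Instruct performing best.
Method: 1. Generate synthetic system logs with ten state-of-the-art LLMs; 2. Train models with supervised fine-tuning and reinforcement learning (RLVR with GRPO); 3. Evaluate how well the models identify risky logs.
Result: After GRPO training, Llama3.1-8B-Instruct reaches 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Reinforcement learning stands out in improving LLM safety.
Insight: Reinforcement learning (RLVR with GRPO) balances precision and recall better than supervised fine-tuning (SFT), offering a more effective way to improve LLM safety within the MCP framework.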
Abstract: Large language models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. While prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM-MCP interaction trajectories, limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, leading to high false negatives. While models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, resulting in elevated false positives, Reinforcement Learning from Verifiable Reward (RLVR) offers a better precision-recall balance. In particular, after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at: https://github.com/PorUna-byte/MCP-Guard/tree/master
[229] Enhancing Adversarial Robustness of IoT Intrusion Detection via SHAP-Based Attribution Fingerprinting
Dilli Prasad Sharma,Liang Xue,Xiaowei Sun,Xiaodong Lin,Pulei Xiong
Main category: cs.CR
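TL;DR: The paper proposes a SHAP-based attribution fingerprinting method that makes IoT intrusion detection systems (IDS) more robust to adversarial attacks by capturing attribution patterns over network traffic features and using them to distinguish clean inputs from adversarially perturbed ones.
Details
Motivation: The rapid spread of IoT devices has intensified security threats, especially adversarial attacks on AI/ML-based intrusion detection systems, so these systems need better robustness and interpretability.
Contribution: 1. An adversarial detection model based on SHAP fingerprints; 2. Significantly better detection of adversarial attacks against the IDS; 3. Improved model interpretability and transparency.
Method: Use SHAP's DeepExplainer to extract attribution fingerprints from network traffic features, then distinguish adversarially perturbed inputs from normal ones by capturing subtle attribution patterns.
Result: On a standard IoT benchmark dataset, the method significantly outperforms a state-of-the-art method at detecting adversarial attacks while also improving interpretability.
Insight: SHAP's explainability tooling not only improves model transparency but can also strengthen adversarial robustness, suggesting a new direction for security defenses.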
Abstract: The rapid proliferation of Internet of Things (IoT) devices has transformed numerous industries by enabling seamless connectivity and data-driven automation. However, this expansion has also exposed IoT networks to increasingly sophisticated security threats, including adversarial attacks targeting artificial intelligence (AI) and machine learning (ML)-based intrusion detection systems (IDS) to deliberately evade detection, induce misclassification, and systematically undermine the reliability and integrity of security defenses. To address these challenges, we propose a novel adversarial detection model that enhances the robustness of IoT IDS against adversarial attacks through SHapley Additive exPlanations (SHAP)-based fingerprinting. Using SHAP’s DeepExplainer, we extract attribution fingerprints from network traffic features, enabling the IDS to reliably distinguish between clean and adversarially perturbed inputs. By capturing subtle attribution patterns, the model becomes more resilient to evasion attempts and adversarial manipulations. We evaluated the model on a standard IoT benchmark dataset, where it significantly outperformed a state-of-the-art method in detecting adversarial attacks. In addition to enhanced robustness, this approach improves model transparency and interpretability, thereby increasing trust in the IDS through explainable AI.
cs.CY
[230] Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries
Damian Curran,Vanessa Sporne,Lea Frermann,Jeannie Paterson
Main category: cs.CY
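TL;DR: The paper examines how the rate at which large language models (LLMs) hallucinate legal information differs across places. It proposes a place-to-place comparison method based on functionalism and finds that hallucination rate is significantly associated with place and negatively correlated with the frequency of the model's majority response.
Details
Motivation: Quantifying how an LLM's legal knowledge differs by place is essential to understanding how the quality of its legal information is distributed, but legal institutions in different places are hard to compare directly. This work proposes a new methodology to address that.
Contribution: 1. A functionalism-based method for comparing legal information across places; 2. A dataset of legal scenarios drawn from Reddit; 3. Evidence of significant place-to-place differences in LLM legal hallucination rates; 4. A strong negative correlation between hallucination rate and the frequency of the majority response under repeated sampling.
Method: 1. Extract factual scenarios from Reddit posts by users seeking legal advice; 2. For each scenario, elicit a summary of a relevant law from the LLM for Los Angeles, London, and Sydney; 3. Manually evaluate the summaries for hallucinations; 4. Analyze how hallucination rate relates to place and to the results of sampling the model multiple times.
Result: The rate at which leading closed-source LLMs hallucinate legal information differs significantly across places, and it is negatively correlated with the frequency of the majority response.
Insight: The quality of LLM legal information is unevenly distributed geographically, and the frequency of the majority response under repeated sampling may serve as a measure of the model's uncertainty about legal facts. This points to ways of improving LLM use in the legal domain.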
Abstract: How do we make a meaningful comparison of a large language model’s knowledge of the law in one place compared to another? Quantifying these differences is critical to understanding if the quality of the legal information obtained by users of LLM-based chatbots varies depending on their location. However, obtaining meaningful comparative metrics is challenging because legal institutions in different places are not themselves easily comparable. In this work we propose a methodology to obtain place-to-place metrics based on the comparative law concept of functionalism. We construct a dataset of factual scenarios drawn from Reddit posts by users seeking legal advice for family, housing, employment, crime and traffic issues. We use these to elicit a summary of a law from the LLM relevant to each scenario in Los Angeles, London and Sydney. These summaries, typically of a legislative provision, are manually evaluated for hallucinations. We show that the rate of hallucination of legal information by leading closed-source LLMs is significantly associated with place. This suggests that the quality of legal solutions provided by these models is not evenly distributed across geography. Additionally, we show a strong negative correlation between hallucination rate and the frequency of the majority response when the LLM is sampled multiple times, suggesting a measure of uncertainty of model predictions of legal facts.
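The majority-response frequency mentioned in the abstract can be sketched in a few lines. This is an illustration under assumptions: it presumes each sampled answer has already been normalized to a comparable label (for example, the provision it cites), and the function name and sample values are hypothetical rather than taken from the paper.

```python
from collections import Counter


def majority_response_frequency(samples):
    """Fraction of sampled responses that agree with the most common one."""
    if not samples:
        raise ValueError("need at least one sampled response")
    _, top_count = Counter(samples).most_common(1)[0]
    return top_count / len(samples)


# Hypothetical normalized answers from four samples of the same query.
agreement = majority_response_frequency(
    ["s 21", "s 21", "s 44", "s 21"]
)  # 3 of 4 samples agree, so this is 0.75
```

A low value means the model's answers scatter across samples, which, given the negative correlation the authors report, would go along with a higher hallucination rate.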
q-bio.QM
[231] Selective Diabetic Retinopathy Screening with Accuracy-Weighted Deep Ensembles and Entropy-Guided Abstention
Jophy Lin
Main category: q-bio.QM
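TL;DR: This study on diabetic retinopathy (DR) screening proposes a method that combines deep ensemble learning with uncertainty estimation, using a multi-model ensemble and uncertainty filtering to improve diagnostic reliability and performance.
Details
Motivation: Diabetic retinopathy is a preventable cause of blindness, but current diagnostic workflows are costly and resource-intensive, and AI models lack interpretability and uncertainty quantification, which limits their clinical reliability.
Contribution: 1. A deep ensemble learning framework that combines seven CNN architectures with uncertainty estimation; 2. A probability-weighted entropy metric for prediction uncertainty; 3. Accuracy of up to 99.44% after uncertainty filtering.
Method: 1. Build an ensemble of seven CNN architectures, including ResNet-50 and DenseNet-121; 2. Fuse their outputs through accuracy-weighted majority voting; 3. Quantify uncertainty with probability-weighted entropy and filter out low-confidence samples.
Result: On 35,000 EyePACS images, unfiltered accuracy is 93.70% (F1 = 0.9376), and maximum accuracy after filtering reaches 99.44% (F1 = 0.9932).
Insight: Uncertainty-aware ensembling improves reliability and, through a tunable accuracy-coverage trade-off, offers a paradigm for trustworthy AI diagnostics in high-risk care.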
Abstract: Diabetic retinopathy (DR), a microvascular complication of diabetes and a leading cause of preventable blindness, is projected to affect more than 130 million individuals worldwide by 2030. Early identification is essential to reduce irreversible vision loss, yet current diagnostic workflows rely on methods such as fundus photography and expert review, which remain costly and resource-intensive. This, combined with DR's asymptomatic nature, results in an underdiagnosis rate of approximately 25 percent. Although convolutional neural networks (CNNs) have demonstrated strong performance in medical imaging tasks, limited interpretability and the absence of uncertainty quantification restrict clinical reliability. Therefore, in this study, a deep ensemble learning framework integrated with uncertainty estimation is introduced to improve robustness, transparency, and scalability in DR detection. The ensemble incorporates seven CNN architectures (ResNet-50, DenseNet-121, MobileNetV3 Small and Large, and EfficientNet B0, B2, and B3), whose outputs are fused through an accuracy-weighted majority voting strategy. A probability-weighted entropy metric quantifies prediction uncertainty, enabling low-confidence samples to be excluded or flagged for additional review. Training and validation on 35,000 EyePACS retinal fundus images produced an unfiltered accuracy of 93.70 percent (F1 = 0.9376). Uncertainty filtering was then applied to remove low-confidence samples, yielding a maximum accuracy of 99.44 percent (F1 = 0.9932). The framework shows that uncertainty-aware, accuracy-weighted ensembling improves reliability without hindering performance. With confidence-calibrated outputs and a tunable accuracy-coverage trade-off, it offers a generalizable paradigm for deploying trustworthy AI diagnostics in high-risk care.
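The voting-then-abstaining pipeline described in the abstract can be sketched as follows. This is an illustration under stated assumptions: each model casts a hard vote weighted by its validation accuracy, and the Shannon entropy of the normalized vote distribution stands in for the paper's probability-weighted entropy, whose exact definition the abstract does not give. All function names, accuracies, and the threshold are hypothetical.

```python
import math


def weighted_vote(predictions, accuracies, num_classes):
    """Normalized class scores from accuracy-weighted hard votes."""
    scores = [0.0] * num_classes
    for label, acc in zip(predictions, accuracies):
        scores[label] += acc
    total = sum(scores)
    return [s / total for s in scores]


def shannon_entropy(dist):
    """Entropy in nats; stand-in for the paper's uncertainty metric."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def classify_or_abstain(predictions, accuracies, num_classes=2, max_entropy=0.5):
    """Return the winning class, or None to flag the sample for review."""
    dist = weighted_vote(predictions, accuracies, num_classes)
    if shannon_entropy(dist) > max_entropy:
        return None  # too uncertain: exclude or send to an expert
    return max(range(num_classes), key=dist.__getitem__)


# Three hypothetical models with validation accuracies 0.90, 0.92, 0.94.
unanimous = classify_or_abstain([1, 1, 1], [0.90, 0.92, 0.94])
split = classify_or_abstain([0, 1, 1], [0.90, 0.92, 0.94])
```

Raising `max_entropy` keeps more samples at the cost of accuracy on the retained set, which is the accuracy-coverage trade-off the abstract refers to.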
eess.IV
[232] Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines
Yui Tatsumi,Ziyue Zeng,Hiroshi Watanabe
Main category: eess.IV
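TL;DR: The paper proposes a training-free adaptive quantization method for variable-rate image coding for machines. It controls bitrate flexibly by dynamically adjusting the quantization step size and achieves BD-rate savings.
Details
Motivation: Most image coding for machines (ICM) frameworks operate at a fixed rate and require separate training for each target bitrate, which raises computational cost and deployment complexity. The paper addresses this with a training-free adaptive quantization method.
Contribution: A training-free adaptive quantization step size control scheme that adjusts bitrate using channel-wise entropy dependencies and spatial scale parameters while preserving semantically important regions.
Method: Use the channel-wise entropy dependencies and spatial scale parameters predicted by the hyperprior network to adjust the quantization step size dynamically, giving flexible bitrate control without additional training.
Result: Experiments show BD-rate savings of up to 11.07% over the non-adaptive variable-rate method.
Insight: A dynamic quantization strategy can provide flexible bitrate control without extra training cost, which suits practical deployment of image coding for machines.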
Abstract: Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision into real-world applications. However, most ICM frameworks utilize learned image compression (LIC) models that operate at a fixed rate and require separate training for each target bitrate, which may limit their practical applications. Existing variable rate LIC approaches mitigate this limitation but typically depend on training, increasing computational cost and deployment complexity. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free, adaptive quantization step size control scheme that enables flexible bitrate adjustment. By leveraging both channel-wise entropy dependencies and spatial scale parameters predicted by the hyperprior network, the proposed method preserves semantically important regions while coarsely quantizing less critical areas. The bitrate can be continuously controlled through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate method.
[233] EndoIR: Degradation-Agnostic All-in-One Endoscopic Image Restoration via Noise-Aware Routing Diffusion
Tong Chen,Xinyu Ma,Long Bai,Wenyang Wang,Sun Yue,Luping Zhou
Main category: eess.IV
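TL;DR: EndoIR is an all-in-one, diffusion-based endoscopic image restoration framework that handles multiple degradation types with a single model, without prior knowledge of the degradation type, and shows clinical utility.
Details
Motivation: Existing endoscopic image restoration methods are typically task-specific and rely on prior knowledge of the degradation type, which limits their robustness in real clinical settings.
Contribution: 1. EndoIR, an all-in-one diffusion framework that needs no degradation-type prior; 2. A Dual-Domain Prompter and an adaptive embedding module that extract joint spatial-frequency features; 3. A Dual-Stream Diffusion architecture and a Noise-Aware Routing Block that improve efficiency and performance.
Method: EndoIR extracts features with the Dual-Domain Prompter and encodes degradation information with the adaptive embedding module. A Dual-Stream Diffusion architecture processes clean and degraded inputs separately, and a Rectified Fusion Block integrates their features in a structured way. The Noise-Aware Routing Block dynamically selects noise-relevant features to improve efficiency.
Result: On the SegSTRONG-C and CEC datasets, EndoIR outperforms existing methods across multiple degradation scenarios with fewer parameters. Downstream segmentation experiments confirm its clinical utility.
Insight: Degradation-agnostic feature extraction combined with dynamic routing makes it possible to design an efficient all-in-one restoration framework suited to diverse clinical scenarios.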
Abstract: Endoscopic images often suffer from diverse and co-occurring degradations such as low lighting, smoke, and bleeding, which obscure critical clinical details. Existing restoration methods are typically task-specific and often require prior knowledge of the degradation type, limiting their robustness in real-world clinical use. We propose EndoIR, an all-in-one, degradation-agnostic diffusion-based framework that restores multiple degradation types using a single model. EndoIR introduces a Dual-Domain Prompter that extracts joint spatial-frequency features, coupled with an adaptive embedding that encodes both shared and task-specific cues as conditioning for denoising. To mitigate feature confusion in conventional concatenation-based conditioning, we design a Dual-Stream Diffusion architecture that processes clean and degraded inputs separately, with a Rectified Fusion Block integrating them in a structured, degradation-aware manner. Furthermore, Noise-Aware Routing Block improves efficiency by dynamically selecting only noise-relevant features during denoising. Experiments on SegSTRONG-C and CEC datasets demonstrate that EndoIR achieves state-of-the-art performance across multiple degradation scenarios while using fewer parameters than strong baselines, and downstream segmentation experiments confirm its clinical utility.
[234] Cross-Modal Fine-Tuning of 3D Convolutional Foundation Models for ADHD Classification with Low-Rank Adaptation
Jyun-Ping Kao,Shinyeong Rho,Shahar Lazarev,Hyun-Hae Cho,Fangxu Xing,Taehoon Shin,C.-C. Jay Kuo,Jonghye Woo
Main category: eess.IV
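TL;DR: The paper proposes a parameter-efficient transfer learning approach that fine-tunes a large-scale 3D convolutional foundation model with Low-Rank Adaptation (LoRA) in 3D for MRI-based ADHD classification, sharply reducing trainable parameters while achieving strong performance.
Details
Motivation: Early diagnosis of ADHD in children is important for education and mental health outcomes, but diagnosis from neuroimaging data is challenging because of heterogeneous presentations and symptoms that overlap with other conditions.
Contribution: 1. One of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging. 2. A 3D LoRA method that greatly reduces trainable parameters while improving performance.
Method: Factorize 3D convolutional kernels into 2D low-rank updates, introducing a 3D LoRA fine-tuning strategy that reduces trainable parameters.
Result: In five-fold cross-validation on a public diffusion MRI database, one model variant reaches 71.9% accuracy and another an AUC of 0.716, each with only 1.64 million trainable parameters (over 113x fewer than full fine-tuning).
Insight: 3D LoRA offers an efficient route to cross-modal transfer learning and shows the potential of foundation models on small-sample tasks.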
Abstract: Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.
[235] Turbo-DDCM: Fast and Flexible Zero-Shot Diffusion-Based Image Compression
Amit Vaisman,Guy Ohayon,Hila Manor,Michael Elad,Tomer Michaeli
Main category: eess.IV
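TL;DR: Turbo-DDCM is an efficient zero-shot diffusion-based image compression method that runs substantially faster while maintaining performance on par with state-of-the-art techniques.
Details
Motivation: Existing zero-shot diffusion-based compression methods are slow and computationally demanding; Turbo-DDCM aims to fix this.
Contribution: 1. Turbo-DDCM, which significantly reduces the number of denoising operations; 2. An improved encoding protocol; 3. Two flexible variants: a priority-aware version and a distortion-controlled version.
Method: Building on the DDCM framework, combine a large number of noise vectors at each denoising step to reduce the number of operations required.
Result: Turbo-DDCM delivers a substantial speedup at comparable performance, making it a practical and flexible compression scheme.
Insight: Combining noise vectors and improving the encoding can make diffusion-based compression much more efficient without sacrificing performance.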
Abstract: While zero-shot diffusion-based compression methods have seen significant progress in recent years, they remain notoriously slow and computationally demanding. This paper presents an efficient zero-shot diffusion-based compression method that runs substantially faster than existing methods, while maintaining performance that is on par with the state-of-the-art techniques. Our method builds upon the recently proposed Denoising Diffusion Codebook Models (DDCMs) compression scheme. Specifically, DDCM compresses an image by sequentially choosing the diffusion noise vectors from reproducible random codebooks, guiding the denoiser’s output to reconstruct the target image. We modify this framework with Turbo-DDCM, which efficiently combines a large number of noise vectors at each denoising step, thereby significantly reducing the number of required denoising operations. This modification is also coupled with an improved encoding protocol. Furthermore, we introduce two flexible variants of Turbo-DDCM, a priority-aware variant that prioritizes user-specified regions and a distortion-controlled variant that compresses an image based on a target PSNR rather than a target BPP. Comprehensive experiments position Turbo-DDCM as a compelling, practical, and flexible image compression scheme.
[236] Hierarchical Spatial-Frequency Aggregation for Spectral Deconvolution Imaging
Tao Lv,Daoming Zhou,Chenglong Huang,Chongde Zi,Linsen Chen,Xun Cao
Main category: eess.IV
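TL;DR: The paper proposes HSFAUT, a Hierarchical Spatial-Frequency Aggregation Unfolding Transformer for the nonlinear inverse problem in spectral deconvolution imaging (SDI). By decomposing subproblems and projecting them into the frequency domain, it turns nonlinear processes into linear mappings, and it uses a Spatial-Frequency Aggregation Transformer (SFAT) to integrate spatial and frequency-domain priors, improving reconstruction accuracy and efficiency.
Details
Motivation: In SDI, the composite convolution-integration operations make the coefficient matrix scene-dependent, which hampers effective use of priors and limits reconstruction accuracy. A new approach is needed to address this.
Contribution: 1. The Hierarchical Spatial-Spectral Aggregation Unfolding Framework (HSFAUF), which turns the nonlinear problem into linear mappings; 2. The Spatial-Frequency Aggregation Transformer (SFAT), which integrates spatial and frequency-domain priors; 3. The HSFAUT method, which markedly improves SDI reconstruction.
Method: HSFAUT decomposes the problem into subproblems and projects them into the frequency domain, where they become linear mappings; SFAT explicitly aggregates spatial and frequency-domain information; a Transformer-based deep unfolding scheme drives the iterative refinement.
Result: Experiments show that HSFAUT outperforms existing methods on both simulated and real data at lower memory and computational cost, and performs well across different SDI systems.
Insight: Moving to the frequency domain and aggregating spatial and frequency-domain information are key to solving SDI's nonlinear problem; Transformers show strong modeling capacity within deep unfolding methods.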
Abstract: Computational spectral imaging (CSI) achieves real-time hyperspectral imaging through co-designed optics and algorithms, but typical CSI methods suffer from a bulky footprint and limited fidelity. Therefore, Spectral Deconvolution Imaging (SDI) methods based on PSF engineering have recently been proposed to achieve high-fidelity, compact CSI designs. However, the composite convolution-integration operations of SDI render the normal-equation coefficient matrix scene-dependent, which hampers the efficient exploitation of imaging priors and poses challenges for accurate reconstruction. To tackle the inherent data-dependent operators in SDI, we introduce a Hierarchical Spatial-Spectral Aggregation Unfolding Framework (HSFAUF). By decomposing subproblems and projecting them into the frequency domain, HSFAUF transforms nonlinear processes into linear mappings, thereby enabling efficient solutions. Furthermore, to integrate spatial-spectral priors during iterative refinement, we propose a Spatial-Frequency Aggregation Transformer (SFAT), which explicitly aggregates information across spatial and frequency domains. By integrating SFAT into HSFAUF, we develop a Transformer-based deep unfolding method, Hierarchical Spatial-Frequency Aggregation Unfolding Transformer (HSFAUT), to solve the inverse problem of SDI. Systematic simulated and real experiments show that HSFAUT surpasses SOTA methods with cheaper memory and computational costs, while exhibiting optimal performance on different SDI systems.
[237] TauFlow: Dynamic Causal Constraint for Complexity-Adaptive Lightweight Segmentation
Zidong Chen,Fadratul Hafinaz Hassan
Main category: eess.IV
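TL;DR: TauFlow is a lightweight medical image segmentation model that uses a dynamic feature response strategy to address two problems in edge deployment: the contrast between lesion boundaries and background, and the accuracy drop caused by extremely lightweight designs.
Details
Motivation: Deploying lightweight medical image segmentation models on edge devices requires handling the stark contrast between lesion boundaries and background regions efficiently while avoiding the sharp accuracy drop that comes with extremely lightweight designs (for example, fewer than 0.5M parameters).
Contribution: The TauFlow model, which includes a ConvLTC module that dynamically regulates the feature update rate and an STDP Self-Organizing Module that significantly reduces the feature conflict rate between encoder and decoder.
Method: 1) The ConvLTC module adjusts the feature update rate dynamically according to frequency content; 2) The STDP Self-Organizing Module mitigates feature conflicts through a brain-inspired mechanism.
Result: The feature conflict rate drops from approximately 35%-40% to 8%-10%, improving model efficiency and performance.
Insight: Combining dynamic feature regulation with self-organization can balance light weight against accuracy, making the approach suitable for medical image segmentation on edge devices.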
Abstract: Deploying lightweight medical image segmentation models on edge devices presents two major challenges: 1) efficiently handling the stark contrast between lesion boundaries and background regions, and 2) the sharp drop in accuracy that occurs when pursuing extremely lightweight designs (e.g., <0.5M parameters). To address these problems, this paper proposes TauFlow, a novel lightweight segmentation model. The core of TauFlow is a dynamic feature response strategy inspired by brain-like mechanisms. This is achieved through two key innovations: the Convolutional Long-Time Constant Cell (ConvLTC), which dynamically regulates the feature update rate to “slowly” process low-frequency backgrounds and “quickly” respond to high-frequency boundaries; and the STDP Self-Organizing Module, which significantly mitigates feature conflicts between the encoder and decoder, reducing the conflict rate from approximately 35%-40% to 8%-10%.
[238] CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video
Xinyi Wang,Angeliki Katsenou,Junxiao Shen,David Bull
Main category: eess.IV
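TL;DR: CAMP-VQA is a new no-reference video quality assessment framework that combines semantic understanding with multimodal perception to improve quality assessment of compressed video.
Details
Motivation: As user-generated content (UGC) spreads, conventional no-reference video quality assessment (NR-VQA) methods struggle with the effects of non-professional acquisition and transcoding. Existing methods model subjective scores for compressed content poorly because fine-grained annotations are lacking.
Contribution: CAMP-VQA introduces a quality-aware prompting mechanism that combines video metadata with key fragments to generate fine-grained quality captions, and it improves assessment by fusing features along three dimensions: semantic alignment, temporal characteristics, and spatial characteristics.
Method: Generate quality captions with the pretrained BLIP-2 model, extract and fuse multimodal features (semantic, temporal, and spatial) in a unified architecture, and regress them to a video quality score.
Result: On multiple UGC datasets, CAMP-VQA outperforms existing NR-VQA methods, reaching an SRCC of 0.928 and a PLCC of 0.938.
Insight: Combining semantic understanding with multimodal features can make no-reference video quality assessment more accurate while reducing reliance on costly manual annotation.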
Abstract: The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has rendered no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. Nonetheless, the characteristics of non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited due to the absence of fine-grained perceptual annotations of artifact types. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision-language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide the BLIP-2 pretraining approach in generating fine-grained quality captions. A unified architecture has been designed to model perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted and fused, then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, are available at: https://github.com/xinyiW915/CAMP-VQA.
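For readers unfamiliar with the two reported metrics, the sketch below computes PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation) between predicted scores and mean opinion scores using only the standard library. The function names and the sample values are hypothetical, and any nonlinear mapping that VQA evaluations sometimes fit before computing PLCC is omitted.

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(ranks(x), ranks(y))


# Hypothetical mean opinion scores and model predictions for four videos.
mos = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.5, 8.0]
srcc, plcc = spearman(mos, predicted), pearson(mos, predicted)
```

In this toy example the predictions preserve the ordering of the opinion scores, so SRCC is 1 even though the nonlinear scale pulls PLCC below 1, which is why the two are usually reported together.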
cs.RO
[239] Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots
Justin Williams,Kishor Datta Gupta,Roy George,Mrinmoy Sarkar
Main category: cs.RO
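TL;DR: The paper presents Lite VLA, an efficient vision-language-action control scheme that brings real-time scene understanding and reasoning to compute-constrained edge robots.
Details
Motivation: In GPS-denied environments, efficient local reasoning is essential for autonomous robots, yet existing approaches usually separate perception from mobility and cannot meet the demands of dynamic environments.
Contribution: One of the first small vision-language model frameworks to achieve concurrent reasoning and mobility on an edge device, with no reliance on cloud computing.
Method: Integrate a compact vision-language model with multimodal perception and perform contextual interpretation directly on embedded hardware.
Result: Experiments show that the system balances computational efficiency, task accuracy, and responsiveness, and it was validated on a mobile robot.
Insight: The work lays a foundation for scalable autonomy in applications such as service robotics and disaster response.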
Abstract: The deployment of artificial intelligence models at the edge is increasingly critical for autonomous robots operating in GPS-denied environments where local, resource-efficient reasoning is essential. This work demonstrates the feasibility of deploying small Vision-Language Models (VLMs) on mobile robots to achieve real-time scene understanding and reasoning under strict computational constraints. Unlike prior approaches that separate perception from mobility, the proposed framework enables simultaneous movement and reasoning in dynamic environments using only on-board hardware. The system integrates a compact VLM with multimodal perception to perform contextual interpretation directly on embedded hardware, eliminating reliance on cloud connectivity. Experimental validation highlights the balance between computational efficiency, task accuracy, and system responsiveness. Implementation on a mobile robot confirms one of the first successful deployments of small VLMs for concurrent reasoning and mobility at the edge. This work establishes a foundation for scalable, assured autonomy in applications such as service robotics, disaster response, and defense operations.
[240] A Low-Rank Method for Vision Language Model Hallucination Mitigation in Autonomous Driving
Keke Long,Jiacheng Guo,Tianyun Zhang,Hongkai Yu,Xiaopeng Li
Main category: cs.RO
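TL;DR: The paper proposes a low-rank method that automatically ranks candidate captions generated by multiple vision-language models (VLMs) to mitigate hallucinations in autonomous driving scenes. The method uses only the captions themselves, with no external references or model access, and selects the most faithful caption through a low-rank decomposition of a sentence-embedding matrix. On the NuScenes dataset it reaches 87% selection accuracy, clearly ahead of the baseline and other methods.
Details
Motivation: VLMs are used in autonomous driving to understand traffic scenes, but they sometimes hallucinate (produce false descriptions). Detecting and mitigating hallucinations is challenging when ground-truth references are unavailable and model internals are inaccessible.
Contribution: 1. Proposes a low-rank method that ranks candidate captions to mitigate hallucinations without external references or model access. 2. Quantifies the degree of hallucination by the residual magnitude from a low-rank decomposition of the sentence-embedding matrix. 3. Validates the method on the NuScenes dataset, with large gains in selection accuracy and inference efficiency.
Method: 1. Build a sentence-embedding matrix. 2. Decompose the matrix into a low-rank consensus component and a sparse residual. 3. Rank the candidate captions by residual magnitude and select the one with the smallest residual as the most hallucination-free.
Result: On the NuScenes dataset the method reaches 87% selection accuracy, a 19% improvement over the unfiltered baseline and a 6-10% improvement over a multi-agent debate method. Inference time is reduced by 51-67% compared with debate approaches.
Insight: 1. Low-rank decomposition captures the consensus shared across captions, and the residual magnitude quantifies the degree of hallucination. 2. A method that needs no external references is more broadly applicable and suits real-time use. 3. The method offers a new technical route for mitigating VLM hallucinations.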
Abstract: Vision Language Models (VLMs) are increasingly used in autonomous driving to help understand traffic scenes, but they sometimes produce hallucinations, which are false details not grounded in the visual input. Detecting and mitigating hallucinations is challenging when ground-truth references are unavailable and model internals are inaccessible. This paper proposes a novel self-contained low-rank approach to automatically rank multiple candidate captions generated by multiple VLMs based on their hallucination levels, using only the captions themselves without requiring external references or model access. By constructing a sentence-embedding matrix and decomposing it into a low-rank consensus component and a sparse residual, we use the residual magnitude to rank captions, selecting the one with the smallest residual as the most hallucination-free. Experiments on the NuScenes dataset demonstrate that our approach achieves 87% selection accuracy in identifying hallucination-free captions, representing a 19% improvement over the unfiltered baseline and a 6-10% improvement over a multi-agent debate method. The sorting produced by sparse error magnitudes shows strong correlation with human judgments of hallucinations, validating our scoring mechanism. Additionally, our method, which can be easily parallelized, reduces inference time by 51-67% compared to debate approaches, making it practical for real-time autonomous driving applications.
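The ranking idea in the abstract can be illustrated with a small sketch. It makes two simplifying assumptions that are not from the paper: the low-rank plus sparse decomposition is replaced by a plain truncated SVD (the low-rank part serves as the consensus, and whatever is left is treated as the residual), and the embedding rows are hand-written toy vectors rather than outputs of a sentence encoder. The function name `rank_captions_by_residual` is hypothetical.

```python
import numpy as np

def rank_captions_by_residual(embeddings, rank=1):
    """Rank captions by how far each embedding lies from a low-rank consensus.

    Simplification (assumption): a truncated SVD stands in for the
    low-rank + sparse decomposition described in the abstract.
    """
    E = np.asarray(embeddings, dtype=float)           # shape: (captions, dim)
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    consensus = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-r approximation
    residual = np.linalg.norm(E - consensus, axis=1)  # per-caption residual size
    order = np.argsort(residual)                      # smallest residual first
    return order, residual

# Toy embeddings: captions 0-3 roughly agree, caption 4 deviates.
toy_embeddings = [
    [1.00, 0.00, 0.00],
    [1.00, 0.05, 0.00],
    [0.95, 0.00, 0.05],
    [1.00, 0.00, 0.02],
    [0.30, 0.90, 0.30],
]
order, residual = rank_captions_by_residual(toy_embeddings)
selected = int(order[0])       # caption treated as most consistent
least_trusted = int(order[-1]) # caption with the largest residual
```

Because the score depends only on the caption embeddings, each scene can be processed independently, which is consistent with the abstract's remark that the method parallelizes easily.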
[241] SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
Taisei Hanyu,Nhat Chung,Huy Le,Toan Nguyen,Yuki Ikebe,Anthony Gunderman,Duy Nguyen Ho Minh,Khoa Vo,Tung Kieu,Kashu Yamazaki,Chase Rainwater,Anh Nguyen,Ngan Le
Main category: cs.RO
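TL;DR: The paper introduces the SlotVLA framework and the LIBERO+ dataset, using object-centric and object-relation representations to improve the efficiency and interpretability of robotic manipulation.
Details
Motivation: Existing robotic multitask models rely on dense embeddings that entangle object and background information, which hurts efficiency and interpretability. The authors aim to address this with object-centric and object-relation representations.
Contribution: 1. Introduces LIBERO+, a dataset that supports object-relation reasoning; 2. Proposes SlotVLA, a framework that combines slot attention with a relation decoder for efficient, interpretable action generation.
Method: SlotVLA combines a slot-based visual tokenizer (which maintains temporally consistent object representations), a relation decoder (which produces task-relevant embeddings), and an LLM-driven action module.
Result: Experiments show that object-centric and object-relation slot representations drastically reduce the number of visual tokens while retaining competitive generalization.
Insight: Object-centric and object-relation representations offer a more compact, interpretable, and efficient basis for robotic manipulation.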
Abstract: Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object-relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention-based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object-relation-centric robotic manipulation.
[242] Vision-Based System Identification of a Quadrotor
Selim Ahmet Iz,Mustafa Unel
Main category: cs.RO
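TL;DR: The paper studies vision-based system identification for quadrotor modeling and control, addressing the complexities and limitations of quadrotor modeling through experiments and analysis.
Details
Motivation: The thrust and drag coefficients in quadrotor models are uncertain, and conventional modeling struggles to capture the dynamics precisely. A method is needed that reduces these uncertainties and builds the model from real flight data.
Contribution: 1) Applies grey-box modeling to reduce uncertainty; 2) Evaluates the effectiveness of an onboard vision system for system identification; 3) Designs an LQR controller based on the identified model, demonstrating the feasibility of vision-based system identification.
Method: Grey-box modeling is combined with data from the onboard vision system to identify the quadrotor's dynamic model. An LQR controller designed on the identified model is used to check its accuracy.
Result: The experiments show the identified model behaving consistently with actual flight data, supporting the effectiveness of vision-based system identification.
Insight: Vision-based system identification can reduce modeling uncertainty and improve quadrotor control performance, and it opens research directions in fault detection and decision-making.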
Abstract: This paper explores the application of vision-based system identification techniques in quadrotor modeling and control. Through experiments and analysis, we address the complexities and limitations of quadrotor modeling, particularly in relation to thrust and drag coefficients. Grey-box modeling is employed to mitigate uncertainties, and the effectiveness of an onboard vision system is evaluated. An LQR controller is designed based on a system identification model using data from the onboard vision system. The results demonstrate consistent performance between the models, validating the efficacy of vision based system identification. This study highlights the potential of vision-based techniques in enhancing quadrotor modeling and control, contributing to improved performance and operational capabilities. Our findings provide insights into the usability and consistency of these techniques, paving the way for future research in quadrotor performance enhancement, fault detection, and decision-making processes.
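To make the LQR step concrete, the sketch below computes a discrete-time LQR gain by iterating the Riccati recursion. Everything numerical here is an assumption for illustration: the plant is a double integrator for a single axis (position and velocity driven by an acceleration command), not the model identified in the paper, and the weights `Q` and `R` are arbitrary. The helper name `dlqr` is hypothetical.

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Discrete-time LQR gain via fixed-point iteration of the Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        # Gain for the current cost-to-go matrix P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A'PA - A'PB (R + B'PB)^-1 B'PA.
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Assumed single-axis double integrator with a 0.1 s sample time.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
Q = np.eye(2)            # state cost (arbitrary choice)
R = np.array([[1.0]])    # input cost (arbitrary choice)

K, P = dlqr(A, B, Q, R)
closed_loop_poles = np.linalg.eigvals(A - B @ K)  # should lie inside the unit circle
```

With an identified model, the same routine would be applied to the `A` and `B` matrices obtained from system identification; the quality of the resulting closed loop then depends directly on how well those matrices match the real vehicle.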
[243] PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving
Simon Gerstenecker,Andreas Geiger,Katrin Renz
Main category: cs.RO
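TL;DR: PlanT 2.0 is a lightweight planning model for autonomous driving research. Perturbation analysis of its inputs exposes biases and structural flaws in existing models, and the model achieves state-of-the-art performance on CARLA benchmarks.
Details
Motivation: Current autonomous driving research prioritizes benchmark performance and methodological innovation over in-depth analysis of model failures, biases, and shortcut learning.
Contribution: 1. Proposes PlanT 2.0, a planning model built on object-level representations that make perturbation analysis easy; 2. Exposes failure modes related to scene understanding, expert behavior, and trajectory overfitting; 3. Argues for a shift toward data-centric development.
Method: 1. Uses an object-level input representation so that perturbations can be controlled; 2. Targets the scenarios of CARLA Leaderboard 2.0, with multiple upgrades to PlanT to improve performance.
Result: Achieves state-of-the-art performance on Longest6 v2, Bench2Drive, and the CARLA validation routes.
Insight: The analysis finds low obstacle diversity, rigid expert behaviors, and overfitting to a fixed set of expert trajectories, underscoring the need for richer and more robust datasets.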
Abstract: Most recent work in autonomous driving has prioritized benchmark performance and methodological innovation over in-depth analysis of model failures, biases, and shortcut learning. This has led to incremental improvements without a deep understanding of the current failures. While it is straightforward to look at situations where the model fails, it is hard to understand the underlying reason. This motivates us to conduct a systematic study, where inputs to the model are perturbed and the predictions observed. We introduce PlanT 2.0, a lightweight, object-centric planning transformer designed for autonomous driving research in CARLA. The object-level representation enables controlled analysis, as the input can be easily perturbed (e.g., by changing the location or adding or removing certain objects), in contrast to sensor-based models. To tackle the scenarios newly introduced by the challenging CARLA Leaderboard 2.0, we introduce multiple upgrades to PlanT, achieving state-of-the-art performance on Longest6 v2, Bench2Drive, and the CARLA validation routes. Our analysis exposes insightful failures, such as a lack of scene understanding caused by low obstacle diversity, rigid expert behaviors leading to exploitable shortcuts, and overfitting to a fixed set of expert trajectories. Based on these findings, we argue for a shift toward data-centric development, with a focus on richer, more robust, and less biased datasets. We open-source our code and model at https://github.com/autonomousvision/plant2.
[244] Robot Learning from a Physical World Model
Jiageng Mao,Sicheng He,Hao-Ning Wu,Yang You,Shuyang Sun,Zhicheng Wang,Yanan Bao,Huizhong Chen,Leonidas Guibas,Vitor Guizilini,Howard Zhou,Yue Wang
Main category: cs.RO
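TL;DR: PhysWorld is a framework that couples video generation with physical world reconstruction, learning physically accurate robot actions to enable zero-shot generalizable robotic manipulation.
Details
Motivation: Video generation models can synthesize photorealistic visual demonstrations from language commands and images, but retargeting those pixel motions directly to a robot ignores physics and leads to inaccurate manipulation. PhysWorld aims to address this problem.
Contribution: Proposes the PhysWorld framework, which couples video generation with physical world reconstruction and uses object-centric residual reinforcement learning with a physical world model to turn visual guidance into physically executable robot trajectories.
Method: Given a single image and a task command, the method generates task-conditioned videos, reconstructs the underlying physical world, and uses object-centric residual reinforcement learning to ground the video motions into physically accurate actions.
Result: PhysWorld substantially improves manipulation accuracy on diverse real-world tasks compared with previous approaches.
Insight: Combining physical world modeling with video generation enables zero-shot generalizable robotic manipulation without collecting real robot data.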
Abstract: We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage at https://pointscoder.github.io/PhysWorld_Web/ for details.
cs.LG [Back]
[245] Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction
An Vuong,Minh-Hao Van,Prateek Verma,Chen Zhao,Xintao Wu
Main category: cs.LG
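TL;DR: The paper fine-tunes vision-language models (VLMs) through instruction tuning for multimodal polymer property prediction, outperforming unimodal and baseline approaches while lowering deployment and maintenance costs.
Details
Motivation: Existing VLMs have limited effectiveness in domains such as materials science, and foundation models for broad tasks on multimodal data are lacking. The authors therefore fine-tune VLMs to address this gap.
Contribution: 1. Presents a multimodal polymer dataset; 2. Fine-tunes VLMs with LoRA, improving multimodal learning; 3. Reduces the need to train separate models for different properties.
Method: Adapts vision-language models through instruction tuning, using LoRA for the fine-tuning.
Result: The fine-tuned models outperform unimodal and baseline approaches on multimodal polymer property prediction.
Insight: Multimodal learning shows promise in scientific domains and can lower model deployment and maintenance costs.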
Abstract: Vision-Language Models (VLMs) have shown strong performance in tasks like visual question answering and multimodal text generation, but their effectiveness in scientific domains such as materials science remains limited. While some machine learning methods have addressed specific challenges in this field, there is still a lack of foundation models designed for broad tasks like polymer property prediction using multimodal data. In this work, we present a multimodal polymer dataset to fine-tune VLMs through instruction-tuning pairs and assess the impact of multimodality on prediction performance. Our fine-tuned models, using LoRA, outperform unimodal and baseline approaches, demonstrating the benefits of multimodal learning. Additionally, this approach reduces the need to train separate models for different properties, lowering deployment and maintenance costs.
[246] Adapting Web Agents with Synthetic Supervision
Zhaoyang Wang,Yiming Liang,Xuchao Zhang,Qianhui Wu,Siwei Han,Anson Bastos,Rujia Wang,Chetan Bansal,Baolin Peng,Jianfeng Gao,Saravan Rajmohan,Huaxiu Yao
Main category: cs.LG
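TL;DR: SynthAgent improves web agents' adaptation to new websites through dual refinement of synthetic tasks and trajectories, outperforming existing methods.
Details
Motivation: Synthetic data generated by existing methods suffers from quality and executability problems (such as hallucinated tasks and noisy trajectories), which limits web agents' ability to adapt to new websites.
Contribution: Proposes the SynthAgent framework, which improves synthetic data quality through categorized exploration for task synthesis, task-conflict detection, and trajectory refinement with global context.
Method: (1) Categorized exploration generates diverse tasks; (2) conflicts between tasks and actual observations are detected and the tasks are refined; (3) trajectories are denoised using global context; (4) the agent is fine-tuned on the refined data.
Result: Experiments show that SynthAgent outperforms existing synthetic data methods, supporting the value of high-quality synthetic supervision.
Insight: Dual refinement of tasks and trajectories is key to improving synthetic data quality, and it requires combining environment exploration with data post-processing.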
Abstract: Web agents struggle to adapt to new websites due to the scarcity of environment-specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge; however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, we refine tasks when conflicts with actual observations are detected, mitigating hallucinations while maintaining task consistency. After collection, we conduct trajectory refinement with a global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code will be publicly available at https://github.com/aiming-lab/SynthAgent.
[247] Mixtures of SubExperts for Large Language Continual Learning
Haeyong Kang
Main category: cs.LG
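TL;DR: The paper proposes Mixtures of SubExperts (MoSEs), a new parameter-efficient fine-tuning method for continual learning in large language models. A sparse mixture of sub-experts with a task-specific routing mechanism minimizes interference and catastrophic forgetting while enabling efficient parameter growth and knowledge transfer.
Details
Motivation: Large language models in continual learning face catastrophic forgetting and linear parameter growth, and existing methods struggle to balance forgetting against model expansion.
Contribution: Proposes the MoSEs framework, which uses a sparse mixture of sub-experts and an adaptive routing mechanism to achieve low forgetting, efficient parameter growth, and knowledge transfer.
Method: Integrates a sparse mixture of sub-experts into the transformer layers; a task-specific routing mechanism adaptively selects and combines parameters, isolating and protecting knowledge.
Result: On the TRACE benchmark, MoSEs significantly outperform conventional continual learning approaches, reaching state-of-the-art performance with memory and compute savings.
Insight: Combining sparse sub-experts with adaptive routing is an effective way to balance forgetting against expansion in continual learning.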
Abstract: Adapting Large Language Models (LLMs) to a continuous stream of tasks is a critical yet challenging endeavor. While Parameter-Efficient Fine-Tuning (PEFT) methods have become a standard for this, they face a fundamental dilemma in continual learning. Reusing a single set of PEFT parameters for new tasks often leads to catastrophic forgetting of prior knowledge. Conversely, allocating distinct parameters for each task prevents forgetting but results in a linear growth of the model’s size and fails to facilitate knowledge transfer between related tasks. To overcome these limitations, we propose Mixtures of SubExperts (MoSEs), a novel adaptive PEFT method and continual learning framework designed for minimal forgetting and efficient scalability. MoSEs integrate a sparse Mixture of SubExperts into the transformer layers, governed by a task-specific routing mechanism. This architecture allows the model to isolate and protect knowledge within dedicated SubExperts, thereby minimizing parameter interference and catastrophic forgetting. Crucially, the router can adaptively select and combine previously learned sparse parameters for new tasks, enabling effective knowledge transfer while ensuring that the model’s capacity grows sublinearly. We evaluate MoSEs on the comprehensive TRACE benchmark datasets. Our experiments demonstrate that MoSEs significantly outperform conventional continual learning approaches in both knowledge retention and scalability to new tasks, achieving state-of-the-art performance with substantial memory and computational savings.
[248] CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models
Peyman Hosseini,Ondrej Bohdal,Taha Ceritli,Ignacio Castro,Matthew Purver,Mete Ozay,Umberto Michieli
Main category: cs.LG
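TL;DR: The paper proposes context-guided test-time reinforcement learning (CG-TTRL), which dynamically integrates contextual information to improve test-time reinforcement learning, particularly for adapting on-device large language models to complex tasks.
Details
Motivation: Existing test-time reinforcement learning (TTRL) improves model performance through a two-phase sampling strategy but under-utilizes contextual guidance, which limits pseudo-label accuracy and the efficiency of the exploration phase.
Contribution: 1. Proposes CG-TTRL, which integrates context dynamically into both sampling phases; 2. Designs an efficient context selection method suited to on-device applications; 3. Demonstrates performance gains on mathematical and scientific QA tasks.
Method: 1. In the multi-sampling phase, contextual guidance improves pseudo-label accuracy; 2. In the downsampling and reward-based fine-tuning phase, context regulates exploration; 3. An efficient context selection strategy is proposed.
Result: On mathematical and scientific QA tasks, CG-TTRL achieves an additional 7% relative accuracy improvement over TTRL, and after only 3 steps of test-time training it obtains an 8% relative improvement, compared with 1% for TTRL.
Insight: Dynamically integrating context improves pseudo-label accuracy and also gives more efficient guidance during exploration, improving both the performance and the efficiency of test-time reinforcement learning.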
Abstract: Test-time Reinforcement Learning (TTRL) has shown promise in adapting foundation models for complex tasks at test-time, resulting in large performance improvements. TTRL leverages an elegant two-phase sampling strategy: first, multi-sampling derives a pseudo-label via majority voting, while subsequent downsampling and reward-based fine-tuning encourages the model to explore and learn diverse valid solutions, with the pseudo-label modulating the reward signal. Meanwhile, in-context learning has been widely explored at inference time and demonstrated the ability to enhance model performance without weight updates. However, TTRL’s two-phase sampling strategy under-utilizes contextual guidance, which can potentially improve pseudo-label accuracy in the initial exploitation phase while regulating exploration in the second. To address this, we propose context-guided TTRL (CG-TTRL), integrating context dynamically into both sampling phases and propose a method for efficient context selection for on-device applications. Our evaluations on mathematical and scientific QA benchmarks show CG-TTRL outperforms TTRL (e.g. additional 7% relative accuracy improvement over TTRL), while boosting efficiency by obtaining strong performance after only a few steps of test-time training (e.g. 8% relative improvement rather than 1% over TTRL after 3 steps).
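The pseudo-labeling step described in the abstract (a majority vote over sampled answers, which then shapes the reward) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sampled answers are hard-coded strings rather than model outputs, the binary agreement reward is an assumed form, and `majority_vote_rewards` is a hypothetical helper name. Context guidance, the part CG-TTRL adds, is not modeled here.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Derive a pseudo-label by majority vote and score each sample against it.

    Assumption: reward is 1.0 when a sample agrees with the pseudo-label
    and 0.0 otherwise.
    """
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Hypothetical final answers extracted from five sampled responses.
sampled_answers = ["42", "41", "42", "7", "42"]
pseudo_label, rewards = majority_vote_rewards(sampled_answers)
agreement_rate = sum(rewards) / len(rewards)
```

Since no ground-truth label is available at test time, the quality of the reward depends entirely on how often the majority answer is correct, which is why the abstract emphasizes pseudo-label accuracy in the first phase.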
[249] The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models
Xin Qiu,Junlong Tong,Yirong Sun,Yunpu Ma,Xiaoyu Shen
Main category: cs.LG
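TL;DR: The paper reveals a "scaling paradox" in time series models: larger models do not achieve better performance. Analysis of internal representations uncovers "few-layer dominance", and a method that automatically identifies and retains the dominant layers improves both efficiency and accuracy.
Details
Motivation: Time series forecasting widely assumes that scaling up model capacity and data volume improves performance, yet the authors observe a "scaling paradox" in which larger models do not perform better. They investigate the root cause and propose a remedy.
Contribution: 1) Reveals the "scaling paradox" in time series models; 2) Identifies the "few-layer dominance" phenomenon; 3) Proposes a method that automatically identifies and retains the dominant layers, improving model efficiency and accuracy.
Method: Large-scale experiments analyze the models' internal representations to identify the dominant layers; redundant layers are then removed automatically and only the few dominant layers are retained.
Result: In the authors' models, retaining only 21% of the parameters yields up to a 12% accuracy improvement and a 2.7× inference speedup. The method is also validated on 8 prominent SOTA models.
Insight: Time series model performance does not simply improve with scale; what matters is making effective use of a few dominant layers rather than enlarging the model indiscriminately.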
Abstract: Large-scale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a scaling paradox in TS models, revealing a puzzling phenomenon that larger models do NOT achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call few-layer dominance: only a small subset of layers are functionally important, while the majority are redundant, under-utilized, and can even distract training. Based on this discovery, we propose a practical method to automatically identify and retain only these dominant layers. In our models, retaining only 21% of the parameters achieves up to a 12% accuracy improvement and a 2.7× inference speedup. We validate the universality of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing that retaining less than 30% of layers achieves comparable or superior accuracy in over 95% of tasks.
[250] Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
Vaibhav Mavi,Shubh Jaroria,Weiqi Sun
Main category: cs.LG
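TL;DR: The paper proposes a self-evaluating LLM approach for multi-step reasoning tasks that detects errors through stepwise confidence scoring, improving AUC-ROC by up to 15% (relative) over holistic scoring.
Details
Motivation: Reliability and failure detection are critical when LLMs are deployed in high-stakes tasks, but most existing methods are designed for single-step outputs and overlook the complexity of multi-step reasoning.
Contribution: Extends self-evaluation techniques to multi-step tasks and proposes stepwise scoring, which markedly improves error detection.
Method: Two scoring approaches are tested and compared: holistic scoring and step-by-step scoring.
Result: On multi-step benchmark datasets, stepwise scoring gives up to a 15% relative increase in AUC-ROC.
Insight: Stepwise scoring is better suited to multi-step reasoning tasks and provides a practical framework for LLM reliability.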
Abstract: Reliability and failure detection of large language models (LLMs) is critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
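The contrast between holistic and stepwise scoring, and the AUC-ROC metric used to compare them, can be sketched as follows. The aggregation rule (taking the minimum step confidence as the response-level score) is an assumption for illustration and is not stated in the abstract; the confidence values and correctness labels are made up, and the helper names are hypothetical.

```python
def stepwise_confidence(step_scores):
    """Aggregate per-step confidences into one response-level score.

    Assumption: a chain is only as reliable as its weakest step,
    so the minimum step score is used.
    """
    return min(step_scores)

def auc_roc(scores, labels):
    """AUC-ROC as the probability that a correct response (label 1) scores
    higher than an incorrect one (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical per-step confidences for four multi-step responses.
responses = [
    [0.90, 0.95, 0.90],
    [0.90, 0.20, 0.90],
    [0.80, 0.85, 0.90],
    [0.95, 0.90, 0.30],
]
is_correct = [1, 0, 1, 0]  # hypothetical ground-truth correctness

stepwise_scores = [stepwise_confidence(r) for r in responses]
stepwise_auc = auc_roc(stepwise_scores, is_correct)
```

A holistic scorer would instead assign one confidence to the whole response; feeding those scores to the same `auc_roc` function gives a like-for-like comparison of the two approaches.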
[251] MARAuder’s Map: Motion-Aware Real-time Activity Recognition with Layout-Based Trajectories
Zishuai Liu,Weihang You,Jin Lu,Fei Dou
Main category: cs.LG
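TL;DR: The paper proposes MARAuder’s Map, a framework that projects sensor data onto the physical floorplan to generate image-like trajectory sequences and combines a hybrid deep learning model with a learnable time embedding module for real-time human activity recognition.
Details
Motivation: In smart homes, ambient sensor-based human activity recognition (HAR) is challenged by the need for real-time inference and spatial reasoning. Existing methods typically rely on pre-segmented data and overlook the physical layout of the environment, which limits robustness in real deployments.
Contribution: 1. Proposes the MARAuder’s Map framework, which recognizes activities in real time from unsegmented sensor streams. 2. Introduces a trajectory-aware image-sequence representation, processed by a hybrid deep learning model that captures spatial structure and temporal dependencies. 3. Designs a learnable time embedding module and an attention-based encoder that strengthen temporal context and the handling of cross-activity transitions.
Method: 1. Project sensor activations onto the physical floorplan to generate trajectory-aware image sequences. 2. Use a hybrid deep learning model to capture spatial and temporal features. 3. Encode contextual time information (such as hour of day and day of week) with the time embedding module. 4. Use an attention-based encoder to focus selectively on the informative segments.
Result: Experiments on multiple real-world smart home datasets show that MARAuder’s Map outperforms the baselines, offering a practical solution for real-time HAR.
Insight: Combining the physical layout of the environment with temporal context clearly improves activity recognition, especially under cross-activity transitions and temporal ambiguity.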
Abstract: Ambient sensor-based human activity recognition (HAR) in smart homes remains challenging due to the need for real-time inference, spatially grounded reasoning, and context-aware temporal modeling. Existing approaches often rely on pre-segmented, within-activity data and overlook the physical layout of the environment, limiting their robustness in continuous, real-world deployments. In this paper, we propose MARAuder’s Map, a novel framework for real-time activity recognition from raw, unsegmented sensor streams. Our method projects sensor activations onto the physical floorplan to generate trajectory-aware, image-like sequences that capture the spatial flow of human movement. These representations are processed by a hybrid deep learning model that jointly captures spatial structure and temporal dependencies. To enhance temporal awareness, we introduce a learnable time embedding module that encodes contextual cues such as hour-of-day and day-of-week. Additionally, an attention-based encoder selectively focuses on informative segments within each observation window, enabling accurate recognition even under cross-activity transitions and temporal ambiguity. Extensive experiments on multiple real-world smart home datasets demonstrate that our method outperforms strong baselines, offering a practical solution for real-time HAR in ambient sensor environments.
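The floorplan projection described in the abstract can be pictured with a small sketch: each sensor is assigned a cell on a grid laid over the floorplan, and a stream of activations becomes a sequence of sparse frames. All specifics below are assumptions for illustration, including the sensor names, the grid size, and the one-frame-per-event rendering. The fixed sine/cosine hour encoding is a simple stand-in for the learnable time embedding in the paper, not a reproduction of it.

```python
import math

def render_frames(events, sensor_cells, height, width):
    """Turn a stream of (sensor_id, hour) events into image-like grid frames."""
    frames = []
    for sensor_id, _hour in events:
        grid = [[0] * width for _ in range(height)]
        row, col = sensor_cells[sensor_id]  # sensor position on the floorplan grid
        grid[row][col] = 1                  # mark the activated sensor
        frames.append(grid)
    return frames

def cyclic_hour_features(hour):
    """Encode hour-of-day on a circle so that 23:00 and 00:00 are close."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# Hypothetical sensor layout on a 3 x 4 grid and a short event stream.
sensor_cells = {"kitchen": (0, 0), "hall": (1, 1), "bedroom": (2, 3)}
events = [("kitchen", 7.5), ("hall", 7.6), ("bedroom", 23.0)]

frames = render_frames(events, sensor_cells, height=3, width=4)
time_features = [cyclic_hour_features(hour) for _, hour in events]
```

Stacking the frames in order preserves the direction of movement through the home, which is the spatial cue that a flat list of sensor IDs would lose.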
[252] CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems
Mohammad Helal Uddin,Sai Krishna Ghanta,Liam Seymour,Sabur Baidya
Main category: cs.LG
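TL;DR: The paper proposes CAMP-HiVe, a neural network pruning method that uses Hessian-vector approximation and cyclic pair merging to substantially reduce computational requirements while preserving model performance, targeting resource-constrained systems.
Details
Motivation: As deep learning is increasingly deployed on resource-constrained systems, the need for efficient model compression grows; pruning in particular can compress models efficiently while preserving performance.
Contribution: 1. Proposes a pruning method based on Hessian-vector approximation; 2. Introduces cyclic pair merging, which dynamically adjusts weight significance; 3. Validates the method on multiple benchmark datasets and models.
Method: Combines Hessian-vector approximation with cyclic pair merging, using power iteration to identify and preserve essential information and dynamically optimizing the model parameters.
Result: Experiments show that the method substantially reduces computational requirements on models such as ResNet18, ResNet56, and MobileNetv2 while maintaining high performance, outperforming existing pruning methods.
Insight: Hessian-vector approximation and dynamic weight merging are key to efficient pruning and suit resource-constrained settings.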
Abstract: Deep learning algorithms are becoming an essential component of many artificial intelligence (AI) driven applications, many of which run on resource-constrained and energy-constrained systems. Although different techniques for compressing neural network models have been proposed for efficient deployment of these algorithms, neural pruning is one of the fastest and most effective methods, providing a high compression gain at minimal cost. To harness enhanced performance gain with respect to model complexity, we propose a novel neural network pruning approach utilizing Hessian-vector products that approximate crucial curvature information in the loss function, which significantly reduces the computation demands. By employing a power iteration method, our algorithm effectively identifies and preserves the essential information, ensuring a balanced trade-off between model accuracy and computational efficiency. Herein, we introduce CAMP-HiVe, a cyclic pair merging-based pruning method with Hessian-vector approximation that iteratively consolidates weight pairs, combining significant and less significant weights, thus effectively streamlining the model while preserving its performance. This dynamic, adaptive framework allows for real-time adjustment of weight significance, ensuring that only the most critical parameters are retained. Our experimental results demonstrate that our proposed method achieves significant reductions in computational requirements while maintaining high performance across different neural network architectures, e.g., ResNet18, ResNet56, and MobileNetv2, on standard benchmark datasets, e.g., CIFAR10, CIFAR-100, and ImageNet, and it outperforms the existing state-of-the-art neural pruning methods.
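The abstract combines two standard numerical tools: Hessian-vector products, which give curvature information without forming the Hessian, and power iteration. The sketch below shows only that building block, estimating the largest Hessian eigenvalue of a toy loss. It is not the CAMP-HiVe algorithm: the cyclic pair merging step is not modeled, the quadratic loss and its matrix `H_true` are assumptions chosen so the answer can be checked, and the central-difference product is one common approximation among several.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate the Hessian-vector product H(w) @ v by central differences
    of the gradient, without ever building the Hessian."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def top_hessian_eigenvalue(grad_fn, w, iters=100):
    """Power iteration on Hessian-vector products to estimate the largest
    (in magnitude) Hessian eigenvalue and its direction."""
    v = np.arange(1.0, w.size + 1.0)   # deterministic starting direction
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        eigenvalue = float(v @ hv)     # Rayleigh quotient for the current v
        v = hv / np.linalg.norm(hv)
    return eigenvalue, v

# Toy loss 0.5 * w^T H w, whose gradient is H @ w (assumed for illustration).
H_true = np.array([[4.0, 1.0], [1.0, 3.0]])
grad_fn = lambda w: H_true @ w

w0 = np.zeros(2)
estimate, direction = top_hessian_eigenvalue(grad_fn, w0)
reference = float(np.linalg.eigvalsh(H_true).max())
```

Each iteration costs two gradient evaluations, so the curvature estimate scales with the cost of backpropagation rather than with the square of the parameter count.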
[253] Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis
Yash Mittal,Dmitry Ignatov,Radu Timofte
Main category: cs.LG
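TL;DR: The paper proposes FractalNet, a fractal-inspired computational architecture intended to address model diversity at large scale efficiently. A template generator, runner, and evaluation framework produce more than 1,200 neural network variants, showing that fractal design is a feasible and efficient route to automated architecture exploration.
Details
Motivation: To address insufficient model diversity in large language model analysis while keeping computation efficient.
Contribution: Proposes FractalNet, a fractal-structured computational architecture framework that efficiently generates and analyzes a large number of neural network variants, and validates its performance and efficiency experimentally.
Method: A template-driven approach systematically permutes convolutional, normalization, activation, and dropout layers to generate more than 1,200 variants. Training uses PyTorch, Automatic Mixed Precision (AMP), and gradient checkpointing.
Result: The experiments show fractal-based architectures achieving strong performance on the CIFAR-10 dataset while being computationally efficient.
Insight: Fractal design can serve as a feasible, resource-efficient method for automated architecture exploration and offers a new approach to large-scale model diversity.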
Abstract: This paper introduces FractalNet, a fractal-inspired computational architecture for advanced large language model analysis that tackles the challenge of model diversity at large scale in an efficient manner. The new setup involves a template-driven generator, runner, and evaluation framework that, through systematic permutations of convolutional, normalization, activation, and dropout layers, can create more than 1,200 variants of neural networks. Fractal templates allow for structural recursion and multi-column pathways, so models become deeper and wider in a balanced way. Training utilizes PyTorch, Automatic Mixed Precision (AMP), and gradient checkpointing and is carried out on the CIFAR-10 dataset for five epochs. The outcomes show that fractal-based architectures are capable of strong performance and are computationally efficient. The paper positions fractal design as a feasible and resource-efficient method of automated architecture exploration.
stat.ML [Back]
[254] Language Generation with Infinite Contamination
Anay Mehrotra,Grigoris Velegkas,Xifan Yu,Felix Zhou
Main category: stat.ML
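TL;DR: The paper studies the robustness of language generation in the limit, examining how generators behave when the data is contaminated (by noise and omissions), and introduces a new model inspired by curriculum learning.
Details
Motivation: Prior work assumes perfect data and ignores the contamination found in practice (such as noise and omissions). This paper aims to fill that gap by studying when language generation remains achievable under contamination.
Contribution: 1. Characterizes when language generation is achievable under contaminated enumerations; 2. Compares the robustness of generation and dense generation; 3. Resolves an open question of Raman and Raman; 4. Introduces a curriculum-learning-inspired model and proves that dense generation is achievable under infinite contamination.
Method: Theoretical analysis of the achievability of language generation under different contamination conditions, together with a model inspired by curriculum learning.
Result: 1. Generation is achievable for all countable collections if and only if the fraction of contaminated examples converges to zero; 2. Dense generation is strictly less robust to contamination; 3. With finitely many contaminated examples, generation is possible with only membership oracle access; 4. In the curriculum-learning-inspired model, dense generation is achievable even with infinite contamination, provided the fraction of contaminated examples converges to zero.
Insight: Curriculum learning may be crucial for learning from noisy data, and the results give new theoretical support for the robustness of language generation.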
Abstract: We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language $K$ and must eventually generate new, unseen strings from $K$. Kleinberg and Mullainathan [KM24] proved that generation is achievable in surprisingly general settings. But their generator suffers from "mode collapse," producing from an ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25] require the generator's output to be "dense" in the target language. They showed that generation with density, surprisingly, remains achievable at the same generality. Both results assume perfect data: no noisy insertions and no omissions. This raises a central question: how much contamination can generation tolerate? Recent works made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) or omissions (but no noise). We characterize robustness under contaminated enumerations: 1. Generation under Contamination: Language generation in the limit is achievable for all countable collections iff the fraction of contaminated examples converges to zero. When this fails, we characterize which collections are generable. 2. Dense Generation under Contamination: Dense generation is strictly less robust to contamination than generation. As a byproduct, we resolve an open question of Raman and Raman [ICML25] by showing that generation is possible with only membership oracle access under finitely many contaminated examples. Finally, we introduce a beyond-worst-case model inspired by curriculum learning and prove that dense generation is achievable even with infinite contamination provided the fraction of contaminated examples converges to zero. This suggests curriculum learning may be crucial for learning from noisy web data.
eess.AS [Back]
[255] Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
Umberto Cappellazzo,Xubo Liu,Pingchuan Ma,Stavros Petridis,Maja Pantic
Main category: eess.AS
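TL;DR: Omni-AVSR is a unified multimodal speech recognition framework that uses efficient LLM training and parameter-efficient adaptation to support ASR, VSR, and AVSR, balancing resource use against performance.
Details
Motivation: Current LLM-based approaches train separate models for each speech recognition task (ASR, VSR, AVSR), which wastes resources and misses cross-task synergies. Omni-AVSR aims to provide a unified framework that integrates the tasks and supports elastic inference.
Contribution: 1. Proposes Omni-AVSR, which unifies ASR, VSR, and AVSR in a single LLM; 2. Uses multi-granularity training and parameter-efficient adaptation to reduce resource use; 3. Shows experimentally that accuracy is comparable or superior to the baselines at lower resource use.
Method: 1. Adapts the matryoshka representation learning paradigm to train across multiple audio and visual granularities; 2. Explores three strategies based on LoRA (low-rank adaptation) that balance shared and task-specific parameters.
Result: On LRS2 and LRS3, Omni-AVSR matches or exceeds state-of-the-art baselines at substantially lower training and deployment resource use. The model is robust under acoustic noise, and the paper analyzes the performance-efficiency trade-off as LLM size increases.
Insight: 1. Unifying multimodal tasks improves resource utilization and cross-task synergy; 2. Parameter-efficient adaptation techniques such as LoRA are effective and practical for LLMs; 3. Multi-granularity learning offers a new way to balance performance against efficiency.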
Abstract: Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training resource use. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model at substantially lower training and deployment resource use. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
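To illustrate what training "across multiple audio and visual granularities" can look like, the sketch below compresses one token sequence at several rates, so that a single model can be exposed to coarse and fine versions of the same input. Average pooling as the compression operator, the specific rates, and the helper names are all assumptions for illustration; the abstract does not specify how Omni-AVSR compresses tokens.

```python
def avg_pool_tokens(tokens, rate):
    """Compress a token sequence by averaging each group of `rate` tokens."""
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def multi_granularity(tokens, rates):
    """Return one compressed view of the same sequence per compression rate."""
    return {rate: avg_pool_tokens(tokens, rate) for rate in rates}

# Hypothetical 2-dimensional feature tokens for a short clip.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
views = multi_granularity(tokens, rates=(1, 2, 4))
lengths = {rate: len(view) for rate, view in views.items()}
```

At inference time, choosing a higher rate shortens the sequence passed to the LLM, which is the kind of accuracy versus efficiency trade-off the abstract refers to as elastic inference.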
q-bio.NC [Back]
[256] Approximating the Mathematical Structure of Psychodynamics
Bryce-Allen Bagley,Navin Khoshnan
Main category: q-bio.NC
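TL;DR: The paper formalizes human psychodynamics using the diagrammatic framework of process theory, aiming to provide a precise mathematical formulation for psychology and AI safety.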
Details
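Motivation: Psychology has long relied on theoretical and conceptual models and lacks a precise mathematical formulation. The paper aims to fill this gap and provide quantitative tools for cognitive research in psychology, psychiatry, and AI safety.
Contribution: Proposes a diagrammatic framework based on process theory that formalizes human psychodynamics as a mathematical structure, and explains its links to and applications in multiple fields.
Method: Uses the diagrammatic methods of process theory to formalize the key properties of psychodynamics and connect them to central concepts in the analysis of cognitive processes.
Result: Provides a mathematical formulation of psychodynamics intended to support applications in psychotherapy, neurotechnology, AI alignment, and related areas.
Insight: A diagrammatic framework offers a common language for interdisciplinary research and may help bring psychology and AI closer together.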
Abstract: The complexity of human cognition has meant that psychology makes more use of theory and conceptual models than perhaps any other biomedical field. To enable precise quantitative study of the full breadth of phenomena in psychological and psychiatric medicine as well as cognitive aspects of AI safety, there is a need for a mathematical formulation which is both mathematically precise and equally accessible to experts from numerous fields. In this paper we formalize human psychodynamics via the diagrammatic framework of process theory, describe its key properties, and explain the links between a diagrammatic representation and central concepts in analysis of cognitive processes in contexts such as psychotherapy, neurotechnology, AI alignment, AI agent representation of individuals in autonomous negotiations, developing human-like AI systems, and other aspects of AI safety.
[257] ConnectomeBench: Can LLMs Proofread the Connectome?
Jeff Brown,Andrew Kirjner,Annika Vivekananthan,Ed Boyden
Main category: q-bio.NC
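TL;DR: The paper introduces ConnectomeBench, a multimodal benchmark that evaluates large language models (LLMs) on connectome proofreading tasks: segment type identification, split error correction, and merge error detection.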
Details
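Motivation: Proofreading connectomics data requires extensive human effort. The authors explore whether AI systems can automate this scientific task and thereby improve efficiency.
Contribution: Designs the multimodal benchmark ConnectomeBench and evaluates a range of proprietary and open-source LLMs on connectome proofreading tasks.
Method: Tests LLMs on three proofreading tasks using expert-annotated data from mouse visual cortex and the Drosophila brain.
Result: Current models perform well on segment identification (52-82% balanced accuracy vs. 20-25% chance) and split error correction (75-85% accuracy vs. 50% chance), but generally struggle on merge error detection.
Insight: Although the best models still lag behind expert performance, their results suggest that LLMs could eventually augment and potentially replace human proofreading.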
Abstract: Connectomics - the mapping of neural connections in an organism’s brain - currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert annotated data from two large open-source datasets - a cubic millimeter of mouse visual cortex and the complete Drosophila brain - we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, GPT-4o, as well as open source models like InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82% balanced accuracy vs. 20-25% chance) and binary/multiple choice split error correction (75-85% accuracy vs. 50% chance) while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics. Project page: https://github.com/jffbrwn2/ConnectomeBench and Dataset https://huggingface.co/datasets/jeffbbrown2/ConnectomeBench/tree/main
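The segment identification results are reported as balanced accuracy against a chance level. As a minimal sketch of how those two quantities are commonly defined (mean per-class recall, and 1/K for uniform guessing over K classes), consider the following; the class labels and predictions are invented for illustration and are not taken from the benchmark.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def chance_level(num_classes: int) -> float:
    """Expected balanced accuracy of uniform random guessing."""
    return 1.0 / num_classes


# Hypothetical imbalanced example: four "a" segments, two "b" segments.
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]
print(balanced_accuracy(y_true, y_pred))  # recall(a)=0.75, recall(b)=0.5
print(chance_level(4), chance_level(5))
```

With this definition, the quoted 20-25% chance range corresponds to uniform guessing over five or four classes, and the 50% figure to a binary choice.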
cs.HC [Back]
[258] Pinching Visuo-haptic Display: Investigating Cross-Modal Effects of Visual Textures on Electrostatic Cloth Tactile Sensations
Takekazu Kitagishi,Chun-Wei Ooi,Yuichi Hiroi,Jun Rekimoto
Main category: cs.HC
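TL;DR: The paper studies how visual texture presentation affects tactile perception on an electrostatic cloth display. It proposes a visuo-haptic system and shows through a user study that visual roughness amplifies perceived tactile friction across modalities.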
Details
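Motivation: To study how visual textures influence tactile perception, particularly in electrostatic haptic feedback systems, which matters for the design of virtual material interfaces.
Contribution: Shows that visual roughness amplifies perceived tactile friction across modalities, offers new insight into multimodal texture perception, and provides design guidance for haptic feedback in virtual material interfaces.
Method: Proposes a visuo-haptic system in which users pinch and rub virtual fabrics while feeling electrostatically actuated haptic feedback, and analyzes the effect of visual textures on tactile perception through a user study.
Result: Visually rough textures amplify the perceived frictional force even when the electrostatic stimulus is identical.
Insight: Visual and tactile perception interact across modalities, so virtual material interfaces can use visual feedback to enhance the haptic experience.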
Abstract: This paper investigates how visual texture presentation influences tactile perception when interacting with electrostatic cloth displays. We propose a visuo-haptic system that allows users to pinch and rub virtual fabrics while feeling realistic frictional sensations modulated by electrostatic actuation. Through a user study, we examined the cross-modal effects between visual roughness and perceived tactile friction. The results demonstrate that visually rough textures amplify the perceived frictional force, even under identical electrostatic stimuli. These findings contribute to the understanding of multimodal texture perception and provide design insights for haptic feedback in virtual material interfaces.
[259] Achieving Effective Virtual Reality Interactions via Acoustic Gesture Recognition based on Large Language Models
Xijie Zhang,Fengliang He,Hong-Ning Dai
Main category: cs.HC
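TL;DR: The paper proposes an acoustic gesture recognition framework based on large language models (LLMs) that avoids the high computational cost, lighting sensitivity, and privacy concerns of vision-based gesture recognition, with the goal of efficient interaction in VR/AR systems.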
Details
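Motivation: Natural and efficient interaction is a key challenge for VR/AR systems. Vision-based gesture recognition suffers from high computational cost, sensitivity to lighting, and privacy leakage. Acoustic sensing is a low-cost, user-transparent alternative, but existing methods depend on training with large labeled datasets, which makes them unsuitable for few-shot VR scenarios.
Contribution: 1. Proposes the first framework that uses LLMs for gesture recognition based on channel impulse response (CIR); 2. Uses differential CIR rather than original CIR data to address inconspicuous features; 3. Constructs a real-world dataset covering multiple gesture categories.
Method: 1. Collects differential CIR data; 2. Builds a classifier using an LLM; 3. Runs experiments on a dataset of 15 gestures performed by 10 participants.
Result: The LLM-based framework achieves accuracy comparable to classical machine learning baselines without domain-specific retraining.
Insight: LLMs can be applied to few-shot and zero-shot learning problems; when features are inconspicuous, extracting differential data is key.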
Abstract: Natural and efficient interaction remains a critical challenge for virtual reality and augmented reality (VR/AR) systems. Vision-based gesture recognition suffers from high computational cost, sensitivity to lighting conditions, and privacy leakage concerns. Acoustic sensing provides an attractive alternative: by emitting inaudible high-frequency signals and capturing their reflections, channel impulse response (CIR) encodes how gestures perturb the acoustic field in a low-cost and user-transparent manner. However, existing CIR-based gesture recognition methods often rely on extensive training of models on large labeled datasets, making them unsuitable for few-shot VR scenarios. In this work, we propose the first framework that leverages large language models (LLMs) for CIR-based gesture recognition in VR/AR systems. Despite LLMs’ strengths, it is non-trivial to achieve few-shot and zero-shot learning of CIR gestures due to their inconspicuous features. To tackle this challenge, we collect differential CIR rather than original CIR data. Moreover, we construct a real-world dataset collected from 10 participants performing 15 gestures across three categories (digits, letters, and shapes), with 10 repetitions each. We then conduct extensive experiments on this dataset using an LLM-adopted classifier. Results show that our LLM-based framework achieves accuracy comparable to classical machine learning baselines, while requiring no domain-specific retraining.
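The abstract does not define how differential CIR is computed. One plausible reading, sketched below purely as an assumption, is frame-to-frame differencing: subtracting each CIR frame from the next cancels static reflections and leaves only the changes caused by a moving hand. The function name `differential_cir` and the toy tap values are hypothetical.

```python
from typing import List

Frame = List[float]


def differential_cir(frames: List[Frame]) -> List[Frame]:
    """Return the tap-wise difference between consecutive CIR frames.

    Assumption: "differential CIR" means frame-to-frame differencing;
    the paper may use a different definition.
    """
    if len(frames) < 2:
        return []
    taps = len(frames[0])
    if any(len(frame) != taps for frame in frames):
        raise ValueError("all CIR frames must have the same number of taps")
    return [
        [cur[k] - prev[k] for k in range(taps)]
        for prev, cur in zip(frames, frames[1:])
    ]


# Toy example: a static environment, then a perturbation at tap 1 in frame 3.
frames = [
    [1.0, 0.5, 0.25],
    [1.0, 0.5, 0.25],
    [1.0, 0.75, 0.25],
]
for diff in differential_cir(frames):
    print(diff)
```

In this toy case the first difference frame is all zeros because nothing moved, while the second isolates the change at tap 1, which illustrates why differencing could make otherwise inconspicuous gesture features easier to classify.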