Table of Contents
- cs.CL [Total: 12]
- cs.CV [Total: 48]
- cs.LG [Total: 3]
- cs.SI [Total: 1]
- cs.SD [Total: 1]
- eess.IV [Total: 4]
- cs.RO [Total: 5]
cs.CL
[1] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs
Debdeep Sanyal,Manodeep Ray,Murari Mandal
Main category: cs.CL
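TL;DR: This paper proposes AntiDote, a bi-level adversarial training method that makes open-weight large language models (LLMs) resistant to malicious fine-tuning while preserving their general capabilities.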
Details
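Motivation: Releasing open-weight LLMs creates a tension between accessible research and potential misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve a model's general capabilities while resisting an attacker with full access to its weights and architecture.
Contribution: AntiDote improves tamper resistance through bi-level optimization. It introduces an auxiliary adversarial hypernetwork that generates malicious LoRA weights and trains the model to withstand them.
Method: A bi-level optimization strategy in which a hypernetwork generates malicious LoRA weights and the defender model is trained with an objective that nullifies their effect.
Result: Against a suite of 52 red-teaming attacks, AntiDote is up to 27.4% more robust than baseline methods, with less than 0.5% degradation on capability benchmarks such as MMLU and HellaSwag.
Insight: AntiDote shows how more resilient safety can be built into open-weight models with almost no loss of utility, offering a compute-efficient approach for safety research.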
Abstract: The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model’s weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model’s internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks compared to both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.
[2] NOWJ@COLIEE 2025: A Multi-stage Framework Integrating Embedding Models and Large Language Models for Legal Retrieval and Entailment
Hoang-Trung Nguyen,Tan-Minh Nguyen,Xuan-Bach Le,Tuan-Kiet Le,Khanh-Huyen Nguyen,Ha-Thanh Nguyen,Thi-Hai-Yen Vuong,Le-Minh Nguyen
Main category: cs.CL
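TL;DR: For the COLIEE 2025 competition, the NOWJ team proposes a multi-stage framework that combines embedding models and large language models (LLMs) for legal retrieval and entailment, taking first place in the Legal Case Entailment task.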
Details
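Motivation: To address retrieval and entailment challenges in legal information processing by combining the strengths of traditional information retrieval techniques and modern generative models.
Contribution: A multi-stage framework that integrates pre-ranking models (BM25, BERT, monoT5), embedding models (BGE-m3, LLM2Vec), and LLMs such as Qwen-2 to handle legal tasks efficiently.
Method: For Legal Case Entailment, a two-stage retrieval system: (1) lexical-semantic filtering, then (2) contextualized LLM analysis. For the other tasks, ensemble strategies and prompt-based reasoning deliver robust performance.
Result: First place in the Legal Case Entailment task with an F1 score of 0.3195, with robust performance on the other tasks.
Insight: Hybrid models that combine traditional IR techniques with generative models show promise for legal information processing and provide a reference for future work.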
Abstract: This paper presents the methodologies and results of the NOWJ team’s participation across all five tasks at the COLIEE 2025 competition, emphasizing advancements in the Legal Case Entailment task (Task 2). Our comprehensive approach systematically integrates pre-ranking models (BM25, BERT, monoT5), embedding-based semantic representations (BGE-m3, LLM2Vec), and advanced Large Language Models (Qwen-2, QwQ-32B, DeepSeek-V3) for summarization, relevance scoring, and contextual re-ranking. Specifically, in Task 2, our two-stage retrieval system combined lexical-semantic filtering with contextualized LLM analysis, achieving first place with an F1 score of 0.3195. Additionally, in other tasks–including Legal Case Retrieval, Statute Law Retrieval, Legal Textual Entailment, and Legal Judgment Prediction–we demonstrated robust performance through carefully engineered ensembles and effective prompt-based reasoning strategies. Our findings highlight the potential of hybrid models integrating traditional IR techniques with contemporary generative models, providing a valuable reference for future advancements in legal information processing.
[3] SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery
Fengyu She,Nan Wang,Hongfei Wu,Ziyi Wan,Jingmian Wang,Chang Wang
Main category: cs.CL
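TL;DR: SciGPT is a large language model for scientific literature understanding that uses domain adaptation and a new attention mechanism to outperform GPT-4o on scientific tasks.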
Details
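Motivation: The rapid growth of scientific literature makes it hard for researchers to extract knowledge efficiently, and general-purpose LLMs struggle with the technical detail and complex tasks of scientific domains.
Contribution: (1) The SciGPT model and the ScienceBench benchmark; (2) a low-cost domain distillation pipeline; (3) a Sparse Mixture-of-Experts (SMoE) attention mechanism; (4) knowledge-aware adaptation that incorporates domain ontologies.
Method: (1) Two-stage domain distillation to balance performance and efficiency; (2) SMoE attention to reduce memory consumption; (3) knowledge-aware adaptation that integrates domain ontologies.
Result: On ScienceBench, SciGPT outperforms GPT-4o on sequence labeling, generation, and inference tasks, and is robust on unseen scientific tasks.
Insight: Domain adaptation and expert attention mechanisms can substantially improve LLM performance on scientific tasks, opening new possibilities for AI-assisted scientific discovery.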
Abstract: Scientific literature is growing exponentially, creating a critical bottleneck for researchers to efficiently synthesize knowledge. While general-purpose Large Language Models (LLMs) show potential in text processing, they often fail to capture scientific domain-specific nuances (e.g., technical jargon, methodological rigor) and struggle with complex scientific tasks, limiting their utility for interdisciplinary research. To address these gaps, this paper presents SciGPT, a domain-adapted foundation model for scientific literature understanding and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55% for 32,000-token long-document reasoning; and (3) knowledge-aware adaptation integrating domain ontologies to bridge interdisciplinary knowledge gaps. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks including sequence labeling, generation, and inference. It also exhibits strong robustness in unseen scientific tasks, validating its potential to facilitate AI-augmented scientific discovery.
[4] No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
Flor Miriam Plaza-del-Arco,Paul Röttger,Nino Scherrer,Emanuele Borgonovo,Elmar Plischke,Dirk Hovy
Main category: cs.CL
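TL;DR: This study quantifies the effect of 15 sociodemographic persona prompts on false refusal rates in language models and finds that model capability and task type may matter more than the persona, suggesting that earlier estimates were too high.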
Details
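Motivation: Personalizing large language models (LLMs) may cause them to falsely refuse user requests, but prior work has not fully quantified the problem. This paper fills that gap by examining how persona prompts and other factors affect false refusal.
Contribution: (1) Quantifies the effect of 15 sociodemographic personas on false refusal; (2) proposes a sample-efficient Monte Carlo-based quantification method; (3) finds that model capability and task type have a larger effect on false refusal.
Method: (1) Test 15 sociodemographic persona prompts; (2) control for other factors with 16 models, 3 tasks, and 9 prompt paraphrases; (3) use a Monte Carlo-based method to quantify the effects efficiently.
Result: More capable models are less affected by personas; certain sociodemographic personas increase false refusal in some models; model choice and task type significantly influence false refusal, especially in sensitive content tasks.
Insight: The effect of persona prompts on false refusal may have been overestimated; model capability, task type, and biases in safety mechanisms are more important factors.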
Abstract: Large language models (LLMs) are increasingly integrated into our daily lives and personalized. However, LLM personalization might also increase unintended side effects. Recent work suggests that persona prompting can lead models to falsely refuse user requests. However, no work has fully quantified the extent of this issue. To address this gap, we measure the impact of 15 sociodemographic personas (based on gender, race, religion, and disability) on false refusal. To control for other factors, we also test 16 different models, 3 tasks (Natural Language Inference, politeness, and offensiveness classification), and nine prompt paraphrases. We propose a Monte Carlo-based method to quantify this issue in a sample-efficient manner. Our results show that as models become more capable, personas impact the refusal rate less and less. Certain sociodemographic personas increase false refusal in some models, which suggests underlying biases in the alignment strategies or safety mechanisms. However, we find that the model choice and task significantly influence false refusals, especially in sensitive content tasks. Our findings suggest that persona effects have been overestimated, and might be due to other factors.
[5] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion
Kosei Uemura,David Guzmán,Quang Phuoc Nguyen,Jesujoba Oluwadara Alabi,En-shiun Annie Lee,David Ifeoluwa Adelani
Main category: cs.CL
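TL;DR: MERLIN is a two-stage model-stacking framework that follows a curriculum from general bilingual data to task-specific data and adapts only a small set of DoRA weights, substantially improving reasoning in low-resource languages (LRLs). On the AfriMGSM benchmark, MERLIN improves accuracy by 12.9 percentage points over MindMerger and also outperforms GPT-4o-mini.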
Details
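Motivation: Large language models excel in English but fall short on complex reasoning in low-resource languages. Existing encoder-plus-decoder methods work for mid- and high-resource languages but leave a large gap on LRLs.
Contribution: The MERLIN framework, which uses curriculum learning and adapts a small set of DoRA weights to substantially improve LRL reasoning while performing well across language settings.
Method: A two-stage model-stacking framework with a curriculum learning strategy that moves from general bilingual data to task-specific data, optimizing only a small set of DoRA weights.
Result: On AfriMGSM, accuracy improves by 12.9 percentage points over MindMerger and exceeds GPT-4o-mini; MGSM and MSVAMP improve by 0.9 and 2.8 percentage points, respectively.
Insight: Curriculum learning with lightweight weight adaptation can narrow the performance gap between LRLs and high-resource languages, and the approach applies across resource settings.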
Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.
[6] Bias after Prompting: Persistent Discrimination in Large Language Models
Nivedha Sivakumar,Natalie Mackraz,Samira Khorshidi,Krishna Patel,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff
Main category: cs.CL
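TL;DR: The paper finds that biases in large language models (LLMs) transfer to downstream tasks through prompting, and that existing prompt-based debiasing methods do not consistently reduce this transfer.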
Details
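Motivation: To show that bias transfer persists under prompt adaptation, challenging the earlier assumption that biases do not transfer from pre-trained models to downstream tasks.
Contribution: (1) Demonstrates that biases transfer through prompting; (2) exposes the limitations of existing prompt-based debiasing methods; (3) provides concrete figures on bias transfer across tasks and demographic groups.
Method: (1) Study the bias transfer hypothesis (BTH) in causal models; (2) analyze how bias persists after prompt adaptation; (3) evaluate the effectiveness of several prompt-based debiasing strategies.
Result: Intrinsic biases and biases after prompt adaptation are moderately to strongly correlated (for example, rho ≥ 0.94 for gender), and no existing method consistently reduces bias transfer.
Insight: (1) Correcting bias in the intrinsic model may help prevent its propagation to downstream tasks; (2) prompt-based debiasing methods need to be tuned for different tasks and groups.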
Abstract: A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks – for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduces bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
[7] Verbalized Algorithms
Supriya Lall,Christian Farrell,Hari Pathanjaly,Marko Pavic,Sarvesh Chezhian,Masataro Asai
Main category: cs.CL
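TL;DR: The paper proposes a paradigm called "verbalized algorithms" (VAs) that decomposes a task into simple natural-language operations and limits the scope of the LLM to those operations, making reasoning tasks more reliable.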
Details
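Motivation: Querying an LLM in a single shot is unreliable. The authors instead combine classical algorithms with LLMs by decomposing a task into simple operations that an LLM can handle reliably.
Contribution: The concept of verbalized algorithms, which decompose complex tasks into simple operations and use the LLM as an oracle for those elementary operations, for example as a binary comparator in a sorting task.
Method: Use the framework of a classical algorithm (such as a bitonic sorting network) to decompose the task into simple operations on natural language strings, and restrict the LLM to those subtasks.
Result: The approach is shown to be effective on sorting and clustering tasks.
Insight: Combining LLMs with classical algorithms improves controllability and reliability, and suggests a new direction for targeted optimization of LLMs.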
Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call verbalized algorithms (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that LLMs should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, verbalized sorting uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
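The verbalized sorting idea in this abstract can be illustrated with a small sketch: a bitonic sorting network in which every compare-exchange step is delegated to a binary comparison oracle. This is not the authors' implementation. The function names are hypothetical, and `length_oracle` is a deterministic stand-in for what would be an LLM call in the paper's setting. The network shown requires a power-of-two input length.

```python
from typing import Callable, List

def bitonic_sort(items: List[str],
                 less_than: Callable[[str, str], bool]) -> List[str]:
    """Sort with a bitonic network; every comparison goes through the oracle."""
    n = len(items)
    if n & (n - 1):
        raise ValueError("this sketch needs a power-of-two input length")
    out = list(items)
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                  # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # One oracle call per compare-exchange.
                    if ascending:
                        out_of_order = less_than(out[partner], out[i])
                    else:
                        out_of_order = less_than(out[i], out[partner])
                    if out_of_order:
                        out[i], out[partner] = out[partner], out[i]
            j //= 2
        k *= 2
    return out

def length_oracle(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM prompt such as 'Does A come before B?'."""
    return len(a) < len(b)

print(bitonic_sort(["ccc", "a", "dddd", "bb"], length_oracle))
# -> ['a', 'bb', 'ccc', 'dddd']
```

Because the sequence of comparisons in a sorting network is fixed in advance, the oracle calls within one stage do not depend on each other, which is one reason such a network is a natural fit for batched oracle queries.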
[8] Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions
Eve Fleisig,Matthias Orlikowski,Philipp Cimiano,Dan Klein
Main category: cs.CL
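TL;DR: The paper studies how to balance annotation quality and variation when filtering spam annotators from labeled data. It finds that conventional filtering methods often remove annotators who disagree rather than actual spammers, and recommends conservative filtering.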
Details
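Motivation: On subjective tasks, labeled data must preserve variation to reflect the true distribution of opinions, but conventional spam filtering can mistake disagreement for low quality and bias the data. This paper studies how to balance annotator reliability and variation.
Contribution: (1) Empirically evaluates how a range of annotator-filtering heuristics affect label variation; (2) finds that conventional methods often remove disagreeing annotators instead of spammers; (3) recommends conservative removal (<5%) to limit error; (4) characterizes spammer behavior (fixed answers rather than random ones).
Method: Using synthetic spam and real annotation data, compare how several annotator-filtering heuristics (such as agreement- or randomness-based metrics) affect label diversity and accuracy.
Result: All tested methods increase the mean absolute error from the true average label once removal exceeds about 5%. Most spammers are distributionally indistinguishable from real annotators, and the distinguishable minority tend to give fixed answers rather than random ones.
Insight: On tasks that must preserve variation, conventional filters that treat variation as noise perform poorly; new filtering methods are needed to separate genuine spam from legitimate disagreement.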
Abstract: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are less random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
[9] Towards Knowledge-Aware Document Systems: Modeling Semantic Coverage Relations via Answerability Detection
Yehudit Aperstein,Alon Gottlib,Gal Benita,Alexander Apartsin
Main category: cs.CL
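TL;DR: This paper proposes a question answering (QA)-based framework for modeling Semantic Coverage Relations (SCR), which describe how the informational content of a document pair aligns. It builds a synthetic dataset and compares generative and discriminative models on SCR prediction.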
Details
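Motivation: Understanding how information is shared across documents is critical for tasks such as information retrieval, summarization, and content alignment. There is currently no systematic way to quantify semantic relations between documents, especially information overlap across different forms of expression.
Contribution: 1. A QA-based framework for modeling SCR that defines three core relation types (equivalence, inclusion, and semantic overlap).
2. A synthetic dataset with controlled content overlap for precise evaluation of model performance.
3. Evidence that discriminative models are better at SCR prediction, offering a new approach to semantic relation analysis.
Method: 1. Generate synthetic data from the SQuAD dataset, controlling semantic relations through paraphrasing and selective omission of information.
2. Use answerability as the indicator of semantic coverage and evaluate generative and discriminative models on SCR classification.
Result: Discriminative models significantly outperform generative ones: RoBERTa-base reaches the highest accuracy (61.4%) and the Random Forest-based model the best macro-F1 score (52.9%).
Insight: QA provides an effective lens for analyzing semantic relations and reveals how well current models reason about information beyond surface similarity. Discriminative models are better suited to this task.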
Abstract: Understanding how information is shared across documents, regardless of the format in which it is expressed, is critical for tasks such as information retrieval, summarization, and content alignment. In this work, we introduce a novel framework for modelling Semantic Coverage Relations (SCR), which classifies document pairs based on how their informational content aligns. We define three core relation types: equivalence, where both texts convey the same information using different textual forms or styles; inclusion, where one document fully contains the information of another and adds more; and semantic overlap, where each document presents partially overlapping content. To capture these relations, we adopt a question answering (QA)-based approach, using the answerability of shared questions across documents as an indicator of semantic coverage. We construct a synthetic dataset derived from the SQuAD corpus by paraphrasing source passages and selectively omitting information, enabling precise control over content overlap. This dataset allows us to benchmark generative language models and train transformer-based classifiers for SCR prediction. Our findings demonstrate that discriminative models significantly outperform generative approaches, with the RoBERTa-base model achieving the highest accuracy of 61.4% and the Random Forest-based model showing the best balance with a macro-F1 score of 52.9%. The results show that QA provides an effective lens for assessing semantic relations across stylistically diverse texts, offering insights into the capacity of current models to reason about information beyond surface similarity. The dataset and code developed in this study are publicly available to support reproducibility.
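To make the three relation types concrete, here is a minimal sketch that labels a document pair from the sets of shared questions each document can answer. The rule-based decision and the function name are assumptions for illustration only. The paper trains classifiers for this step rather than applying set logic, and the "unrelated" label for pairs with no commonly answerable question is a hypothetical addition that falls outside the paper's three types.

```python
from typing import Set

def coverage_relation(answerable_in_a: Set[str],
                      answerable_in_b: Set[str]) -> str:
    """Label a document pair from the questions each document can answer."""
    if not (answerable_in_a & answerable_in_b):
        return "unrelated"            # hypothetical label, not one of the three types
    if answerable_in_a == answerable_in_b:
        return "equivalence"          # same information, possibly different wording
    if answerable_in_a < answerable_in_b or answerable_in_b < answerable_in_a:
        return "inclusion"            # one document covers the other and adds more
    return "semantic overlap"         # each document has content the other lacks

print(coverage_relation({"q1", "q2"}, {"q1", "q2"}))   # -> equivalence
print(coverage_relation({"q1"}, {"q1", "q2"}))         # -> inclusion
print(coverage_relation({"q1", "q2"}, {"q2", "q3"}))   # -> semantic overlap
```

In practice answerability judgments are noisy, which is presumably why a learned classifier is preferable to exact set comparison.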
[10] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications
Anran Li,Lingfei Qian,Mengmeng Du,Yu Yin,Yan Hu,Zihao Sun,Yihang Fu,Erica Stutz,Xuguang Ai,Qianqian Xie,Rui Zhu,Jimin Huang,Yifan Yang,Siru Liu,Yih-Chung Tham,Lucila Ohno-Machado,Hyunghoon Cho,Zhiyong Lu,Hua Xu,Qingyu Chen
Main category: cs.CL
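TL;DR: This study presents the first comprehensive evaluation of memorization in large language models (LLMs) in medicine, covering its prevalence, characteristics, and potential impact, and offers practical recommendations.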
Details
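Motivation: LLMs are widely applied in medicine, but how much they memorize their training data has not been studied systematically, which may affect how models are developed and deployed.
Contribution: The first systematic evaluation of memorization in medical LLMs, with an analysis across adaptation scenarios, a categorization of memorization types, and practical recommendations.
Method: Memorization is evaluated systematically across three adaptation scenarios (continued pretraining, fine-tuning on standard benchmarks, and fine-tuning on real-world clinical data), including more than 13,000 patient records.
Result: Memorization is prevalent in every scenario and higher than reported in the general domain. It falls into three types (beneficial, uninformative, and harmful) and affects both the development and adoption of medical applications.
Insight: Memorization cuts both ways for medical LLMs: it should be managed to improve accuracy, reduce uninformative memorization, and prevent leakage of sensitive information.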
Abstract: Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data. In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.
[11] Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
Neil Zeghidour,Eugene Kharitonov,Manu Orsini,Václav Volhejn,Gabriel de Marmiesse,Edouard Grave,Patrick Pérez,Laurent Mazaré,Alexandre Défossez
Main category: cs.CL
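TL;DR: The paper proposes Delayed Streams Modeling (DSM), a flexible approach to streaming, multimodal sequence-to-sequence learning that enables efficient streaming inference by aligning streams in pre-processing and delaying them relative to each other.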
Details
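Motivation: Sequence-to-sequence methods are usually offline and do not fit streaming scenarios. DSM addresses this and supports arbitrarily long sequences and any combination of input streams.
Contribution: The DSM framework, which uses delayed alignment and multimodal stream processing for streaming sequence-to-sequence generation, with strong results on ASR and TTS.
Method: DSM moves temporal alignment to a pre-processing step, introduces delays between streams, and models the already-aligned input and output streams with a decoder-only language model.
Result: Experiments show that DSM achieves state-of-the-art performance and latency on ASR and TTS and is even competitive with offline baselines.
Insight: DSM reduces the hard problem of streaming alignment to a choice of delays applied in pre-processing, giving a flexible solution for multimodal sequence tasks.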
Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
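The role of the delay can be sketched with two already time-aligned token streams: shifting the text stream later gives an ASR-style layout (text is emitted after the audio it depends on), and shifting the audio stream later gives a TTS-style layout. The helper names, the `<pad>` symbol, and the toy tokens below are assumptions for illustration and do not come from the DSM codebase.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding symbol

def delay_stream(stream: Sequence[str], delay: int, total: int) -> List[str]:
    """Shift a stream right by `delay` steps and pad it to `total` steps."""
    shifted = [PAD] * delay + list(stream)
    return shifted + [PAD] * (total - len(shifted))

def align_with_delay(audio: Sequence[str], text: Sequence[str],
                     audio_delay: int = 0,
                     text_delay: int = 0) -> List[Tuple[str, str]]:
    """Return per-timestep (audio, text) pairs after applying the delays."""
    total = max(len(audio) + audio_delay, len(text) + text_delay)
    return list(zip(delay_stream(audio, audio_delay, total),
                    delay_stream(text, text_delay, total)))

audio = ["a0", "a1", "a2"]
text = ["hi", PAD, "yo"]          # text tokens already aligned to audio frames

# ASR-style layout: the text stream is delayed by one step.
print(align_with_delay(audio, text, text_delay=1))
# -> [('a0', '<pad>'), ('a1', 'hi'), ('a2', '<pad>'), ('<pad>', 'yo')]
```

A decoder-only model trained on such rows would, at each step, see the audio frame for that step together with all earlier rows before predicting the delayed text token.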
[12] A Survey of Reinforcement Learning for Large Reasoning Models
Kaiyan Zhang,Yuxin Zuo,Bingxiang He,Youbang Sun,Runze Liu,Che Jiang,Yuchen Fan,Kai Tian,Guoli Jia,Pengfei Li,Yu Fu,Xingtai Lv,Yuchen Zhang,Sihang Zeng,Shang Qu,Haozhan Li,Shijie Wang,Yuru Wang,Xinwei Long,Fangfu Liu,Xiang Xu,Jiaze Ma,Xuekai Zhu,Ermo Hua,Yihao Liu,Zonglin Li,Huayu Chen,Xiaoye Qu,Yafu Li,Weize Chen,Zhenzhao Yuan,Junqi Gao,Dong Li,Zhiyuan Ma,Ganqu Cui,Zhiyuan Liu,Biqing Qi,Ning Ding,Bowen Zhou
Main category: cs.CL
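TL;DR: This paper surveys recent progress in reinforcement learning (RL) for large reasoning models (LRMs), covering its success in improving the logical reasoning of large language models (LLMs) and the challenges it faces.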
Details
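Motivation: With RL succeeding on LLM tasks such as mathematics and coding, its scalability problems must be solved to advance LRMs and move toward Artificial SuperIntelligence (ASI).
Contribution: A comprehensive survey of RL applied to LLMs and LRMs, covering foundational components, core problems, training resources, and downstream applications, to guide future research.
Method: A literature review and analysis that summarizes the key techniques and challenges of RL for LLMs and LRMs.
Result: Identifies the resource, algorithm, and infrastructure challenges of scaling RL, and discusses future directions.
Insight: RL is a key method for improving the reasoning of LRMs, but applying it at scale still requires solving challenges on several fronts.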
Abstract: In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
cs.CV
[13] 3D and 4D World Modeling: A Survey
Lingdong Kong,Wesley Yang,Jianbiao Mei,Youquan Liu,Ao Liang,Dekai Zhu,Dongyue Lu,Wei Yin,Xiaotao Hu,Mingkai Jia,Junyuan Deng,Kaiwen Zhang,Yang Wu,Tianyi Yan,Shenyuan Gao,Song Wang,Linfeng Li,Liang Pan,Yong Liu,Jianke Zhu,Wei Tsang Ooi,Steven C. H. Hoi,Ziwei Liu
Main category: cs.CV
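TL;DR: This paper is the first comprehensive survey of 3D and 4D world modeling. It proposes precise definitions and a taxonomy, summarizes datasets and evaluation metrics, and discusses practical applications and future research directions.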
Details
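Motivation: Prior work focuses largely on generative methods for 2D images and video and overlooks native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds. The lack of a standardized definition and taxonomy for "world models" has also led to fragmented and inconsistent claims in the literature.
Contribution: (1) Precise definitions and a structured taxonomy for 3D and 4D world modeling (VideoGen, OccGen, LiDARGen); (2) a summary of datasets and evaluation metrics tailored to 3D/4D settings; (3) a discussion of practical applications and open challenges.
Method: A systematic literature review that organizes and analyzes work on 3D and 4D world modeling and proposes a taxonomy of video-based, occupancy-based, and LiDAR-based approaches.
Result: A comprehensive overview of 3D and 4D world modeling, including the taxonomy, datasets, evaluation metrics, and future research directions.
Insight: 3D and 4D representations offer clear advantages for modeling dynamic environments; standardized definitions and evaluation metrics will help the field advance; future directions include multimodal fusion and real-time modeling.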
Abstract: World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for “world models” has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey
[14] Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Hyungjin Chung,Hyelin Nam,Jiyeon Kim,Hyojun Go,Byeongjun Park,Junho Kim,Joonseok Lee,Seongsu Ha,Byung-Hoon Kim
Main category: cs.CV
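TL;DR: Video Parallel Scaling (VPS) is an inference-time method that processes different subsets of a video's frames in parallel and aggregates the results, expanding the perceptual bandwidth of VideoLLMs without enlarging the context window.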
Details
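Motivation: Feeding VideoLLMs more frames to capture fine-grained temporal detail leads to prohibitive computational cost and degraded performance. A method is needed that improves performance without lengthening the context.
Contribution: VPS, which improves performance through parallel inference streams and probability aggregation, with a theoretical argument that it effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence.
Method: VPS runs multiple parallel inference streams, each processing a disjoint subset of the video's frames, and aggregates their output probabilities to integrate richer visual information.
Result: Experiments across model architectures and scales (2B-32B) show that VPS consistently and significantly improves performance on benchmarks such as Video-MME and EventHallusion, and scales more favorably than other parallel alternatives.
Insight: VPS is a memory-efficient and robust framework for strengthening the temporal reasoning of VideoLLMs, and it is complementary to other decoding strategies.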
Abstract: Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model’s perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video’s frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
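The two steps named in this abstract, splitting frames into disjoint subsets and aggregating the streams' output probabilities, can be sketched as follows. The strided split and the plain arithmetic mean are assumptions chosen for simplicity, since the abstract does not specify either choice. The function names are hypothetical, and the model call that would turn each frame subset into a next-token distribution is left out.

```python
from typing import List, Sequence

def disjoint_frame_subsets(num_frames: int, num_streams: int) -> List[List[int]]:
    """Split frame indices into disjoint, strided subsets, one per stream."""
    if num_streams < 1:
        raise ValueError("need at least one stream")
    return [list(range(s, num_frames, num_streams)) for s in range(num_streams)]

def aggregate_probabilities(stream_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the next-token distributions produced by the parallel streams."""
    num_streams = len(stream_probs)
    vocab_size = len(stream_probs[0])
    return [sum(p[v] for p in stream_probs) / num_streams
            for v in range(vocab_size)]

# Each stream sees the same number of frames as a single pass would,
# so the context window per stream does not grow.
print(disjoint_frame_subsets(8, 2))
# -> [[0, 2, 4, 6], [1, 3, 5, 7]]

# Toy distributions over a two-token vocabulary from two streams.
print(aggregate_probabilities([[0.5, 0.5], [1.0, 0.0]]))
# -> [0.75, 0.25]
```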
[15] Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change
Lata Pangtey,Omkar Kabde,Shahid Shafi Dar,Nagendra Kumar
Main category: cs.CV
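TL;DR: This paper proposes a two-stage, LLM-based multimodal stance detection framework that combines textual and visual information to classify climate-related stances in social media content. It outperforms existing methods on the MultiClimate dataset.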
Details
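Motivation: Social media content is increasingly multimodal, while existing stance detection methods rely mainly on text. To close this gap, the paper proposes a multimodal method that combines textual and visual information.
Contribution: (1) A staged multimodal stance detection framework; (2) a large language model combined with a domain-aware image caption generator to process multimodal data; (3) better performance than existing methods on the MultiClimate dataset.
Method: (1) Use a large language model to produce stance-relevant summaries of the source text; (2) interpret visual content with a domain-aware image caption generator; (3) jointly model the modalities with a specialized transformer module that captures their interactions.
Result: On the MultiClimate dataset, the method achieves 76.2% accuracy, 76.3% precision, 76.2% recall, and a 76.2% F1-score, outperforming existing approaches.
Insight: Jointly modeling multimodal data can substantially improve stance detection, especially on complex topics such as climate change.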
Abstract: With the rapid proliferation of information across digital platforms, stance detection has emerged as a pivotal challenge in social media analysis. While most of the existing approaches focus solely on textual data, real-world social media content increasingly combines text with visual elements, creating a need for advanced multimodal methods. To address this gap, we propose a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach. Our method first employs a Large Language Model to retrieve stance-relevant summaries from source text, while a domain-aware image caption generator interprets visual content in the context of the target topic. These modalities are then jointly modeled along with the reply text, through a specialized transformer module that captures interactions between the texts and images. The proposed modality fusion framework integrates diverse modalities to facilitate robust stance classification. We evaluate our approach on the MultiClimate dataset, a benchmark for climate change-related stance detection containing aligned video frames and transcripts. We achieve an accuracy of 76.2%, precision of 76.3%, recall of 76.2%, and F1-score of 76.2%, outperforming existing state-of-the-art approaches.
[16] Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing
Miao Cao,Siming Zheng,Lishun Wang,Ziyang Chen,David Brady,Xin Yuan
Main category: cs.CV
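TL;DR: The paper proposes an Ultra-Sparse Sampling (USS) strategy and BSTFormer, a sparse Transformer, for video compressive sensing, significantly improving reconstruction from sparsely sampled measurements.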
Details
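Motivation: Capturing high-resolution, high-frame-rate video is power-hungry, and current video compressive sensing based on Random Sampling (RS) is not efficient enough. Ultra-sparse sampling is proposed to cut power consumption and raise dynamic range.
Contribution: (1) The USS strategy, which is more efficient and has a higher dynamic range than RS; (2) BSTFormer, a sparse Transformer that uses local block attention, global sparse attention, and global temporal attention to handle USS measurements that cannot be perfectly decomposed.
Method: (1) Design the USS strategy, in which only one sub-frame is set to 1 at each spatial location; (2) build BSTFormer, combining local block attention, global sparse attention, and global temporal attention; (3) validate the USS strategy with a DMD encoding system.
Result: On both simulated and real-world data, BSTFormer significantly outperforms previous methods, and the USS strategy offers higher dynamic range and a fixed exposure time.
Insight: (1) Ultra-sparse sampling offers better energy efficiency and dynamic range for compressive sensing; (2) a sparse Transformer with attention at multiple granularities handles the decomposition of sparse measurements effectively.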
Abstract: Digital cameras consume ~0.1 microjoule per pixel to capture and encode video, resulting in a power usage of ~20W for a 4K sensor operating at 30 fps. Imagining gigapixel cameras operating at 100-1000 fps, the current processing model is unsustainable. To address this, physical layer compressive measurement has been proposed to reduce power consumption per pixel by 10-100X. Video Snapshot Compressive Imaging (SCI) introduces high frequency modulation in the optical sensor layer to increase effective frame rate. A commonly used sampling strategy of video SCI is Random Sampling (RS) where each mask element value is randomly set to be 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of the image pixels. Inspired by I2P, we propose an Ultra-Sparse Sampling (USS) regime, where at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micro-mirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, we can decompose the USS measurement into sub-measurements for which we can utilize I2P algorithms to recover high-speed frames. However, due to the mismatch between the DMD and CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse TransFormer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range than that of the RS strategy. Finally, from the application perspective, the USS strategy is a good choice to implement a complete video SCI system on chip due to its fixed exposure time.
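The USS constraint described in this abstract (exactly one sub-frame open at each spatial location) and the resulting snapshot measurement can be sketched in plain Python. Which sub-frame is open at each pixel is chosen uniformly at random here. That choice, the function names, and the toy frame values are assumptions for illustration, since the abstract does not say how the open sub-frame is selected.

```python
import random
from typing import List

Volume = List[List[List[int]]]   # indexed [sub_frame][row][col]

def uss_masks(num_subframes: int, height: int, width: int, seed: int = 0) -> Volume:
    """Build masks with exactly one sub-frame set to 1 at each spatial location."""
    rng = random.Random(seed)
    masks = [[[0] * width for _ in range(height)] for _ in range(num_subframes)]
    for y in range(height):
        for x in range(width):
            # Assumption: the open sub-frame is picked uniformly at random.
            masks[rng.randrange(num_subframes)][y][x] = 1
    return masks

def snapshot(frames: Volume, masks: Volume) -> List[List[int]]:
    """Single 2D measurement: the sum over sub-frames of mask * frame."""
    t_count, height, width = len(masks), len(masks[0]), len(masks[0][0])
    return [[sum(masks[t][y][x] * frames[t][y][x] for t in range(t_count))
             for x in range(width)]
            for y in range(height)]

masks = uss_masks(num_subframes=4, height=3, width=5)
# Toy video: sub-frame t has the constant pixel value t + 1.
frames = [[[t + 1] * 5 for _ in range(3)] for t in range(4)]
measurement = snapshot(frames, masks)
# Each measured pixel equals the value of the single sub-frame open there.
```

Because each pixel integrates light from only one sub-frame, every pixel has the same exposure time, which matches the fixed-exposure property the abstract highlights; under random sampling a pixel may sum any number of sub-frames.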
[17] GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation
Seongho Kim,Sejong Ryu,Hyoukjun You,Je Hyeong Hong
Main category: cs.CV
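TL;DR: GTA-Crime is a synthetic fatal-violence detection dataset and generation framework built on GTA5 that addresses data scarcity and ethical concerns in real-world settings. With a snippet-level domain adaptation strategy (Wasserstein adversarial training), the method improves detection accuracy on real datasets such as UCF-Crime.
Details
Motivation: Data on real-world fatal violence (such as shootings and stabbings) is hard to obtain and raises ethical issues, and existing video anomaly detection methods perform poorly on these scenarios.
Contribution: Proposes the GTA-Crime dataset and generation framework, and aligns synthetic features with real-world features (such as UCF-Crime) through Wasserstein adversarial training.
Method: Generates synthetic data with GTA5 and applies a snippet-level domain adaptation strategy (Wasserstein adversarial training) to improve feature consistency between synthetic and real data.
Result: Experiments show that GTA-Crime and its domain adaptation strategy significantly improve real-world fatal violence detection accuracy.
Insight: Synthetic data can effectively supplement real data where it is scarce, and adversarial training performs well for few-shot domain adaptation.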
Abstract: Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at https://github.com/ta-ho/GTA-Crime.
[18] RepViT-CXR: A Channel Replication Strategy for Vision Transformers in Chest X-ray Tuberculosis and Pneumonia Classification
Faisal Ahmed
Main category: cs.CV
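TL;DR: RepViT-CXR proposes a channel replication strategy that adapts single-channel chest X-ray images to ViT architectures, significantly improving tuberculosis and pneumonia classification performance.
Details
Motivation: Chest X-ray (CXR) imaging is an important tool for detecting tuberculosis and pneumonia, but most ViT models are trained on three-channel natural images and cannot directly process single-channel CXR images, so an adaptation method is needed.
Contribution: Proposes RepViT-CXR, a simple channel replication strategy that converts single-channel CXR images into ViT-compatible three-channel inputs without introducing information loss, significantly improving classification performance.
Method: Replicates the grayscale information of a single-channel CXR image to form a three-channel input suited to ViT architectures, and validates performance on three benchmark datasets.
Result: Achieves 99.9% accuracy and 99.9% AUC on the TB-CXR dataset, surpassing Topo-CXR; exceeds 99% recall and precision on the Pediatric Pneumonia dataset; and also outperforms CNN-based methods on the Shenzhen TB dataset.
Insight: A simple channel replication strategy can effectively adapt ViT models to single-channel medical imaging tasks, showing the strong potential of ViTs in medical image analysis.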
Abstract: Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for detecting pulmonary diseases such as tuberculosis (TB) and pneumonia. Recent advances in deep learning, particularly Vision Transformers (ViTs), have shown strong potential for automated medical image analysis. However, most ViT architectures are pretrained on natural images and require three-channel inputs, while CXR scans are inherently grayscale. To address this gap, we propose RepViT-CXR, a channel replication strategy that adapts single-channel CXR images into a ViT-compatible format without introducing additional information loss. We evaluate RepViT-CXR on three benchmark datasets. On the TB-CXR dataset, our method achieved an accuracy of 99.9% and an AUC of 99.9%, surpassing prior state-of-the-art methods such as Topo-CXR (99.3% accuracy, 99.8% AUC). For the Pediatric Pneumonia dataset, RepViT-CXR obtained 99.0% accuracy, with 99.2% recall, 99.3% precision, and an AUC of 99.0%, outperforming strong baselines including DCNN and VGG16. On the Shenzhen TB dataset, our approach achieved 91.1% accuracy and an AUC of 91.2%, marking a performance improvement over previously reported CNN-based methods. These results demonstrate that a simple yet effective channel replication strategy allows ViTs to fully leverage their representational power on grayscale medical imaging tasks. RepViT-CXR establishes a new state of the art for TB and pneumonia detection from chest X-rays, showing strong potential for deployment in real-world clinical screening systems.
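Channel replication itself is a one-line array operation. The sketch below shows it in NumPy, assuming the image is stored as an (H, W) array; the function name `replicate_channels` is illustrative, and any resizing or normalization the authors apply is not shown.

```python
import numpy as np

def replicate_channels(gray):
    # gray: (H, W) single-channel image -> (H, W, 3) with three identical channels.
    if gray.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    return np.repeat(gray[:, :, None], 3, axis=2)

# Toy stand-in for a chest X-ray (assumption: real inputs would be larger).
cxr = np.arange(12, dtype=np.float32).reshape(3, 4)
rgb_like = replicate_channels(cxr)
```

Every output channel equals the input, so no pixel information is discarded or altered by the conversion.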
[19] Symmetry Interactive Transformer with CNN Framework for Diagnosis of Alzheimer’s Disease Using Structural MRI
Zheng Yang,Yanteng Zhang,Xupeng Kou,Yang Liu,Chao Ren
Main category: cs.CV
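TL;DR: This paper proposes a network that combines a 3D CNN encoder with a Symmetry Interactive Transformer (SIT) to diagnose Alzheimer's disease (AD) from sMRI, focusing on asymmetric features between the left and right brain hemispheres to improve diagnostic accuracy.
Details
Motivation: Existing deep learning methods for sMRI-based AD diagnosis overlook the asymmetric features caused by brain disorders, so the authors propose a new network structure that captures and exploits this asymmetry to improve diagnostic performance.
Contribution: The main contribution is the combination of a Symmetry Interactive Transformer (SIT) with a 3D CNN encoder, which focuses on asymmetric features of the left and right hemispheres and reaches an AD diagnostic accuracy of 92.5%.
Method: A 3D CNN encoder extracts features, and the Symmetry Interactive Transformer (SIT) module captures AD-induced structural asymmetry through feature alignment and interactive learning.
Result: On the ADNI dataset, the method achieves a diagnostic accuracy of 92.5%, outperforming other CNN methods and CNNs combined with a general transformer. Visualizations also show that the network attends to regions of brain atrophy, especially the asymmetric pathological characteristics induced by AD.
Insight: The paper highlights the importance of left-right hemispheric asymmetry in AD diagnosis and offers an effective way to capture and use it, suggesting a new direction for deep learning in medical image analysis.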
Abstract: Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer’s disease (AD). Existing studies have used CNNs and transformers to build well-performing networks, but most of them rely on pretraining or ignore the asymmetric characteristics caused by brain disorders. We propose an end-to-end network, consisting of a 3D CNN encoder and a Symmetry Interactive Transformer (SIT), for detecting the disease-related asymmetry induced by left and right brain atrophy. Following the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. SIT can help the model focus more on the regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method on the ADNI dataset, and the results show that the method achieves better diagnostic accuracy (92.5%) compared to several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention to regions of brain atrophy, especially the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.
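The idea of aligning corresponding left and right hemisphere positions can be illustrated with a simple mirror operation. This is only a sketch under stated assumptions: it works on a raw volume rather than CNN features, assumes axis 0 is the left-right axis with the midsagittal plane at its center, and the name `split_and_mirror` is hypothetical. The paper's inter-equal grid block fetch operation is not specified in the abstract and is not reproduced here.

```python
import numpy as np

def split_and_mirror(volume):
    # volume: (X, Y, Z); assumption: X is the left-right axis, midline at X // 2.
    mid = volume.shape[0] // 2
    left = volume[:mid]
    right = volume[-mid:][::-1]  # mirror so right[i] is the counterpart of left[i]
    return left, right

# Toy volume: slices 0 and 5 are symmetric, slice 1 has no match in slice 4.
vol = np.zeros((6, 2, 2))
vol[0] = 1.0
vol[5] = 1.0
vol[1] = 2.0
left, right = split_and_mirror(vol)
asym = np.abs(left - right)  # large values mark left/right differences
```

In this toy case `asym` is zero for the symmetric slice pair and nonzero only where the two hemispheres differ, which is the kind of signal an asymmetry-aware module could attend to.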
[20] EVDI++: Event-based Video Deblurring and Interpolation via Self-Supervised Learning
Chi Zhang,Xiang Zhang,Chenxu Jiang,Gui-Song Xia,Lei Yu
Main category: cs.CV
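TL;DR: EVDI++ proposes a self-supervised learning framework that uses the high temporal resolution of event cameras to address frame blur and frame interpolation. It relies on a Learnable Double Integral network and an adaptive fusion strategy, and performs well on both synthetic and real-world datasets.
Details
Motivation: Conventional frame-based cameras produce noticeable motion blur and lose inter-frame information under long exposure times; the high temporal resolution of event cameras makes it possible to address this.
Contribution: 1. Proposes the EVDI++ framework, which handles video deblurring and interpolation in a unified way. 2. Designs the Learnable Double Integral network and an adaptive fusion strategy. 3. Proposes a self-supervised learning framework trained on real blurry videos and event data. 4. Builds a real-world dataset.
Method: 1. Uses the Learnable Double Integral network to estimate the mapping between reference frames and sharp latent images. 2. Introduces a learning-based reconstruction module to refine the results. 3. Designs an adaptive fusion strategy to integrate event data. 4. Trains in a self-supervised manner using the mutual constraints among blurry frames, latent images, and event streams.
Result: On synthetic and real-world datasets, EVDI++ achieves state-of-the-art performance in video deblurring and interpolation.
Insight: The high temporal resolution of event cameras can effectively address motion blur in frame-based cameras, and self-supervised learning can ease the shortage of labeled real-world data.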
Abstract: Frame-based cameras with extended exposure times often produce perceptible visual blurring and information loss between frames, significantly degrading video quality. To address this challenge, we introduce EVDI++, a unified self-supervised framework for Event-based Video Deblurring and Interpolation that leverages the high temporal resolution of event cameras to mitigate motion blur and enable intermediate frame prediction. Specifically, the Learnable Double Integral (LDI) network is designed to estimate the mapping relation between reference frames and sharp latent images. Then, we refine the coarse results and optimize overall training efficiency by introducing a learning-based division reconstruction module, enabling images to be converted with varying exposure intervals. We devise an adaptive parameter-free fusion strategy to obtain the final results, utilizing the confidence embedded in the LDI outputs of concurrent events. A self-supervised learning framework is proposed to enable network training with real-world blurry videos and events by exploring the mutual constraints among blurry frames, latent images, and event streams. We further construct a dataset with real-world blurry images and events using a DAVIS346c camera, demonstrating the generalizability of the proposed EVDI++ in real-world scenarios. Extensive experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art performance in video deblurring and interpolation tasks.
[21] Hyperspectral Mamba for Hyperspectral Object Tracking
Long Gao,Yunhe Zhang,Yan Jiang,Weiying Xie,Yunsong Li
Main category: cs.CV
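TL;DR: This paper proposes HyMamba, a new hyperspectral object tracking network that unifies spectral, cross-depth, and temporal modeling through state space modules. Its Spectral State Integration module and Hyperspectral Mamba module learn spatial and spectral information synchronously, achieving state-of-the-art performance on multiple benchmark datasets.
Details
Motivation: Hyperspectral object tracking is promising in complex scenes thanks to rich spectral information, but existing methods struggle to capture intrinsic spectral information, temporal dependencies, and cross-depth interactions.
Contribution: Proposes the HyMamba network, introducing the Spectral State Integration module and the Hyperspectral Mamba module to unify spectral, cross-depth, and temporal modeling, which significantly improves tracking performance.
Method: Builds joint features through state space modules (SSMs) and progressively refines and propagates spectral information, including SSMs that scan in three directions.
Result: Performs strongly on seven benchmark datasets; for example, it reaches an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset.
Insight: Combining original spectral features with false-color inputs, together with cross-depth and temporal modeling, can significantly improve hyperspectral object tracking performance.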
Abstract: Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba) is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset. The code will be released at https://github.com/lgao001/HyMamba.
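For readers unfamiliar with the two reported metrics, the sketch below computes them in the way commonly used for single-object tracking benchmarks: DP@20 as the fraction of frames whose center error is at most 20 pixels, and AUC as the mean success rate over IoU thresholds from 0 to 1. This is an assumption about the convention; the official HOTC2020 toolkit may differ in details such as the number of thresholds.

```python
import numpy as np

def dp_at(pred_centers, gt_centers, thresh=20.0):
    # Distance precision: share of frames with center error <= thresh pixels.
    err = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float((err <= thresh).mean())

def success_auc(ious, n_thresh=21):
    # Area under the success plot: mean success rate over IoU thresholds in [0, 1].
    thresholds = np.linspace(0.0, 1.0, n_thresh)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

# Toy per-frame predictions (hypothetical values).
pred = np.array([[0.0, 0.0], [30.0, 0.0], [10.0, 10.0]])
gt = np.zeros((3, 2))
dp20 = dp_at(pred, gt)
auc_perfect = success_auc(np.ones(5))
auc_lost = success_auc(np.zeros(5))
```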
[22] Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features
Saurav Sengupta,Nazanin Moradinasab,Jiebei Liu,Donald E. Brown
Main category: cs.CV
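TL;DR: This paper studies the performance of vision language models (VLMs) through a multi-dimensional experimental framework and finds that the models are highly sensitive to properties of the input (such as image size, number of objects, background color, and prompt specificity); small changes in these lead to significant differences in answer generation and performance.
Details
Motivation: Prior work shows that VLMs rely on inherent biases learned during training to answer questions, and perform poorly on specific questions that require focusing on image details. The paper aims to study systematically which input characteristics cause these performance differences.
Contribution: Proposes a multi-dimensional experimental framework for systematically analyzing the causes of VLM performance variation, and shows how subtle modifications to images and prompts significantly affect model answers and performance.
Method: Using open-source VLMs, varies parameters such as image size, number of objects, background color, and prompt specificity, and observes the changes in attention values and performance.
Result: Even minor changes to the input (such as image characteristics or prompt specificity) lead to significant changes in how VLMs answer and in their overall performance.
Insight: VLMs are highly sensitive to input characteristics; future improvements should focus on reducing reliance on inherent biases and improving the capture of visual details.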
Abstract: Recent research on Vision Language Models (VLMs) suggests that they rely on inherent biases learned during training to respond to questions about visual properties of an image. These biases are exacerbated when VLMs are asked highly specific questions that require focusing on specific areas of the image. For example, a VLM tasked with counting stars on a modified American flag (e.g., with more than 50 stars) will often disregard the visual evidence and fail to answer accurately. We build upon this research and develop a multi-dimensional examination framework to systematically determine which characteristics of the input data, including both the image and the accompanying prompt, lead to such differences in performance. Using open-source VLMs, we further examine how attention values fluctuate with varying input parameters (e.g., image size, number of objects in the image, background color, prompt specificity). This research aims to learn how the behavior of vision language models changes and to explore methods for characterizing such changes. Our results suggest, among other things, that even minor modifications in image characteristics and prompt specificity can lead to large changes in how a VLM formulates its answer and, subsequently, its overall performance.
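A multi-dimensional examination of this kind amounts to enumerating the Cartesian product of the input factors. The sketch below shows one way to build such a grid; the factor names follow the abstract, but every value is a hypothetical placeholder, not the authors' experimental setting.

```python
from itertools import product

# Hypothetical factor levels; the paper's actual values are not given in the abstract.
grid = {
    "image_size": [224, 448],
    "num_objects": [10, 50, 60],
    "background": ["white", "black"],
    "prompt": ["generic", "specific"],
}

# One dict per experimental condition, covering every combination of levels.
conditions = [dict(zip(grid, values)) for values in product(*grid.values())]
```

Each entry in `conditions` would then be rendered into an image and prompt, passed to the VLM, and logged with the answer and attention values so that effects can be attributed to a single factor.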
[23] Generalized Zero-Shot Learning for Point Cloud Segmentation with Evidence-Based Dynamic Calibration
Hyeonseok Kim,Byeongkeun Kang,Yeejin Lee
Main category: cs.CV
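TL;DR: Proposes E3DPC-GZSL, a new method that addresses biased predictions in generalized zero-shot learning for point cloud segmentation through an evidence-based uncertainty estimator and a dynamic calibration strategy, achieving state-of-the-art performance on the ScanNet v2 and S3DIS datasets.
Details
Motivation: In generalized zero-shot semantic segmentation of 3D point clouds, models tend to favor classes seen during training, and the problem is more severe in 3D tasks where training data is smaller in scale.
Contribution: Proposes E3DPC-GZSL, which integrates an evidence-based uncertainty estimator, dynamically calibrates prediction probabilities, and improves the learning strategy for the semantic space.
Method: 1. Introduces an evidence-based uncertainty estimator; 2. Adjusts prediction probabilities with a dynamic calibration factor; 3. Refines the semantic space by combining learnable parameters with text features.
Result: Achieves state-of-the-art performance on the ScanNet v2 and S3DIS datasets.
Insight: Evidence-based methods and semantic space refinement can effectively reduce the bias toward seen classes and improve generalization to unseen classes.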
Abstract: Generalized zero-shot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image-based tasks. To address this problem, we propose a novel method called E3DPC-GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC-GZSL tackles the overconfidence problem by integrating an evidence-based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC-GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text-derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS.
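The combination of evidential uncertainty and calibrated stacking can be sketched as follows. The uncertainty uses the standard evidential (Dirichlet) formulation, alpha = evidence + 1 and u = K / sum(alpha). The calibration rule, which subtracts an uncertainty-scaled penalty from seen-class probabilities, is an assumption made for illustration; the paper's exact dynamic stacking factor is not stated in the abstract, and `gamma` is a hypothetical hyperparameter.

```python
import numpy as np

def evidential_uncertainty(evidence):
    # evidence: (N, K) non-negative per-class evidence for N points.
    alpha = evidence + 1.0                       # Dirichlet parameters
    strength = alpha.sum(axis=1, keepdims=True)  # Dirichlet strength
    probs = alpha / strength                     # expected class probabilities
    u = (evidence.shape[1] / strength).squeeze(1)  # uncertainty in (0, 1]
    return probs, u

def dynamic_calibrated_stacking(probs, u, seen_mask, gamma=0.5):
    # Assumed rule: penalise seen classes more strongly at uncertain points.
    penalty = gamma * u[:, None] * seen_mask[None, :]
    return (probs - penalty).argmax(axis=1)

evidence = np.array([[9.0, 1.0, 0.0],   # confident point
                     [0.6, 0.0, 0.4]])  # uncertain point
seen_mask = np.array([1.0, 1.0, 0.0])   # classes 0 and 1 seen, class 2 unseen
probs, u = evidential_uncertainty(evidence)
labels = dynamic_calibrated_stacking(probs, u, seen_mask)
```

In this toy example the confident point keeps its seen-class label, while the uncertain point switches from seen class 0 to the unseen class 2 once the penalty is applied.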
[24] Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection
Yuelin Guo,Haoyu He,Zhiyuan Chen,Zitong Huang,Renhao Lu,Lu Shi,Zejun Wang,Weizhe Zhang
Main category: cs.CV
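TL;DR: The paper proposes a weakly supervised object detection method based on dual-thresholded heatmaps, addressing three main problems of existing methods through improved proposal selection and network architecture.
Details
Motivation: Current weakly supervised object detection methods suffer from inadequate proposal selection, a missing background class, and slow convergence; the paper aims to overcome these limitations.
Contribution: Proposes a heatmap-guided proposal selector (HGPS) algorithm, designs an enhanced weakly supervised basic detection network (WSBDN), and introduces a negative certainty supervision loss to accelerate convergence.
Method: Pre-selects proposals with dual-thresholded heatmaps to generate more accurate pseudo GT boxes, and improves the network architecture by adding a background class representation and heatmap pre-supervision.
Result: Achieves mAP/mCorLoc scores of 58.5%/81.8% on PASCAL VOC 2007 and 55.6%/80.5% on PASCAL VOC 2012, outperforming existing methods.
Insight: Dual-thresholded heatmaps and the introduction of a background class significantly improve the performance and convergence speed of weakly supervised object detection.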
Abstract: Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. At last, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at https://github.com/gyl2565309278/DTH-CP.
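To make the dual-threshold idea concrete, here is a toy proposal filter. The selection rule is an assumption chosen for illustration, not the HGPS algorithm itself: a box is kept if it contains at least one strongly activated pixel (at or above `t_high`) and most of its area is at least weakly activated (at or above `t_low`). All parameter names and values are hypothetical.

```python
import numpy as np

def select_proposals(heatmap, boxes, t_high=0.7, t_low=0.3, min_cover=0.5):
    # boxes: list of (x0, y0, x1, y1) in pixel coordinates, end-exclusive.
    keep = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        patch = heatmap[y0:y1, x0:x1]
        if patch.size == 0:
            continue
        has_peak = patch.max() >= t_high               # discriminative part present
        covered = (patch >= t_low).mean() >= min_cover  # box is mostly object
        if has_peak and covered:
            keep.append(i)
    return keep

heatmap = np.zeros((6, 6))
heatmap[1:4, 1:4] = 0.4  # weak activation over the object extent
heatmap[2, 2] = 0.9      # strong activation on the discriminative part
boxes = [(1, 1, 4, 4), (0, 0, 6, 6), (4, 4, 6, 6)]
kept = select_proposals(heatmap, boxes)
```

The tight box is kept, the oversized box fails the coverage test, and the background box fails the peak test, which mirrors the goal of boxes that cover the full object without spilling over.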
[25] An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia
M. Warizmi Wafiq,Peter Cutter,Ate Poortinga,Daniel Marc G. dela Torre,Karis Tenneson,Vanna Teck,Enikoe Bihari,Chanarun Saisaward,Weraphong Suaruang,Andrea McMahon,Andi Vika Faradiba Muin,Karno B. Batiran,Chairil A,Nurul Qomar,Arya Arismaya Metananda,David Ganz,David Saah
Main category: cs.CV
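TL;DR: This paper presents an open geospatial benchmark dataset to support sustainability monitoring and regulatory implementation for oil palm cultivation in Indonesia.
Details
Motivation: Oil palm cultivation is one of the main causes of deforestation in Indonesia, and the lack of high-quality training data limits the use of remote sensing, hindering sustainability monitoring and regulatory implementation.
Contribution: Provides an open-access dataset based on high-resolution satellite imagery that covers multiple agro-ecological zones and includes detailed annotations of oil palm planting stages and similar perennial crops.
Method: Uses expert labeling with multi-interpreter consensus, combined with field validation, to produce a comprehensive polygon-based annotated dataset suitable for training CNNs and geospatial foundation models.
Result: The dataset fills a gap in high-quality training data for remote sensing, supports transparent monitoring of oil palm expansion, and contributes to global deforestation reduction goals.
Insight: Open, high-quality datasets advance GeoAI and remote sensing and provide an important tool for sustainability monitoring.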
Abstract: Oil palm cultivation remains one of the leading causes of deforestation in Indonesia. To better track and address this challenge, detailed and reliable mapping is needed to support sustainability efforts and emerging regulatory frameworks. We present an open-access geospatial dataset of oil palm plantations and related land cover types in Indonesia, produced through expert labeling of high-resolution satellite imagery from 2020 to 2024. The dataset provides polygon-based, wall-to-wall annotations across a range of agro-ecological zones and includes a hierarchical typology that distinguishes oil palm planting stages as well as similar perennial crops. Quality was ensured through multi-interpreter consensus and field validation. The dataset was created using wall-to-wall digitization over large grids, making it suitable for training and benchmarking both conventional convolutional neural networks and newer geospatial foundation models. Released under a CC-BY license, it fills a key gap in training data for remote sensing and aims to improve the accuracy of land cover types mapping. By supporting transparent monitoring of oil palm expansion, the resource contributes to global deforestation reduction goals and follows FAIR data principles.
[26] SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training
Rongsheng Wang,Fenghe Tang,Qingsong Yao,Rui Yan,Xu Zhang,Zhen Huang,Haoran Lai,Zhiyang He,Xiaodong Tao,Zihang Jiang,Shaohua Kevin Zhou
Main category: cs.CV
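TL;DR: The SimCroP framework uses similarity-driven cross-granularity pre-training to improve the alignment and fusion of CT images and reports, strengthening representation learning for sparse lesions and outperforming existing methods on multiple downstream tasks.
Details
Motivation: Lesions in CT images are sparsely distributed and structurally complex, and the multi-granularity relationships between images and reports are hard to capture, so an effective pre-training method is needed to improve representation learning.
Contribution: 1. Proposes the SimCroP framework, combining similarity-driven alignment with cross-granularity fusion; 2. Designs multi-modal masked modeling to improve low-level semantic understanding; 3. Integrates information across granularities to better represent sparse lesions.
Method: 1. Multi-modal masked modeling learns low-level image semantics; 2. Similarity-driven alignment matches image patches to report sentences; 3. A cross-granularity fusion module integrates instance-level and word-patch-level information.
Result: On image classification and segmentation tasks across five public datasets, SimCroP outperforms existing self-supervised and cross-modal pre-training methods.
Insight: Cross-granularity alignment and fusion can effectively mitigate the sparsity of CT images, while multi-modal masked modeling captures fine-grained semantic information.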
Abstract: Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.
[27] Boosted Training of Lightweight Early Exits for Optimizing CNN Image Classification Inference
Yehudit Aperstein,Alexander Apartsin
Main category: cs.CV
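TL;DR: The paper proposes the Boosted Training Scheme for Early Exits (BTS-EE), which trains intermediate classifiers sequentially to address the covariance shift in conventional early-exit strategies. Combined with a lightweight branch architecture and Class Precision Margin calibration, it significantly improves CNN inference efficiency on resource-constrained platforms.
Details
Motivation: Real-time image classification on resource-constrained platforms must balance accuracy against computational cost. Conventional early-exit strategies suffer from a mismatch between training and inference data distributions (covariance shift), which limits the efficiency-accuracy trade-off.
Contribution: 1. Proposes the BTS-EE training method, which trains intermediate classifiers sequentially to address covariance shift; 2. Designs a lightweight branch architecture based on 1D convolutions; 3. Proposes the Class Precision Margin (CPM) calibration method for reliable exit decisions.
Method: 1. Uses sequential training (BTS-EE), training and calibrating branches one at a time; 2. Builds lightweight branches with 1D convolutions; 3. Tunes a threshold for each class with the CPM method.
Result: On the CINIC-10 dataset with ResNet18, BTS-EE outperforms non-boosted training across 64 configurations, reducing computation by up to 45% with only a 2% drop in accuracy.
Insight: BTS-EE improves inference efficiency and also offers new design options for deploying CNNs on resource-constrained platforms, with applications in industrial inspection, embedded vision, and UAV-based monitoring.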
Abstract: Real-time image classification on resource-constrained platforms demands inference methods that balance accuracy with strict latency and power budgets. Early-exit strategies address this need by attaching auxiliary classifiers to intermediate layers of convolutional neural networks (CNNs), allowing “easy” samples to terminate inference early. However, conventional training of early exits introduces a covariance shift: downstream branches are trained on full datasets, while at inference they process only the harder, non-exited samples. This mismatch limits efficiency–accuracy trade-offs in practice. We introduce the Boosted Training Scheme for Early Exits (BTS-EE), a sequential training approach that aligns branch training with inference-time data distributions. Each branch is trained and calibrated before the next, ensuring robustness under selective inference conditions. To further support embedded deployment, we propose a lightweight branch architecture based on 1D convolutions and a Class Precision Margin (CPM) calibration method that enables per-class threshold tuning for reliable exit decisions. Experiments on the CINIC-10 dataset with a ResNet18 backbone demonstrate that BTS-EE consistently outperforms non-boosted training across 64 configurations, achieving up to 45 percent reduction in computation with only 2 percent accuracy degradation. These results expand the design space for deploying CNNs in real-time image processing systems, offering practical efficiency gains for applications such as industrial inspection, embedded vision, and UAV-based monitoring.
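Early-exit inference with per-class thresholds can be sketched as below. This is a simplified stand-in, assuming each branch outputs a probability vector and that a sample exits when the predicted class's probability reaches that class's threshold; how CPM actually derives the thresholds is not described in the abstract, so the values here are hypothetical.

```python
import numpy as np

def early_exit_predict(branch_probs, class_thresholds):
    # branch_probs: list of (K,) probability vectors, ordered shallow to deep.
    # class_thresholds: one (K,) threshold vector per branch except the last.
    for depth, probs in enumerate(branch_probs[:-1]):
        k = int(np.argmax(probs))
        if probs[k] >= class_thresholds[depth][k]:
            return k, depth  # confident enough: stop computing here
    final = branch_probs[-1]
    return int(np.argmax(final)), len(branch_probs) - 1

thresholds = [np.array([0.9, 0.5])]  # hypothetical per-class thresholds, branch 0
hard = early_exit_predict([np.array([0.6, 0.4]), np.array([0.1, 0.9])], thresholds)
easy = early_exit_predict([np.array([0.45, 0.55]), np.array([0.2, 0.8])], thresholds)
```

The sequential training idea follows directly from this loop: only samples that fail the exit test at branch d ever reach branch d+1, so training branch d+1 on that non-exited subset matches what it sees at inference time.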
[28] Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis
Jihyun Moon,Charmgil Hong
Main category: cs.CV
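TL;DR: This paper proposes a retrieval-augmented vision-language model (VLM) framework for multimodal melanoma diagnosis, improving diagnostic accuracy by incorporating semantically similar patient cases.
Details
Motivation: Existing convolutional neural networks (CNNs) for dermoscopic image analysis neglect clinical metadata and require extensive preprocessing, while general-domain vision-language models (VLMs) struggle to capture clinical specificity.
Contribution: Proposes a retrieval-augmented VLM framework that brings in semantically similar cases to achieve more accurate diagnosis and error correction without fine-tuning.
Method: Retrieves similar cases from a database and integrates them into the diagnostic prompt, directly guiding the model toward more reliable predictions.
Result: The method significantly improves classification accuracy and outperforms conventional baselines in error correction.
Insight: Retrieval-augmented prompting offers a robust strategy for clinical decision support, especially when domain-specific training data is lacking.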
Abstract: Accurate and early diagnosis of malignant melanoma is critical for improving patient outcomes. While convolutional neural networks (CNNs) have shown promise in dermoscopic image analysis, they often neglect clinical metadata and require extensive preprocessing. Vision-language models (VLMs) offer a multimodal alternative but struggle to capture clinical specificity when trained on general-domain data. To address this, we propose a retrieval-augmented VLM framework that incorporates semantically similar patient cases into the diagnostic prompt. Our method enables informed predictions without fine-tuning and significantly improves classification accuracy and error correction over conventional baselines. These results demonstrate that retrieval-augmented prompting provides a robust strategy for clinical decision support.
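A retrieval-augmented prompt of the kind described can be assembled in two steps: rank stored cases by embedding similarity, then prepend the top matches to the question. The sketch below assumes precomputed case embeddings and uses cosine similarity; the record fields, prompt wording, and embedding values are all hypothetical, not the authors' templates.

```python
import numpy as np

def retrieve(query_vec, case_vecs, k=2):
    # Rank stored cases by cosine similarity to the query embedding.
    q = query_vec / np.linalg.norm(query_vec)
    c = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k].tolist()

def build_prompt(query_meta, cases, indices):
    # Hypothetical prompt template: retrieved cases first, then the new case.
    lines = ["Similar reference cases:"]
    for i in indices:
        lines.append(f"- {cases[i]['meta']} -> {cases[i]['label']}")
    lines.append(f"New case: {query_meta}. Is the lesion melanoma? Answer yes or no.")
    return "\n".join(lines)

cases = [
    {"meta": "age 70, back", "label": "melanoma"},
    {"meta": "age 25, arm", "label": "benign"},
    {"meta": "age 65, back", "label": "melanoma"},
]
case_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])  # toy embeddings
top = retrieve(np.array([1.0, 0.05]), case_vecs)
prompt = build_prompt("age 68, back", cases, top)
```

Because the model weights are untouched, swapping the case database changes the model's context without any fine-tuning, which is the property the abstract emphasizes.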
[29] InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection
Zhongyu Xia,Hansong Yang,Yongtao Wang
Main category: cs.CV
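TL;DR: InsFusion proposes a new LiDAR-camera fusion method that extracts proposals from both raw and fused features and uses attention mechanisms to reduce error accumulation in 3D object detection.
Details
Motivation: In 3D object detection with multi-view cameras and LiDAR, noise and errors accumulate during feature extraction, perspective transformation, and feature fusion, which degrades detection performance.
Contribution: Proposes InsFusion, which extracts proposals from both raw and fused features and mitigates the impact of accumulated errors through attention mechanisms.
Method: Extracts proposals from raw and fused features, uses these proposals to query the raw features, and incorporates attention mechanisms.
Result: On the nuScenes dataset, InsFusion is compatible with various advanced baseline methods and achieves new state-of-the-art performance.
Insight: Directly querying raw features with attention can effectively reduce error accumulation during feature fusion and improve 3D object detection accuracy.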
Abstract: Three-dimensional object detection from multi-view cameras and LiDAR is a crucial component for autonomous driving and smart transportation. However, in the process of basic feature extraction, perspective transformation, and feature fusion, noise and errors gradually accumulate. To address this issue, we propose InsFusion, which extracts proposals from both raw and fused features and uses these proposals to query the raw features, thereby mitigating the impact of accumulated errors. Additionally, it applies attention mechanisms to the raw features, which further mitigates the impact of accumulated errors. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.
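Using proposals to query raw features is naturally expressed as cross-attention, with proposal embeddings as queries and raw feature vectors as keys and values. The sketch below is plain scaled dot-product attention without learned projections, intended only to show the data flow; the shapes and variable names are assumptions, and InsFusion's actual attention design is not detailed in the abstract.

```python
import numpy as np

def cross_attention(queries, keys, values):
    # queries: (P, D) proposal embeddings; keys: (N, D), values: (N, C) raw features.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over raw features
    return weights @ values, weights

rng = np.random.default_rng(0)
proposals = rng.normal(size=(2, 4))  # 2 proposals (hypothetical)
raw_keys = rng.normal(size=(5, 4))   # 5 raw feature locations
raw_vals = rng.normal(size=(5, 3))
refined, attn = cross_attention(proposals, raw_keys, raw_vals)
```

Each refined proposal is a convex combination of raw feature vectors, so it draws on features that have not passed through the lossy fusion stages.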
[30] Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
Xiao Li,Qi Chen,Xiulian Peng,Kai Yu,Xie Chen,Yan Lu
Main category: cs.CV
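TL;DR: This paper proposes a new self-supervised framework that disentangles video data into dynamic motion and static content, using low-bitrate vector quantization to promote the disentanglement.
Details
Motivation: Dynamic motion and static content in video are usually entangled, and traditional disentanglement methods rely on strong assumptions or inductive biases. This paper aims at a more general self-supervised framework that depends less on prior knowledge.
Contribution: 1. Proposes a transformer-based self-supervised framework that disentangles motion and content in video; 2. Introduces low-bitrate vector quantization as an information bottleneck to promote disentanglement; 3. Demonstrates the method's effectiveness on real-world videos (such as talking heads) and other video types (such as 2D cartoon characters).
Method: 1. Uses a transformer architecture to generate implicit features for frame-wise motion and clip-wise content; 2. Forms a discrete motion space through low-bitrate vector quantization; 3. Feeds the disentangled motion and content as conditional inputs to a denoising diffusion model to support self-supervised representation learning.
Result: The framework is validated on motion transfer and auto-regressive motion generation tasks and generalizes to multiple video types.
Insight: Controlling the bitrate promotes disentanglement more effectively, and self-supervised learning can still learn meaningful video representations without labeled data.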
Abstract: We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
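The link between vector quantization and bitrate can be made explicit with a small example. Nearest-neighbor quantization maps each latent to one of C codebook entries, so each token carries at most log2(C) bits; multiplying by the token rate gives an upper bound on the bitrate. The codebook, token rate, and function names below are hypothetical, and training details such as the straight-through estimator are omitted.

```python
import numpy as np

def vector_quantize(z, codebook):
    # z: (N, D) latents; codebook: (C, D). Returns nearest-code indices and vectors.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def bitrate_bits_per_second(codebook_size, tokens_per_second):
    # Upper bound: log2(C) bits per token times the number of tokens per second.
    return float(np.log2(codebook_size) * tokens_per_second)

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.9, 0.1], [0.2, 0.8]])
idx, z_q = vector_quantize(z, codebook)
rate = bitrate_bits_per_second(codebook_size=16, tokens_per_second=25)
```

Shrinking the codebook or the token rate tightens the bottleneck, which is the lever the abstract describes for keeping static content out of the motion code.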
[31] Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
Dubing Chen,Huan Zheng,Yucheng Zhou,Xianfei Li,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen
Main category: cs.CV
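TL;DR: This paper proposes a vision-based 3D occupancy prediction method grounded in semantic causality, designing a novel causal loss that provides holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline.
Details
Motivation: Existing methods rely on modular pipelines whose modules are optimized independently or use pre-configured inputs, leading to cascading errors. The paper addresses this by introducing a supervision mechanism based on semantic causality.
Contribution: 1. Proposes a causal loss that unifies the learning process and makes previously non-trainable components learnable; 2. Designs a semantic causality-aware 2D-to-3D transformation.
Method: The method comprises three components: Channel-Grouped Lifting, Learnable Camera Offsets, and Normalized Convolution.
Result: Achieves state-of-the-art performance on the Occ3D benchmark, with significantly improved robustness and 2D-to-3D semantic consistency.
Insight: Semantic causality offers a new supervision mechanism that can effectively address cascading errors in modular pipelines.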
Abstract: Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, demonstrating significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.
[32] VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring
Cuong Nguyen,Dung T. Tran,Hong Nguyen,Xuan-Vu Phan,Nam-Phong Nguyen
Main category: cs.CV
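TL;DR: The paper proposes a Vertical Residual Autoencoder (VRAE) for denoising and deblurring license plate images in traffic surveillance, significantly improving performance.
Details
Motivation: Under adverse weather, low light, or high-speed motion, license plate images in traffic surveillance are often degraded by noise and blur. Existing methods fall short in preserving information, so a more efficient solution is needed.
Contribution: Proposes the VRAE architecture, which introduces an input-aware auxiliary block to improve information preservation and significantly raise image restoration quality.
Method: Uses a vertical residual structure and an auxiliary block that injects input-dependent features at each encoding stage, improving the autoencoder's performance.
Result: Compared with conventional Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches, VRAE shows clear improvements in PSNR, NMSE, and SSIM with only a small increase in parameters.
Insight: An input-aware feature injection mechanism can effectively improve information preservation in autoencoders and suits image restoration tasks involving small targets.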
Abstract: In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images in a fast, real-time manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.
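Two of the reported metrics have compact standard definitions, shown below for reference: PSNR = 10 log10(MAX^2 / MSE) and NMSE = ||ref - est||^2 / ||ref||^2. The sketch assumes images scaled to [0, 1]; the paper's exact evaluation protocol (value range, color handling) is not stated in the abstract. SSIM is omitted because it requires windowed statistics.

```python
import numpy as np

def psnr(ref, est, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher is better.
    mse = np.mean((ref - est) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def nmse(ref, est):
    # Normalized mean squared error; lower is better.
    return float(np.sum((ref - est) ** 2) / np.sum(ref ** 2))

ref = np.full((4, 4), 0.5)  # toy clean image
est = ref + 0.1             # toy restoration with a constant error of 0.1
psnr_db = psnr(ref, est)    # MSE = 0.01, so about 20 dB
err = nmse(ref, est)        # 0.01 / 0.25, so about 0.04
```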
[33] Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking
Keisuke Toida,Taigo Sakai,Naoki Kato,Kazutoyo Yokota,Takeshi Nakamura,Kazuhiro Hotta
Main category: cs.CV
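TL;DR: The paper proposes SCFusion, a framework that improves multi-view feature fusion through a sparse transformation, density-aware weighting, and a multi-view consistency loss, raising multi-view detection and tracking performance.
Details
Motivation: In Multi-View Multi-Object Tracking (MVMOT), viewpoint changes, lighting differences, and occlusions often make object identities inconsistent across cameras. Existing methods gain robustness by projecting features into BEV space, but the projection introduces feature distortion and non-uniform density.
Contribution: 1. A sparse transformation that avoids unnatural interpolation during projection;
2. Density-aware weighting that fuses features adaptively;
3. A multi-view consistency loss that makes per-camera features more discriminative.
Method: Combines the sparse transformation, density-aware weighting, and the multi-view consistency loss to improve multi-view feature fusion in BEV space.
Result: Reaches an IDF1 of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the TrackTacular baseline.
Insight: Through sparsification and consistency constraints, SCFusion mitigates the limitations of BEV projection and offers a more robust solution for multi-view tracking.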
Abstract: Multi-View Multi-Object Tracking (MVMOT) is essential for applications such as surveillance, autonomous driving, and sports analytics. However, maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions, which often lead to tracking errors. Recent methods project features from multiple cameras into a unified Bird’s-Eye-View (BEV) space to improve robustness against occlusion. However, this projection introduces feature distortion and non-uniform density caused by variations in object scale with distance. These issues degrade the quality of the fused representation and reduce detection and tracking accuracy. To address these problems, we propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. First, it applies a sparse transformation to avoid unnatural interpolation during projection. Next, it performs density-aware weighting to adaptively fuse features based on spatial confidence and camera distance. Finally, it introduces a multi-view consistency loss that encourages each camera to learn discriminative features independently before fusion. Experiments show that SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. These results demonstrate that SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking.
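For readers unfamiliar with the IDF1 score quoted above, the sketch below shows the standard identity F1 formula computed from identity-level true positives, false positives, and false negatives. The counts in the example are hypothetical and are not taken from the paper; the identity matching that produces them is not shown.

```python
def idf1(idtp, idfp, idfn):
    """Identity F1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

# Hypothetical counts, for illustration only.
score = idf1(idtp=90, idfp=5, idfn=15)
print(f"IDF1 = {score:.3f}")
```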
[34] LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations
Payal Varshney,Adriano Lucieri,Christoph Balada,Sheraz Ahmed,Andreas Dengel
Main category: cs.CV
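TL;DR: LD-ViCE is a latent-diffusion framework for video counterfactual explanations that addresses the limited explainability of video AI systems. It lowers computational cost, improves semantic fidelity, and outperforms a recent state-of-the-art method on three datasets.
Details
Motivation: Video AI systems are increasingly used in safety-critical domains such as autonomous driving and healthcare and need better explainability. Current explanation methods fall short in temporal coherence, robustness, and causal insight.
Contribution: Proposes LD-ViCE, which uses a latent diffusion model to generate efficient, semantically faithful, and temporally coherent counterfactual explanations, raising the R2 score by up to 68% while reducing inference time.
Method: Generates explanations in latent space with a latent diffusion model, then applies a refinement step to make the counterfactuals more realistic and interpretable.
Result: Outperforms a recent state-of-the-art method on EchoNet-Dynamic, FERV39k, and Something-Something V2, and halves inference time.
Insight: Combining latent-space operation with a refinement step is key to the quality and efficiency of video counterfactual explanations.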
Abstract: Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.
[35] Beyond Distribution Shifts: Adaptive Hyperspectral Image Classification at Test Time
Xia Yue,Anfeng Liu,Ning Chen,Chenjia Huang,Hui Liu,Zhou Huang,Leyuan Fang
Main category: cs.CV
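TL;DR: The paper proposes HyperTTA, a framework that makes hyperspectral image (HSI) classification models more robust under diverse degradations. It builds a multi-degradation dataset, designs a spectral-spatial transformer classifier (SSTC), and introduces a lightweight test-time adaptation strategy (CELA) that adapts dynamically without source data or target annotations.
Details
Motivation: HSI classification models are highly sensitive to real-world degradations such as noise and blur, and existing methods struggle with diverse distribution shifts.
Contribution: 1) A multi-degradation hyperspectral dataset; 2) the SSTC classifier, which combines a multi-level receptive field mechanism with label smoothing regularization; 3) CELA, a lightweight test-time adaptation strategy.
Method: 1) Data: simulates nine degradation types; 2) SSTC: a spectral-spatial transformer that captures multi-scale spatial context; 3) CELA: a confidence-aware entropy-minimized LayerNorm adapter that updates only the affine parameters.
Result: On two benchmark datasets, HyperTTA outperforms existing baselines across a wide range of degradation scenarios.
Insight: A lightweight test-time adaptation strategy can adapt efficiently to changing degradation conditions without source data or target annotations.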
Abstract: Hyperspectral image (HSI) classification models are highly sensitive to distribution shifts caused by various real-world degradations such as noise, blur, compression, and atmospheric effects. To address this challenge, we propose HyperTTA, a unified framework designed to enhance model robustness under diverse degradation conditions. Specifically, we first construct a multi-degradation hyperspectral dataset that systematically simulates nine representative types of degradations, providing a comprehensive benchmark for robust classification evaluation. Based on this, we design a spectral-spatial transformer classifier (SSTC) enhanced with a multi-level receptive field mechanism and label smoothing regularization to jointly capture multi-scale spatial context and improve generalization. Furthermore, HyperTTA incorporates a lightweight test-time adaptation (TTA) strategy, the confidence-aware entropy-minimized LayerNorm adapter (CELA), which updates only the affine parameters of LayerNorm layers by minimizing prediction entropy on high-confidence unlabeled target samples. This confidence-aware adaptation prevents unreliable updates from noisy predictions, enabling robust and dynamic adaptation without access to source data or target annotations. Extensive experiments on two benchmark datasets demonstrate that HyperTTA outperforms existing baselines across a wide range of degradation scenarios, validating the effectiveness of both its classification backbone and the proposed TTA scheme. Code will be made available publicly.
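The CELA objective described above, entropy minimization restricted to high-confidence unlabeled samples, can be sketched as follows. This is a minimal illustration of the loss only, not the authors' implementation: the confidence threshold of 0.9 and the function names are assumptions, and the gradient step that would update the LayerNorm affine parameters is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confident_entropy_loss(batch_logits, conf_threshold=0.9):
    """Mean prediction entropy over samples whose top probability
    reaches conf_threshold (an assumed value); low-confidence samples
    are skipped so that noisy predictions do not drive the update."""
    kept = []
    for logits in batch_logits:
        probs = softmax(logits)
        if max(probs) >= conf_threshold:
            kept.append(entropy(probs))
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, len(kept)

# Hypothetical logits: one confident sample, one ambiguous sample.
batch = [[8.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
loss, n_kept = confident_entropy_loss(batch)
print(loss, n_kept)
```

In a full test-time adaptation loop, this scalar would be minimized with respect to the LayerNorm scale and shift parameters only, leaving all other weights frozen, as the abstract describes.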
[36] Spherical Brownian Bridge Diffusion Models for Conditional Cortical Thickness Forecasting
Ivan Stoyanov,Fabian Bongratz,Christian Wachinger
Main category: cs.CV
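TL;DR: The paper proposes the Spherical Brownian Bridge Diffusion Model (SBDM) for forecasting individualized cortical thickness (CTh) trajectories, addressing the challenges of non-Euclidean geometry and multi-modal data integration.
Details
Motivation: Accurate forecasting of high-resolution CTh trajectories is essential for detecting neurodegenerative change and intervening early, but the task is hard because of the cortex's complex geometry and the need to integrate multi-modal data.
Contribution: Proposes SBDM and the conditional spherical U-Net (CoS-UNet) denoising model, which combines spherical convolutions with dense cross-attention and significantly reduces prediction error.
Method: Uses a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories, with CoS-UNet integrating cortical surfaces and tabular conditions.
Result: On the ADNI and OASIS datasets, SBDM significantly outperforms previous approaches and can generate individual factual and counterfactual CTh trajectories.
Insight: SBDM offers a new framework for exploring hypothetical scenarios of cortical development, showing its potential for neuroscience research.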
Abstract: Accurate forecasting of individualized, high-resolution cortical thickness (CTh) trajectories is essential for detecting subtle cortical changes, providing invaluable insights into neurodegenerative processes and facilitating earlier and more precise intervention strategies. However, CTh forecasting is a challenging task due to the intricate non-Euclidean geometry of the cerebral cortex and the need to integrate multi-modal data for subject-specific predictions. To address these challenges, we introduce the Spherical Brownian Bridge Diffusion Model (SBDM). Specifically, we propose a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories at the vertex level of registered cortical surfaces. Our technical contribution includes a new denoising model, the conditional spherical U-Net (CoS-UNet), which combines spherical convolutions and dense cross-attention to integrate cortical surfaces and tabular conditions seamlessly. Compared to previous approaches, SBDM achieves significantly reduced prediction errors, as demonstrated by our experiments based on longitudinal datasets from the ADNI and OASIS. Additionally, we demonstrate SBDM’s ability to generate individual factual and counterfactual CTh trajectories, offering a novel framework for exploring hypothetical scenarios of cortical development.
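To make the term "Brownian bridge" concrete, the sketch below samples a generic Euclidean Brownian bridge that is pinned to a start value at time 0 and an end value at time T. It is only the textbook process applied independently per vertex, under assumed names and a made-up noise scale `sigma`; the spherical, bidirectional, and conditional parts of SBDM are not modeled here.

```python
import math
import random

def bridge_sample(x0, xT, t, T=1.0, sigma=1.0, rng=random):
    """Sample a Brownian bridge at time t, pinned to x0 at 0 and xT at T.

    Mean is the linear interpolation between endpoints; variance
    sigma^2 * t * (T - t) / T vanishes at both ends.
    """
    m = t / T
    std = sigma * math.sqrt(t * (T - t) / T)
    return [(1.0 - m) * a + m * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, xT)]

# Hypothetical per-vertex thickness values (mm), for illustration only.
baseline = [2.5, 2.8, 3.1]
follow_up = [2.3, 2.7, 2.6]
rng = random.Random(0)
midpoint = bridge_sample(baseline, follow_up, t=0.5, sigma=0.05, rng=rng)
print(midpoint)
```

The pinned endpoints are what distinguish a bridge from ordinary diffusion toward pure noise: both ends of the process are data, which is a natural fit for mapping a baseline scan to a follow-up scan.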
[37] First-order State Space Model for Lightweight Image Super-resolution
Yujie Zhu,Xinyi Zhang,Yekai Lu,Guang Yang,Faming Fang,Guixu Zhang
Main category: cs.CV
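TL;DR: The paper proposes the First-order State Space Model (FSSM) for lightweight image super-resolution. It applies a first-order hold condition and modifies the SSM module's computation, improving performance without adding parameters.
Details
Motivation: State space models (SSMs), particularly Mamba, have shown promise in NLP and are increasingly applied to vision, but most Mamba-based vision models focus on network architecture and scan paths and pay little attention to the SSM module itself. The authors explore the potential of that module for lightweight image super-resolution.
Contribution: Proposes FSSM, which improves the computation of the Mamba module by applying a first-order hold condition, raising super-resolution performance without increasing the number of parameters.
Method: 1) Modifies the SSM computation; 2) applies a first-order hold condition to derive a new discretized form; 3) analyzes the cumulative error.
Result: FSSM improves the performance of MambaIR on five benchmark datasets and surpasses current lightweight SR methods, achieving state-of-the-art results.
Insight: Improving the SSM module itself still has headroom in vision tasks; a finer discretization together with error analysis can improve performance.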
Abstract: State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module. To explore the potential of SSMs, we modify the calculation process of the SSM, without increasing the number of parameters, to improve performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyze the cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
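The difference between a zero-order hold (the input is held constant over each step) and a first-order hold (the input is linearly interpolated between samples) can be seen on a scalar linear system dx/dt = a*x + b*u. The sketch below uses the textbook closed-form updates for that scalar case; it is an illustration under that assumption and not the discretized form derived in the paper. Note that the first-order update mixes two adjacent inputs, which is one way to read the abstract's phrase "incorporating token correlations".

```python
import math

def zoh_step(x, u_k, a, b, dt):
    """Zero-order hold: input held at u_k over the whole step."""
    ea = math.exp(a * dt)
    return ea * x + b * (ea - 1.0) / a * u_k

def foh_step(x, u_k, u_next, a, b, dt):
    """First-order hold: input ramps linearly from u_k to u_next."""
    ea = math.exp(a * dt)
    g0 = (ea - 1.0) / a                      # integral of e^{a(dt - tau)}
    g1 = (ea - 1.0 - a * dt) / (a * a * dt)  # weight of the ramp component
    return ea * x + b * ((g0 - g1) * u_k + g1 * u_next)

def simulate(n_steps, dt, a=-1.0, b=1.0):
    """Drive both discretizations with the ramp input u(t) = t from x(0) = 0."""
    x_zoh = x_foh = 0.0
    for k in range(n_steps):
        u_k, u_next = k * dt, (k + 1) * dt
        x_zoh = zoh_step(x_zoh, u_k, a, b, dt)
        x_foh = foh_step(x_foh, u_k, u_next, a, b, dt)
    return x_zoh, x_foh

x_zoh, x_foh = simulate(n_steps=10, dt=0.1)
# For a = -1, b = 1, u(t) = t, x(0) = 0 the exact solution is
# x(t) = t - 1 + exp(-t), so x(1) = exp(-1).
exact = math.exp(-1.0)
print(abs(x_zoh - exact), abs(x_foh - exact))
```

For a piecewise-linear input the first-order hold is exact up to rounding, while the zero-order hold lags the input by about half a step; this gap is the kind of discretization error that the cumulative error analysis in the abstract is concerned with.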
[38] Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data
Yash Kumar Sharma,Vineet Nair,Wilson Naik
Main category: cs.CV
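TL;DR: The paper proposes a self-supervised learning method based on multi-view mutual information for feature learning on imbalanced datasets, reporting accuracy gains of 2% to 5% on long-tailed benchmarks.
Details
Motivation: Contrastive self-supervised learning (CSSL) works well on balanced datasets, but its robustness on imbalanced datasets is largely unexplored. Inspired by Yann LeCun's remark that the multi-view framework can be extended to more than two views, the paper uses mutual information to improve self-supervised learning on imbalanced data.
Contribution: 1. A theoretical framework based on multi-view mutual information that supports objectives with more than two views. 2. A loss function that helps extract representative features of tail classes by separating intra and inter discriminatory features. 3. Validation on a variety of self-supervised frameworks (contrastive and non-contrastive).
Method: 1. Uses multi-view mutual information to guide feature learning. 2. Designs a new loss function that filters out extreme features to learn better representations. 3. Evaluates contrastive and non-contrastive frameworks with the more-than-two-views objective.
Result: Improves accuracy on several imbalanced datasets: 2% on CIFAR-10-LT (ResNet-18), 5% on CIFAR-100-LT (ResNet-18), and 3% on ImageNet-LT (1k, ResNet-50), a new state of the art.
Insight: Multi-view mutual information is promising for imbalanced datasets because it helps extract tail-class features, and a loss that filters out extreme features further improves robustness and generalization.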
Abstract: The robustness of contrastive self-supervised learning (CSSL) for imbalanced datasets is largely unexplored. CSSL usually makes use of multi-view assumptions to learn discriminatory features via similar and dissimilar data samples. CSSL works well on balanced datasets, but does not generalize well for imbalanced datasets. In a very recent paper, as part of future work, Yann LeCun pointed out that the self-supervised multiview framework can be extended to cases involving more than two views. Taking a cue from this insight, we propose a theoretical justification based on the concept of mutual information to support the more than two views objective and apply it to the problem of dataset imbalance in self-supervised learning. The proposed method helps extract representative characteristics of the tail classes by segregating between intra and inter discriminatory characteristics. We introduce a loss function that helps us to learn better representations by filtering out extreme features. Experimental evaluation on a variety of self-supervised frameworks (both contrastive and non-contrastive) also proves that the more than two view objective works well for imbalanced datasets. We achieve a new state-of-the-art accuracy in self-supervised imbalanced dataset classification (2% improvement in CIFAR-10-LT using ResNet-18, 5% improvement in CIFAR-100-LT using ResNet-18, 3% improvement in ImageNet-LT (1k) using ResNet-50).
[39] Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation
Kaleem Ahmad
Main category: cs.CV
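TL;DR: The paper presents a prompt-driven image analysis pipeline built on multimodal generative AI. It combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description, and retains intermediate artifacts for transparent debugging.
Details
Motivation: Integrating multimodal AI and vision models can simplify a complex image analysis workflow and improve transparency and reliability in tasks such as object replacement, scene augmentation, and removal.
Contribution: A unified end-to-end workflow that performs detection, segmentation, inpainting, and description from a single prompt, offered through both an interactive UI and a scriptable CLI.
Method: Combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description, and reduces brittleness through threshold adjustments, mask inspection with light morphology, and resource-aware defaults.
Result: On a small set of single-word prompts, detection and segmentation produced usable masks in over 90% of cases with accuracy above 85%. On a high-end GPU, inpainting accounts for 60 to 75% of total runtime, so it needs careful tuning.
Insight: Transparent debugging and multimodal integration are key, and operational practices such as version pinning and seed control matter for reliability and consistency.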
Abstract: Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.
[40] A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models
Edwine Nabahirwa,Wei Song,Minghua Zhang,Yi Fang,Zhou Ni
Main category: cs.CV
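TL;DR: This review systematically analyzes the challenges and solutions of underwater object detection (UOD), covers the progression from traditional methods to large vision-language models (LVLMs), and proposes future research directions.
Details
Motivation: UOD is vital to marine applications, but the complexity of underwater environments means existing methods cannot fully address its challenges.
Contribution: 1. Categorizes UOD challenges into five key areas; 2. analyzes the progression from traditional methods to LVLMs; 3. shows the potential of LVLMs for UOD; 4. proposes future research directions.
Method: 1. Literature review and challenge categorization; 2. case studies (synthetic data generation with DALL-E 3 and fine-tuning the Florence-2 LVLM); 3. exploration of the multi-modal capabilities of LVLMs.
Result: 1. Current methods cannot fully handle dynamic underwater environments and image degradation; 2. synthetic data generation shows potential but needs refinement; 3. LVLMs hold promise for UOD, but real-time application needs further research.
Insight: 1. LVLMs may be key to the complex challenges of UOD; 2. generating and refining synthetic data is a priority for future work; 3. practical LVLM deployment needs further optimization.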
Abstract: Underwater object detection (UOD) is vital to diverse marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces numerous challenges that compromise its performance. Over the years, various methods have been proposed to address these issues, but they often fail to fully capture the complexities of underwater environments. This review systematically categorizes UOD challenges into five key areas: Image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations in detection methodologies. To address these challenges, we analyze the progression from traditional image processing and object detection techniques to modern approaches. Additionally, we explore the potential of large vision-language models (LVLMs) in UOD, leveraging their multi-modal capabilities demonstrated in other domains. We also present case studies, including synthetic dataset generation using DALL-E 3 and fine-tuning Florence-2 LVLM for UOD. This review identifies three key insights: (i) Current UOD methods are insufficient to fully address challenges like image degradation and small object detection in dynamic underwater environments. (ii) Synthetic data generation using LVLMs shows potential for augmenting datasets but requires further refinement to ensure realism and applicability. (iii) LVLMs hold significant promise for UOD, but their real-time application remains under-explored, requiring further research on optimization techniques.
[41] Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
Piyush Bagad,Andrew Zisserman
Main category: cs.CV
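TL;DR: The paper proposes a time-sensitive video representation learning method. It introduces the chiral action recognition task (distinguishing temporally opposite actions) and a self-supervised adaptation recipe that yields a compact, time-sensitive video embedding.
Details
Motivation: Existing video embeddings poorly distinguish temporally opposite actions (such as opening vs. closing a door), which occur frequently in everyday life and require understanding visual change over time.
Contribution: 1. Introduces the chiral action recognition task; 2. proposes a self-supervised adaptation recipe that injects time-sensitivity into a sequence of frozen image features; 3. designs an auto-encoder model based on perceptual straightening.
Method: Uses an auto-encoder whose latent space carries an inductive bias inspired by perceptual straightening to learn time-sensitive features.
Result: Across Something-Something, EPIC-Kitchens, and Charade, the method outperforms much larger video models pre-trained on large-scale video datasets and improves classification performance when combined with existing models.
Insight: Time-sensitive feature learning captures simple visual change in video more effectively and so improves performance on time-related tasks.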
Abstract: Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as “opening vs. closing a door”, “approaching vs. moving away from something”, “folding vs. unfolding paper”, etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count, ...), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time-aware video representations which offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder with a latent space with inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charade. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.
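Perceptual straightening refers to trajectories that bend less in representation space than in input space. One common way to quantify how straight a trajectory is uses the mean angle between consecutive displacement vectors, sketched below. The function name and the toy trajectories are assumptions for illustration, and this is a generic curvature measure rather than the loss used in the paper.

```python
import math

def mean_curvature(traj):
    """Mean angle (radians) between consecutive displacement vectors.

    traj is a list of points (lists of floats) ordered in time.
    Assumes consecutive points differ, so displacements are non-zero.
    0 means a perfectly straight trajectory.
    """
    diffs = [[b - a for a, b in zip(p, q)] for p, q in zip(traj, traj[1:])]
    angles = []
    for v, w in zip(diffs, diffs[1:]):
        nv = math.sqrt(sum(x * x for x in v))
        nw = math.sqrt(sum(x * x for x in w))
        cos = sum(x * y for x, y in zip(v, w)) / (nv * nw)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding
    return sum(angles) / len(angles)

# Toy latent trajectories, for illustration only.
straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(mean_curvature(straight), mean_curvature(bent))
```

In a straight latent trajectory the direction of travel encodes the arrow of time, which is the intuition behind wanting chiral pairs such as opening vs. closing to be linearly separable.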
[42] HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Liyang Chen,Tianxiang Ma,Jiawei Liu,Bingchuan Li,Zhuowei Chen,Lijie Liu,Xu He,Gen Li,Qian He,Zhiyong Wu
Main category: cs.CV
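TL;DR: HuMo is a unified human-centric video generation framework that coordinates multimodal inputs through two-stage training and task-specific strategies, and surpasses specialized methods on sub-tasks.
Details
Motivation: Existing methods struggle to coordinate multimodal inputs because high-quality training data with paired conditions is scarce and the sub-tasks are hard to combine.
Contribution: 1) A high-quality multimodal dataset; 2) a two-stage training paradigm; 3) task-specific strategies (minimal-invasive image injection and focus-by-predicting audio guidance); 4) a time-adaptive Classifier-Free Guidance strategy.
Method: Two-stage training: the first stage focuses on the subject preservation task, and the second progressively adds the audio-visual sync task. The model uses the minimal-invasive image injection strategy and the focus-by-predicting audio guidance.
Result: HuMo surpasses specialized state-of-the-art methods in sub-tasks and provides unified collaborative control over multimodal inputs.
Insight: 1) High-quality data is essential for multimodal tasks; 2) staged training and task-specific strategies improve performance; 3) dynamically adjusted guidance adds flexibility.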
Abstract: Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
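The time-adaptive Classifier-Free Guidance idea above, guidance weights that change across denoising steps, can be illustrated with a single condition and a simple schedule. Everything specific here is an assumption for illustration: the linear schedule, the weight values 7.5 and 2.0, and the function names are hypothetical, and HuMo's actual multi-condition formulation is not reproduced.

```python
def guidance_weight(step, num_steps, w_early=7.5, w_late=2.0):
    """Hypothetical linear schedule: strong guidance early, weaker late."""
    frac = step / max(num_steps - 1, 1)
    return (1.0 - frac) * w_early + frac * w_late

def cfg_combine(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance mix of two noise predictions."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions standing in for a denoiser's outputs.
eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
num_steps = 50
for step in (0, 25, 49):
    w = guidance_weight(step, num_steps)
    print(step, w, cfg_combine(eps_u, eps_c, w))
```

With several conditions (text, image, audio), each could get its own schedule, so that one modality dominates early structure while another refines late details; which modality leads when is a design decision the abstract does not spell out.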
[43] MESH – Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
Garry Yang,Zizhe Chen,Man Hon Wong,Haoyu Lei,Yongqiang Chen,Zhenguo Li,Kaiwen Zhou,James Cheng
Main category: cs.CV
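TL;DR: The paper proposes MESH, a benchmark that uses a question-answering framework to systematically evaluate hallucinations in Large Video Models (LVMs), revealing their limits on fine details and multi-action alignment.
Details
Motivation: Existing benchmarks rely on manual categorization of video content and neglect the process by which humans naturally perceive videos. MESH aims to fill this gap.
Contribution: Proposes the MESH benchmark, which uses a question-answering framework with target and trap instances to evaluate LVM hallucinations systematically.
Method: MESH follows a bottom-up approach, evaluating basic objects, subject features, and subject-action pairs to mirror human video understanding.
Result: LVMs recognize basic objects and features well but hallucinate more when handling fine details or aligning multiple actions.
Insight: MESH offers a comprehensive evaluation of LVMs that is closer to human understanding and exposes their limits in complex scenes.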
Abstract: Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.
[44] Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
Eric Slyman,Mehrab Tanjim,Kushal Kafle,Stefan Lee
Main category: cs.CV
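TL;DR: The paper proposes Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB), a method for calibrating multimodal large language models (MLLMs) used as judges of text-to-image generation systems, addressing their bias, overconfidence, and inconsistent performance.
Details
Motivation: MLLM judges of text-to-image generation systems suffer from bias, overconfidence, and inconsistent performance across image domains. Existing prompt ensembling methods work well in unimodal, text-only settings but generalize poorly to multimodal tasks.
Contribution: Proposes MMB, a multimodal-aware method that uses a Bayesian prompt ensemble with image clustering to assign prompt weights dynamically, improving the accuracy and calibration of judgments.
Method: MMB combines a Bayesian prompt ensemble with image clustering and adjusts prompt weights according to the visual characteristics of each sample, giving more reliable judgments on multimodal tasks.
Result: On two text-to-image benchmarks (HPSv2 and MJBench), MMB outperforms existing baselines in alignment with human annotations and in calibration.
Insight: Multimodal-specific calibration strategies are essential for reliable judging, and MMB offers a practical path toward large-scale text-to-image evaluation.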
Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these “judge” models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge’s true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
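The core mechanism described above, prompt weights that depend on which visual cluster an image falls into, can be sketched as a cluster lookup followed by a weighted average of per-prompt judgments. The centroids, weights, and probabilities below are made-up values, the function names are hypothetical, and the Bayesian procedure that would learn the weights is not shown.

```python
def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to an image embedding (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda c: sqdist(embedding, centroids[c]))

def ensemble_prob(prompt_probs, weights):
    """Weighted average of the per-prompt preference probabilities."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, prompt_probs)) / total

# Hypothetical setup: 2 image clusters, 3 judge prompts.
centroids = [[0.0, 0.0], [1.0, 1.0]]
cluster_weights = [[0.7, 0.2, 0.1],   # prompt weights for cluster 0
                   [0.1, 0.3, 0.6]]   # prompt weights for cluster 1

image_embedding = [0.9, 0.8]
prompt_probs = [0.9, 0.6, 0.4]  # each prompt's P(image A preferred)

c = nearest_cluster(image_embedding, centroids)
p = ensemble_prob(prompt_probs, cluster_weights[c])
print(c, p)
```

A uniform ensemble would give (0.9 + 0.6 + 0.4) / 3, about 0.63, for every image; conditioning the weights on the cluster lets the ensemble trust different prompts for different kinds of image content.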
[45] Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation
Wenjun Yu,Yinchen Zhou,Jia-Xuan Jiang,Shubin Zeng,Yuee Li,Zhong Wang
Main category: cs.CV
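TL;DR: The paper proposes a vision-language model based on semantic aggregation. An EM Aggregation mechanism and a Text-Guided Pixel Decoder address the semantic gap and feature dispersion in multimodal fusion for medical image segmentation, improving generalization.
Details
Motivation: Multimodal models perform well on natural image segmentation but underperform in the medical domain because of the semantic gap between abstract textual prompts and fine-grained medical visual features, and the resulting feature dispersion.
Contribution: 1. An EM Aggregation mechanism that clusters features dynamically to reduce dispersion; 2. a Text-Guided Pixel Decoder that uses textual knowledge to guide visual representations; 3. experiments showing the method outperforms existing SOTA approaches on several medical datasets.
Method: 1. The EM Aggregation mechanism strengthens cross-modal correspondence by dynamically clustering features into compact semantic centers; 2. the Text-Guided Pixel Decoder uses domain-invariant textual knowledge to guide deep visual representations.
Result: Experiments on public cardiac and fundus datasets show that the method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.
Insight: Semantic aggregation is an effective way to handle multimodal fusion in medical image segmentation, and text-guided visual representation learning improves generalization.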
Abstract: Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model’s generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.
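Clustering features into a few compact semantic centers with Expectation-Maximization can be sketched as alternating soft assignment and weighted averaging. The version below is a generic soft-clustering loop with dot-product similarity, under assumed names, initial centers, and toy 2-D features; it is not the paper's EM Aggregation module.

```python
import math

def em_aggregate(features, centers, n_iters=3, temperature=1.0):
    """Alternate E-steps (soft assignments) and M-steps (center updates)."""
    resp = []
    for _ in range(n_iters):
        # E-step: softmax over dot-product similarity to each center.
        resp = []
        for f in features:
            scores = [sum(a * b for a, b in zip(f, c)) / temperature
                      for c in centers]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            resp.append([e / z for e in exps])
        # M-step: each center becomes the responsibility-weighted mean.
        dim = len(features[0])
        new_centers = []
        for k in range(len(centers)):
            mass = sum(r[k] for r in resp)
            new_centers.append([
                sum(r[k] * f[d] for r, f in zip(resp, features)) / mass
                for d in range(dim)
            ])
        centers = new_centers
    return centers, resp

# Toy features forming two groups, for illustration only.
features = [[5.0, 0.0], [4.8, 0.1], [0.0, 5.0], [0.2, 4.9]]
centers, resp = em_aggregate(features, centers=[[1.0, 0.0], [0.0, 1.0]])
print(centers)
```

The resulting centers summarize many dispersed pixel features with a handful of vectors, which is the kind of compact target a text embedding can be matched against more easily than raw per-pixel features.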
[46] Improving Greenland Bed Topography Mapping with Uncertainty-Aware Graph Learning on Sparse Radar Data
Bayu Adhi Tama,Homayra Alam,Mostafa Cham,Omar Faruque,Jianwu Wang,Vandana Janeja
Main category: cs.CV
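TL;DR: The authors propose GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and models uncertainty with Monte Carlo dropout to map Greenland's subglacial bed more accurately. It combines spatial graphs with dynamically balanced regularization and reduces error by up to 60%.
Details
Motivation: Accurate maps of Greenland's bed topography are essential for sea-level projections, but radar observations are sparse and uneven, and existing methods struggle to capture the terrain.
Contribution: 1. The GraphTopoNet framework, which fuses heterogeneous supervision with uncertainty modeling; 2. a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization; 3. improved mapping accuracy on three Greenland subregions.
Method: Builds spatial graphs from surface observables (elevation, velocity, mass balance) and augments them with gradient features and polynomial trends to capture local variability and broad structure. Monte Carlo dropout models uncertainty explicitly, and the hybrid loss handles data gaps.
Result: On three Greenland subregions, GraphTopoNet reduces error by up to 60% relative to interpolation, convolutional, and graph-based baselines while preserving fine-scale glacial features.
Insight: Graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale, supporting climate forecasting and policy.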
Abstract: Accurate maps of Greenland’s subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. We introduce GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and explicitly models uncertainty via Monte Carlo dropout. Spatial graphs built from surface observables (elevation, velocity, mass balance) are augmented with gradient features and polynomial trends to capture both local variability and broad structure. To handle data gaps, we employ a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60 percent while preserving fine-scale glacial features. The resulting bed maps improve reliability for operational modeling, supporting agencies engaged in climate forecasting and policy. More broadly, GraphTopoNet shows how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale.
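Monte Carlo dropout, mentioned above as the uncertainty mechanism, keeps dropout active at prediction time and treats the spread of repeated stochastic forward passes as an uncertainty estimate. The sketch below applies the idea to a single linear unit; the weights, inputs, and dropout rate are hypothetical, and GraphTopoNet's actual graph network is not modeled.

```python
import random
import statistics

def dropout_forward(x, weights, p_drop, rng):
    """One stochastic pass of a linear unit with inverted dropout on inputs."""
    keep = 1.0 - p_drop
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += wi * xi / keep  # rescale so the expectation is unchanged
    return total

def mc_dropout_predict(x, weights, p_drop=0.2, n_samples=200, seed=0):
    """Mean prediction and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)

# Hypothetical node features and weights, for illustration only.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 1.0]
mean, std = mc_dropout_predict(x, w)
print(f"prediction = {mean:.3f} +/- {std:.3f}")
```

A per-location standard deviation of this kind could then serve as a confidence signal, for example to flag regions far from radar tracks where the predicted bed elevation is least trustworthy.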
[47] EfficientIML: Efficient High-Resolution Image Manipulation Localization
Jinhan Li,Haoyang He,Lei Xie,Jiangning Zhang
Main category: cs.CV
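TL;DR: The paper proposes EfficientIML, an efficient method for high-resolution image manipulation localization. A new dataset and the lightweight EfficientRWKV backbone address the computational limits of existing methods.
Details
Motivation: With high-resolution images and diffusion-based forgery methods becoming common, traditional manipulation detectors cannot handle the new forgery type and are computationally expensive.
Contribution: 1) SIF, a high-resolution dataset of 1200+ diffusion-generated manipulations; 2) a lightweight, three-stage EfficientRWKV backbone; 3) a multi-scale supervision strategy that improves performance.
Method: Uses the EfficientRWKV network, a hybrid of state-space and attention mechanisms that captures global context and local details in parallel, combined with multi-scale supervision.
Result: On the new dataset and standard benchmarks, the approach outperforms ViT-based and other lightweight baselines in localization performance, FLOPs, and inference speed.
Insight: Lightweight hybrid architectures are efficient and practical for high-resolution image processing and suit real-time forensic applications.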
Abstract: With imaging devices delivering ever-higher resolutions and the emerging diffusion-based forgery methods, current detectors trained only on traditional datasets (with splicing, copy-moving and object removal forgeries) lack exposure to this new manipulation type. To address this, we propose a novel high-resolution SIF dataset of 1200+ diffusion-generated manipulations with semantically extracted masks. However, this also imposes a challenge on existing methods, as they face significant computational resource constraints due to their prohibitive computational complexities. Therefore, we propose a novel EfficientIML model with a lightweight, three-stage EfficientRWKV backbone. EfficientRWKV’s hybrid state-space and attention network captures global context and local details in parallel, while a multi-scale supervision strategy enforces consistency across hierarchical predictions. Extensive evaluations on our dataset and standard benchmarks demonstrate that our approach outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs and inference speed, underscoring its suitability for real-time forensic applications.
[48] CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging
Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Xiangtong Yao,Quanmin Liang,Shahrooz Faghihroohi,Kai Huang,Nassir Navab,M. Ali Nasseri
Main category: cs.CV
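TL;DR: CLAPS is a unified auto-prompt segmentation method based on CLIP and SAM for multi-modal retinal imaging. It addresses modality ambiguity, reliance on manual prompts, and the lack of a unified framework in current methods.
Details
Motivation: Current retinal image segmentation methods face modality ambiguity, reliance on manual prompting, and the lack of a unified framework. CLAPS aims to solve these through automation and multi-modal unification.
Contribution: 1) Proposes CLIP-unified Auto-Prompt Segmentation (CLAPS) for unified segmentation across modalities and tasks; 2) uses CLIP pre-training to handle data scarcity and distribution imbalance; 3) uses GroundingDINO to generate spatial prompts automatically and combines them with text prompts that encode the modality.
Method: 1) Pre-trains a CLIP-based image encoder; 2) generates spatial bounding box prompts automatically with GroundingDINO; 3) unifies tasks through text prompts enhanced with a modality signature; 4) uses SAM for precise segmentation.
Result: Experiments on 12 datasets across 11 segmentation categories show that CLAPS performs on par with specialized expert models and surpasses existing benchmarks on most metrics, demonstrating broad generalizability.
Insight: Through automated prompting and multi-modal unification, CLAPS offers an efficient, general solution for medical image segmentation with the potential to serve as a foundation model.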
Abstract: Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique “modality signature” for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.
[49] AdsQA: Towards Advertisement Video Understanding
Xinwei Long,Kai Tian,Peng Xu,Guoli Jia,Jingxuan Li,Sa Yang,Yihua Shao,Kaiyan Zhang,Che Jiang,Hao Xu,Yang Liu,Jiaheng Ma,Bowen Zhou
Main category: cs.CV
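TL;DR: The paper introduces AdsQA, the first question-answering benchmark built on advertisement videos, to evaluate how well large language models understand ad videos. It also proposes ReAd-R, a model that generates answers through reward-driven optimization and performs best on the benchmark.
Details
Motivation: Advertisement videos are clue-rich and information-dense (marketing logic, persuasive strategies, audience engagement), which makes them a useful test bed for probing whether LLMs can understand more than the objective physical content of common visual domains.
Contribution: 1) The AdsQA benchmark, derived from 1,544 ad videos with 10,962 clips totaling 22.7 hours and providing 5 challenging tasks; 2) ReAd-R, a model that generates answers via reward-driven optimization; 3) a benchmark of 14 top-tier LLMs, on which ReAd-R performs best.
Method: Design question-answering tasks over ad videos to build the AdsQA benchmark; propose ReAd-R, which is trained with reinforcement learning (RL) using reward-driven optimization.
Result: ReAd-R achieves state-of-the-art results on AdsQA, clearly outperforming competitors equipped with long-chain reasoning capabilities.
Insight: Ad videos are a demanding platform for testing the multi-faceted understanding of LLMs, and reward-driven RL optimization improves model performance on them.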
Abstract: Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming push these general-purpose models to evolve continuously by learning deeper expertise. Now is thus the time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs to perceive beyond the objective physical content of common visual domains. Our motivation is to take full advantage of the traits of clue-rich and information-dense ad videos, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a DeepSeek-R1-style RL model that reflects on questions and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves state-of-the-art performance, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.
[50] UOPSL: Unpaired OCT Predilection Sites Learning for Fundus Image Diagnosis Augmentation
Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Xiangtong Yao,Quanmin Liang,Daniel Zapp,Kai Huang,Nassir Navab,M. Ali Nasseri
Main category: cs.CV
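TL;DR: The paper proposes UOPSL, a multimodal framework that learns lesion predilection sites from unpaired OCT and fundus images in order to improve disease diagnosis from fundus images alone.
Details
Motivation: Paired multimodal ophthalmic images are expensive to acquire, OCT data is scarce relative to fundus images (modality imbalance), and conventional methods struggle to capture fine-grained spatial information.
Contribution: The UOPSL framework, which uses contrastive learning and a predilection sites matrix to identify lesion sites dynamically and improves fundus-only disease diagnosis.
Method: 1) Train with contrastive learning on unpaired OCT and fundus images; 2) learn a predilection sites matrix in the OCT latent space; 3) apply that matrix to assist classification in downstream tasks that use only fundus images.
Result: Across 28 critical categories on 9 datasets, UOPSL outperforms existing benchmarks.
Insight: OCT-derived spatial priors strengthen fundus-based diagnosis and mitigate the modality imbalance problem, suggesting a new direction for multimodal medical image analysis.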
Abstract: Significant advancements in AI-driven multimodal medical image diagnosis have led to substantial improvements in ophthalmic disease identification in recent years. However, acquiring paired multimodal ophthalmic images remains prohibitively expensive. While fundus photography is simple and cost-effective, the limited availability of OCT data and inherent modality imbalance hinder further progress. Conventional approaches that rely solely on fundus or textual features often fail to capture fine-grained spatial information, as each imaging modality provides distinct cues about lesion predilection sites. In this study, we propose a novel unpaired multimodal framework, UOPSL, that utilizes extensive OCT-derived spatial priors to dynamically identify predilection sites, enhancing fundus image-based disease recognition. Our approach bridges unpaired fundus and OCT images via extended disease text descriptions. Initially, we employ contrastive learning on a large corpus of unpaired OCT and fundus images while simultaneously learning the predilection sites matrix in the OCT latent space. Through extensive optimization, this matrix captures lesion localization patterns within the OCT feature space. During the fine-tuning or inference phase of the downstream classification task based solely on fundus images, where paired OCT data is unavailable, we eliminate OCT input and utilize the predilection sites matrix to assist in fundus image classification learning. Extensive experiments conducted on 9 diverse datasets across 28 critical categories demonstrate that our framework outperforms existing benchmarks.
[51] LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation
Xuqin Wang,Tao Wu,Yanfeng Zhang,Lu Liu,Dong Wang,Mingwei Sun,Yongliang Wang,Niclas Zeller,Daniel Cremers
Main category: cs.CV
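TL;DR: LADB is a semi-supervised domain translation framework that uses partially paired data to align source and target distributions in a shared latent space. It combines a pretrained source-domain diffusion model with a target-domain Latent Aligned Diffusion Model (LADM).
Details
Motivation: Diffusion models perform poorly in data-scarce domains and often require exhaustive retraining or large amounts of paired data. LADB aims to address this with partially paired data, improving the efficiency and controllability of domain translation.
Contribution: The LADB framework, which integrates a pretrained diffusion model with a target-domain latent aligned diffusion model, translates between domains using partially paired data, and shows superior performance on tasks such as depth-to-image translation.
Method: LADB aligns source and target distributions in a shared latent space and couples a pretrained source-domain diffusion model with a target-domain LADM trained on partially paired latent representations, yielding semi-supervised domain translation.
Result: Experiments show superior performance on depth-to-image translation under partial supervision, and the method extends to multi-source and multi-target translation tasks.
Insight: Partially paired data is enough for effective and controllable domain translation, which makes LADB especially suitable where annotation is costly or incomplete.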
Abstract: Diffusion models excel at generating high-quality outputs but face challenges in data-scarce domains, where exhaustive retraining or costly paired data are often required. To address these limitations, we propose Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework for sample-to-sample translation that effectively bridges domain gaps using partially paired data. By aligning source and target distributions within a shared latent space, LADB seamlessly integrates pretrained source-domain diffusion models with a target-domain Latent Aligned Diffusion Model (LADM), trained on partially paired latent representations. This approach enables deterministic domain mapping without the need for full supervision. Compared to unpaired methods, which often lack controllability, and fully paired approaches that require large, domain-specific datasets, LADB strikes a balance between fidelity and diversity by leveraging a mixture of paired and unpaired latent-target couplings. Our experimental results demonstrate superior performance in depth-to-image translation under partial supervision. Furthermore, we extend LADB to handle multi-source translation (from depth maps and segmentation masks) and multi-target translation in a class-conditioned style transfer task, showcasing its versatility in handling diverse and heterogeneous use cases. Ultimately, we present LADB as a scalable and versatile solution for real-world domain translation, particularly in scenarios where data annotation is costly or incomplete.
[52] FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization
Sara Behnamian,Rasoul Khaksarinezhad,Andreas Langer
Main category: cs.CV
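TL;DR: FractalPINN-Flow is an unsupervised optical flow estimation framework inspired by fractal geometry. It combines a Fractal Deformation Network (FDN) with total variation (TV) regularization to estimate accurate flow on high-resolution data.
Details
Motivation: Conventional optical flow methods rely on labeled data and struggle with high resolution and large motion ranges. FractalPINN-Flow aims to overcome these limits through unsupervised learning and a fractal architecture.
Contribution: 1) The Fractal Deformation Network (FDN), a recursive encoder-decoder structure that captures motion patterns at multiple levels; 2) an unsupervised objective that combines TV regularization with a brightness constancy constraint.
Method: FDN uses nested encoder-decoder blocks and exploits fractal self-similarity to extract features; the training objective combines $L^1$/$L^2$ data fidelity terms with TV regularization to optimize the flow field.
Result: Experiments show that FractalPINN-Flow performs well on high-resolution data, preserves edges, and suits scenarios with limited annotations.
Insight: The recursive fractal design captures the multi-scale character of optical flow, and TV regularization promotes smooth, coherent flow fields in the unsupervised setting.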
Abstract: We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN) - a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines $L^1$ and $L^2$ data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. The model is especially effective for high-resolution data and scenarios with limited annotations.
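The variational objective described in the abstract can be written out as a worked formula. The form below is a sketch under stated assumptions, not the paper's exact energy: the weights $\alpha$, $\beta$, $\lambda$, the frame symbols $I_1$, $I_2$, the flow field $u$, and the image domain $\Omega$ are notation introduced here for illustration.

```latex
E(u) \;=\;
\underbrace{\alpha \int_{\Omega} \bigl| I_2(x + u(x)) - I_1(x) \bigr| \, dx}_{L^1 \text{ data fidelity}}
\;+\;
\underbrace{\frac{\beta}{2} \int_{\Omega} \bigl| I_2(x + u(x)) - I_1(x) \bigr|^2 \, dx}_{L^2 \text{ data fidelity}}
\;+\;
\underbrace{\lambda \int_{\Omega} \bigl| \nabla u(x) \bigr| \, dx}_{\text{total variation}}
```

Both data terms penalize violations of brightness constancy, $I_2(x + u(x)) \approx I_1(x)$. The TV term penalizes the magnitude of the flow gradient without squaring it, which is why TV-regularized flow fields tend to stay piecewise smooth while keeping sharp motion boundaries.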
[53] Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided Framework
Zhen Tian,Christos Anagnostopoulos,Qiyuan Wang,Zhiwei Gao
Main category: cs.CV
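TL;DR: The paper proposes a systematic robust enhancement framework (Robust U-Net) based on HSV color space supervision and multi-modal constraints to improve the accuracy and stability of coastal water segmentation.
Details
Motivation: Coastal water segmentation in satellite imagery is difficult because of complex spectral characteristics and irregular boundary patterns. Traditional RGB-based methods train unstably and generalize poorly across diverse maritime environments.
Contribution: 1) An HSV color space supervision framework; 2) integration of five components, including gradient-based optimization and morphological post-processing; 3) demonstrated gains in training stability and segmentation quality.
Method: The framework combines HSV-guided supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control.
Result: Ablations show that HSV supervision has the largest impact (influence score 0.85), and the complete framework improves training stability (84% variance reduction) and segmentation quality.
Insight: HSV color space supervision copes better with complex illumination and spectral variation, and the multi-modal constraints improve segmentation robustness and generalization.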
Abstract: Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.
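The abstract does not say how HSV supervision is computed, so the following is only a minimal sketch of one plausible ingredient: converting RGB pixels to HSV and deriving a coarse water-like pseudo-mask from hue and saturation. The threshold values and the function names `is_water_like` and `hsv_pseudo_mask` are hypothetical and are not taken from the paper or its repository.

```python
import colorsys

# Hypothetical thresholds (assumptions, not from the paper):
# hue is in [0, 1); cyan-to-blue hues sit roughly between 0.45 and 0.75.
WATER_HUE_RANGE = (0.45, 0.75)
MIN_SATURATION = 0.2


def is_water_like(rgb):
    """Return True if an 8-bit RGB pixel has a blue-ish, saturated HSV colour."""
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    in_hue_range = WATER_HUE_RANGE[0] <= hue <= WATER_HUE_RANGE[1]
    return in_hue_range and saturation >= MIN_SATURATION


def hsv_pseudo_mask(image):
    """Map a nested list of RGB pixels to a 0/1 mask of water-like pixels."""
    return [[1 if is_water_like(pixel) else 0 for pixel in row] for row in image]


# Toy 1x2 image: one blue-ish pixel and one sand-coloured pixel.
toy_image = [[(20, 80, 200), (210, 180, 140)]]
toy_mask = hsv_pseudo_mask(toy_image)
```

A mask like this could serve as an auxiliary colour-based signal next to the RGB segmentation loss. Hue is less sensitive to brightness changes than raw RGB values, which is the usual argument for working in HSV under varying illumination.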
[54] Computational Imaging for Enhanced Computer Vision
Humera Shaikh,Kaur Jashanpreet
Main category: cs.CV
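TL;DR: This paper surveys computational imaging (CI) techniques and their impact on computer vision (CV) applications, examining how CI improves image acquisition and reconstruction to handle difficult conditions such as low light and motion blur.
Details
Motivation: Conventional imaging struggles to deliver high-fidelity visual data in challenging conditions such as low light and motion blur, which limits the performance of CV systems. CI techniques address this by improving image acquisition and reconstruction.
Contribution: A systematic survey of CI techniques and their applications, an analysis of the synergies between CI and core CV tasks such as object detection and depth estimation, and a case for task-specific, adaptive imaging pipelines.
Method: Through literature review and systematic analysis, the paper summarizes CI techniques (such as light field imaging, HDR imaging, and deblurring) and their effect on CV tasks.
Result: The survey highlights the potential of CI to improve the robustness and accuracy of CV tasks, particularly in real-world applications such as autonomous navigation, surveillance, augmented reality, and robotics.
Insight: Future directions include adaptive imaging pipelines that push CI further into complex real-world scenes.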
Abstract: This paper presents a comprehensive survey of computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications. Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art CV systems. Computational imaging techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. This survey systematically explores the synergies between CI techniques and core CV tasks, including object detection, depth estimation, optical flow, face recognition, and keypoint detection. By analyzing the relationships between CI methods and their practical contributions to CV applications, this work highlights emerging opportunities, challenges, and future research directions. We emphasize the potential for task-specific, adaptive imaging pipelines that improve robustness, accuracy, and efficiency in real-world scenarios, such as autonomous navigation, surveillance, augmented reality, and robotics.
[55] BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
Sike Xiang,Shuang Chen,Amir Atapour-Abarghouei
Main category: cs.CV
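TL;DR: BcQLM is a lightweight multimodal large language model (MLLM) framework for end-to-end visual question answering, centred on the compact BreezeCLIP vision-language encoder. With 1.2 billion parameters overall, it reports performance comparable to standard-size MLLMs at reduced computational cost.
Details
Motivation: Large-scale MLLM architectures are hard to deploy in resource-constrained environments, so lightweight, high-performance models are needed for real-world applications.
Contribution: A lightweight vision-language framework, BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model), built around the compact BreezeCLIP encoder, with a modular and extensible design.
Method: BreezeCLIP serves as a compact vision-language encoder within a Q-gated multimodal language model for end-to-end visual question answering; the title indicates that distillation and Q-gated cross-modal fusion are used.
Result: With only 1.2 billion parameters overall, the model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs on multiple datasets.
Insight: A compact, modular design offers a promising path toward deployable MLLMs under practical hardware constraints and can generalise to broader multimodal tasks.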
Abstract: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.
[56] ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
Lena Wild,Rafael Valencia,Patric Jensfelt
Main category: cs.CV
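TL;DR: ArgoTweak is a new dataset that fills a gap in HD map research, the lack of realistic map priors, and introduces a bijective mapping framework for accurately detecting and integrating map changes.
Details
Motivation: No public dataset contains the full triplet of prior maps, current maps, and sensor data. Existing methods therefore rely on synthetic priors, which introduce inconsistencies and a significant sim2real gap.
Contribution: 1) ArgoTweak, the first dataset to complete the triplet with realistic map priors; 2) a bijective mapping framework that expresses modifications as fine-grained atomic changes at the map element level; 3) a benchmark for explainable, prior-aided HD mapping.
Method: A bijective mapping framework breaks large-scale modifications down into fine-grained atomic changes, which ensures interpretability and preserves unchanged elements with high fidelity.
Result: Training on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors, and ablations confirm the effect of structured priors and detailed change annotations.
Insight: Structured priors and fine-grained atomic changes are key to self-updating HD maps, and realistic datasets matter for model performance.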
Abstract: Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://kth-rpl.github.io/ArgoTweak/.
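The idea of breaking a map update into atomic changes at the element level can be illustrated with a small diff over element IDs. This is a sketch under stated assumptions, not ArgoTweak's toolbox: the function name `atomic_changes`, the change labels, and the dictionary-based map representation are all hypothetical.

```python
def atomic_changes(prior_map, current_map):
    """List element-level changes between two maps keyed by element ID.

    Each ID appears at most once in each map, so every prior element is
    matched to exactly one current element or to nothing (and vice versa).
    """
    changes = []
    for element_id in sorted(prior_map.keys() | current_map.keys()):
        old = prior_map.get(element_id)
        new = current_map.get(element_id)
        if old is None:
            changes.append(("insert", element_id))
        elif new is None:
            changes.append(("delete", element_id))
        elif old != new:
            changes.append(("modify", element_id))
        # Identical elements produce no change record and are kept as-is.
    return changes


# Hypothetical toy maps: element ID -> polyline of (x, y) points.
prior = {
    "lane_1": [(0, 0), (10, 0)],
    "lane_2": [(0, 3), (10, 3)],
    "crosswalk_7": [(5, -1), (5, 4)],
}
current = {
    "lane_1": [(0, 0), (10, 0)],    # unchanged
    "lane_2": [(0, 3), (10, 3.5)],  # geometry shifted
    "lane_3": [(0, 6), (10, 6)],    # newly added
}
toy_changes = atomic_changes(prior, current)
```

Keeping one record per element makes each change individually inspectable, which matches the interpretability argument in the abstract. Unchanged elements never enter the change list.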
[57] Quantifying Accuracy of an Event-Based Star Tracker via Earth’s Rotation
Dennis Melamed,Connor Hashemi,Scott McCloskey
Main category: cs.CV
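TL;DR: The paper quantifies the accuracy of an event-based camera (EBC) star tracking system by using the Earth's rotation as ground truth, and argues for its usefulness in low-cost, low-latency star tracking.
Details
Motivation: Event cameras are promising for star tracking, but prior studies lacked accurate ground truth for real data. The Earth's highly regular rotation offers a new way to quantify accuracy.
Contribution: A method that uses the Earth's rotation as the reference, a measured accuracy for the EBC star tracking system (RMS across error of 18.47 arcseconds), and a demonstration of its practical value.
Method: An event camera is kept static and pointed through a ground-based telescope at the night sky, so that the only motion in the event stream is that induced by the Earth's rotation. Orientation estimates are then compared with the Earth orientation measured by the International Earth Rotation and Reference System (IERS).
Result: The event camera system achieves an RMS across error of 18.47 arcseconds, indicating its potential for star tracking.
Insight: Sparse data streams, high dynamic range, and low energy consumption make event cameras well suited to low-cost, low-latency star tracking.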
Abstract: Event-based cameras (EBCs) are a promising new technology for star tracking-based attitude determination, but prior studies have struggled to determine accurate ground truth for real data. We analyze the accuracy of an EBC star tracking system utilizing the Earth’s motion as the ground truth for comparison. The Earth rotates in a regular way with very small irregularities which are measured to the level of milli-arcseconds. By keeping an event camera static and pointing it through a ground-based telescope at the night sky, we create a system where the only camera motion in the celestial reference frame is that induced by the Earth’s rotation. The resulting event stream is processed to generate estimates of orientation which we compare to the International Earth Rotation and Reference System (IERS) measured orientation of the Earth. The event camera system is able to achieve a root mean squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds. Combined with the other benefits of event cameras over framing sensors (reduced computation due to sparser data streams, higher dynamic range, lower energy consumption, faster update rates), this level of accuracy suggests the utility of event cameras for low-cost and low-latency star tracking. We provide all code and data used to generate our results: https://gitlab.kitware.com/nest-public/telescope_accuracy_quantification.
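To put the reported error in context, it helps to compute how fast the Earth's rotation sweeps the sky. The sketch below is back-of-the-envelope arithmetic, not the paper's evaluation code: it assumes a mean sidereal day of about 86,164.0905 seconds and ignores the small irregularities that IERS measures. The function names are illustrative.

```python
ARCSEC_PER_TURN = 360 * 60 * 60  # 1,296,000 arcseconds in a full rotation
SIDEREAL_DAY_S = 86164.0905      # approximate mean sidereal day (assumption)


def rotation_rate_arcsec_per_s():
    """Angular rate of the Earth's rotation relative to the stars."""
    return ARCSEC_PER_TURN / SIDEREAL_DAY_S


def equivalent_rotation_time_s(error_arcsec):
    """Seconds of Earth rotation that span the given angular error."""
    return error_arcsec / rotation_rate_arcsec_per_s()


rate = rotation_rate_arcsec_per_s()               # roughly 15.04 arcsec/s
across_time = equivalent_rotation_time_s(18.47)   # reported across error
```

At roughly 15 arcseconds per second, the reported 18.47 arcsecond error corresponds to a little over one second of Earth rotation. That gives an intuitive scale for the accuracy figure.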
[58] GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
Jenna Kang,Maria Silva,Patsorn Sangkloy,Kenneth Chen,Niall Williams,Qi Sun
Main category: cs.CV
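TL;DR: GeneVA is a large-scale, human-annotated dataset focused on spatio-temporal artifacts in text-generated videos, filling a gap left by existing benchmarks that concentrate on generated images.
Details
Motivation: Generative models have advanced to text-driven video generation, but the randomness of the generation process can produce unpredictable artifacts. Existing datasets focus mainly on static images and lack a systematic treatment of the spatio-temporal complexity of video.
Contribution: GeneVA, a large-scale dataset of annotated artifacts in generated videos that focuses on spatio-temporal artifacts and supports benchmarking models and improving generated video quality.
Method: Videos are generated from natural text prompts, and human annotations of spatio-temporal artifacts are collected to form a systematic benchmark.
Result: The GeneVA dataset provides a resource for evaluating generated video quality and improving models.
Insight: Spatio-temporal consistency is a core challenge in video generation, and GeneVA's annotations offer support for future research on it.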
Abstract: Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.
[59] RewardDance: Reward Scaling in Visual Generation
Jie Wu,Yu Gao,Zilyu Ye,Ming Li,Liang Li,Hanzhong Guo,Jie Liu,Zeyue Xue,Xiaoxia Hou,Wei Liu,Yan Zeng,Weilin Huang
Main category: cs.CV
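TL;DR: RewardDance is a scalable reward modeling framework that uses a generative reward paradigm to address the scaling problem of reward models (RMs) in visual generation and to resist reward hacking.
Details
Motivation: CLIP-based reward models are limited by their architecture and input modalities, and Bradley-Terry losses are misaligned with the next-token prediction mechanism of vision-language models (VLMs), which makes reward models hard to scale. Reward hacking during RLHF optimization further blocks quality improvements.
Contribution: 1) RewardDance, which aligns the reward objective with VLM architectures through a generative reward paradigm; 2) scaling along both the model and context dimensions; 3) resistance to reward hacking, relieving mode collapse.
Method: RewardDance reformulates the reward score as the model's probability of predicting a "yes" token, which indicates that the generated image outperforms a reference image under specific criteria. This aligns the reward objective intrinsically with the VLM architecture.
Result: RewardDance significantly surpasses existing methods in text-to-image, text-to-video, and image-to-video generation. Large-scale RMs maintain high reward variance during RL fine-tuning, resist reward hacking, and produce diverse, high-quality outputs.
Insight: The generative reward paradigm addresses the scaling problem of reward models in visual generation and points to a new way of avoiding mode collapse.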
Abstract: Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model’s probability of predicting a “yes” token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of “reward hacking”: Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.
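The difference between the two reward formulations in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RewardDance's implementation: the toy logits, the vocabulary index `yes_id`, and the function names are hypothetical, and a real VLM would produce the logits from the image pair and the instruction prompt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generative_reward(next_token_logits, yes_id):
    """Reward as the probability the model assigns to the 'yes' token."""
    return softmax(next_token_logits)[yes_id]


def bradley_terry_loss(score_preferred, score_rejected):
    """Pairwise loss on scalar scores, shown for contrast."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Toy vocabulary of three tokens; index 0 plays the role of "yes".
toy_reward = generative_reward([3.0, 1.0, 0.0], yes_id=0)
```

The generative reward reuses the model's ordinary next-token distribution, so no separate scalar regression head is needed. The Bradley-Terry loss instead trains a scalar score on preference pairs, which is the mismatch with next-token prediction that the abstract points to.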
[60] SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
David Stotko,Reinhard Klein
Main category: cs.CV
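TL;DR: The paper proposes SAFT, which reconstructs the 3D geometry and appearance of fabrics from a single monocular RGB video by combining physical simulation with differentiable rendering. Two new regularization terms address depth ambiguity in monocular video and reduce the 3D reconstruction error by a factor of 2.64, at a runtime of about 30 minutes per scene. The optimized motion is accurate enough to support appearance estimation and recover sharp details of the deforming fabric.
Details
Motivation: Reconstructing dynamic 3D scenes is a core challenge in computer vision. Existing monocular methods struggle with high-quality geometry reconstruction and appearance estimation for fabrics, particularly because of depth ambiguity. The authors therefore combine physical simulation with differentiable rendering to improve accuracy and realism.
Contribution: 1) SAFT, a system that combines 3D geometry reconstruction with appearance estimation; 2) two new regularization terms that address depth ambiguity in monocular video; 3) a 2.64-fold reduction in 3D reconstruction error compared with the most recent methods, at a moderate runtime; 4) evidence that the optimized motion supports high-quality appearance estimation and detail recovery from monocular video.
Method: 1) Simulate the cloth geometry physically; 2) optimize geometry and appearance with differentiable rendering; 3) add two regularization terms that improve plausibility by addressing depth ambiguity; 4) run 3D reconstruction and appearance estimation from a single monocular RGB video sequence.
Result: 1) The 3D reconstruction error is reduced by a factor of 2.64 compared with the most recent methods; 2) the runtime is about 30 minutes per scene; 3) the optimized motion supports appearance estimation that recovers sharp details of the deforming fabric.
Insight: 1) Combining physical simulation with differentiable rendering is an effective route to monocular reconstruction; 2) the design of the regularization terms is critical for reconstruction quality, especially under depth ambiguity; 3) geometry and appearance can be estimated jointly from a single monocular video.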
Abstract: The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a medium runtime of 30 min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.
cs.LG
[61] AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
Zhiheng Xi,Jixuan Huang,Chenyang Liao,Baodai Huang,Honglin Guo,Jiaqi Liu,Rui Zheng,Junjie Ye,Jiazheng Zhang,Wenxiang Chen,Wei He,Yiwen Ding,Guanyu Li,Zehui Chen,Zhengyin Du,Xuesong Yao,Yufei Xu,Jiecao Chen,Tao Gui,Zuxuan Wu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang
Main category: cs.LG
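TL;DR: The paper introduces AgentGym-RL, a framework for training LLM agents for long-horizon decision making through multi-turn reinforcement learning, together with ScalingInter-RL, a training approach that balances exploration and exploitation.
Details
Motivation: The community lacks a unified, interactive RL framework that can train agents for long-horizon decision making across diverse environments without relying on supervised fine-tuning.
Contribution: A modular, decoupled AgentGym-RL framework that supports mainstream RL algorithms, and the ScalingInter-RL training approach, which aims for stable optimization and diverse agent behavior.
Method: A modular architecture is combined with ScalingInter-RL, which restricts the number of interactions early in training (exploitation) and gradually enlarges the horizon (exploration), balancing stability and diversity.
Result: Across 27 tasks in diverse environments, the trained agents match or surpass commercial models.
Insight: Gradually expanding the interaction horizon makes agents less prone to collapse in long-horizon decision making.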
Abstract: Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch – without relying on supervised fine-tuning (SFT) – across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework – including code and datasets – to empower the research community in developing the next generation of intelligent agents.
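The ScalingInter-RL idea of starting with few interactions and enlarging the horizon over training can be sketched as a simple schedule. The abstract does not specify the schedule's shape, so the stepwise form below and every parameter value (`start`, `growth`, `every`, `cap`) are hypothetical choices for illustration only.

```python
def max_interaction_turns(step, start=5, growth=5, every=100, cap=20):
    """Maximum number of agent-environment turns allowed at a training step.

    Early steps use a short horizon (favouring exploitation); the horizon
    grows by `growth` turns every `every` steps until it reaches `cap`.
    """
    return min(cap, start + growth * (step // every))


# Horizon at a few points in a hypothetical training run.
schedule = [max_interaction_turns(s) for s in (0, 99, 100, 250, 1000)]
```

A rollout collector would consult this function before each episode and truncate interaction at the returned limit. Short early episodes keep the learning signal dense, and later episodes leave room for longer strategies.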
[62] Merge-of-Thought Distillation
Zhanming Shen,Zeyu Qin,Zenan Huang,Hao Chen,Jiaqi Hu,Yihong Zhuang,Guoshan Lu,Gang Chen,Junbo Zhao
Main category: cs.LG
TL;DR: Merge-of-Thought Distillation (MoT) is proposed to efficiently distill reasoning abilities from multiple teachers into a compact student model, outperforming single-teacher methods and naive multi-teacher unions with significant performance gains on math benchmarks.
Details
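Motivation: Current reasoning distillation methods usually assume a single oracle teacher, ignoring the fact that multiple candidate teachers and growing chain-of-thought (CoT) corpora are available in practice. A method is needed that unifies the reasoning abilities of several teachers while resolving conflicts among their supervision signals.
Contribution: The Merge-of-Thought Distillation (MoT) framework, which alternates between teacher-specific supervised fine-tuning and weight-space merging to distill multiple teachers into one student.
Method: MoT alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants, integrating the reasoning abilities of multiple teachers into the student. It needs only about 200 high-quality CoT samples.
Result: On competition math benchmarks, MoT outperforms the best single-teacher distillation and the naive multi-teacher union, and a Qwen3-14B student surpasses strong models including DEEPSEEK-R1 and QWEN3-30B-A3B. MoT also reduces catastrophic forgetting and improves reasoning beyond mathematics.
Insight: Consensus-filtered reasoning features transfer broadly, indicating that a lightweight framework can integrate the reasoning abilities of multiple teachers efficiently.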
Abstract: Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into a student while overcoming conflicts among the teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. In addition, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics, and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
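The alternation between teacher-specific fine-tuning branches and weight-space merging can be sketched with toy parameters. This is an illustration under stated assumptions, not the authors' code: uniform averaging is assumed as the merge rule, `toy_sft` is a stand-in that nudges weights toward a teacher-specific target instead of running real supervised fine-tuning, and all names are hypothetical.

```python
def merge_weights(variants):
    """Uniformly average parameter dicts (name -> list of floats)."""
    count = len(variants)
    return {
        name: [sum(column) / count for column in zip(*(v[name] for v in variants))]
        for name in variants[0]
    }


def toy_sft(student, teacher_target, lr=0.5):
    """Stand-in for SFT: move each weight a fraction of the way to a target."""
    return {
        name: [w + lr * (t - w) for w, t in zip(weights, teacher_target[name])]
        for name, weights in student.items()
    }


def mot_round(student, teacher_targets):
    """One round: branch per teacher from the same student, then merge."""
    branches = [toy_sft(student, target) for target in teacher_targets]
    return merge_weights(branches)


student = {"w": [0.0, 0.0]}
teachers = [{"w": [2.0, 0.0]}, {"w": [0.0, 4.0]}]
merged = mot_round(student, teachers)
```

Repeating `mot_round` with the merged weights as the next starting point gives the alternating loop described in the abstract. Every branch starts from the same merged student, so the variants stay close enough in weight space for averaging to be meaningful.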
[63] Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
Dikshant Sagar,Kaiwen Yu,Alejandro Yankelevich,Jianming Bian,Pierre Baldi
Main category: cs.LG
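TL;DR: The paper applies vision-language models (VLMs) to neutrino event classification in high-energy physics. Using a fine-tuned LLaMa 3.2 model, it finds that VLMs can outperform conventional convolutional neural networks (CNNs) in both performance and interpretability.
Details
Motivation: Large language models (LLMs) have shown strong ability to process multimodal data. The paper aims to use VLMs to classify neutrino events in high-energy physics experiments, improving both classification performance and interpretability.
Contribution: Evidence for the potential of VLMs in high-energy physics event classification, with experiments showing that VLMs can outperform conventional CNNs in performance and interpretability, opening a path for multimodal reasoning in experimental physics.
Method: A fine-tuned LLaMa 3.2 model serves as the VLM and is benchmarked against a CNN architecture similar to those used in the NOvA and DUNE experiments, evaluating classification performance and the interpretability of predictions.
Result: VLMs can outperform CNNs on the classification task while offering greater flexibility and interpretability and easier integration of auxiliary textual or semantic information.
Insight: Given their performance, interpretability, and generalizability, VLMs could serve as a general-purpose backbone for physics event classification and advance multimodal reasoning in experimental physics.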
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.
cs.SI
[64] Scaling Truth: The Confidence Paradox in AI Fact-Checking
Ihsan A. Qazi,Zohaib Khan,Abdullah Ghani,Agha A. Raza,Zafar A. Qazi,Wassay Sajjad,Ayesha Ali,Asher Javaid,Muhammad Abdullah Sohail,Abdul H. Azeemi
Main category: cs.SI
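TL;DR: The paper systematically evaluates nine large language models (LLMs) on multilingual fact-checking worldwide and finds an inverse relationship between model size and confidence that may widen information inequality.
Details
Motivation: The spread of misinformation calls for scalable, reliable fact-checking. LLMs show promise for automated fact verification, but their effectiveness across languages and global contexts remains uncertain.
Contribution: 1) A multilingual fact-checking benchmark of 5,000 claims across 47 languages; 2) evidence of a pattern resembling the Dunning-Kruger effect (smaller models are confident but less accurate, larger models are more accurate but less confident); 3) documented performance gaps for non-English languages and claims from the Global South, which may widen information inequality.
Method: 1) Evaluate nine LLMs of different sizes, architectures, and sources; 2) use 5,000 claims previously assessed by 174 professional fact-checking organizations as the benchmark; 3) test generalization on claims that postdate training cutoffs; 4) compare four prompting strategies that mirror citizen and professional fact-checker interactions; 5) validate against more than 240,000 human annotations.
Result: 1) Smaller, more accessible models show high confidence despite lower accuracy, while larger models show higher accuracy but lower confidence; 2) performance gaps are most pronounced for non-English languages and claims from the Global South; 3) because resource-constrained organizations typically use smaller models, this risks systemic bias in information verification.
Insight: 1) The imbalance between model size and confidence can affect the fairness of fact-checking; 2) policy and technical interventions are needed to ensure equitable access to AI-assisted fact-checking; 3) the work provides a multilingual benchmark for future research.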
Abstract: The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.
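The pattern the abstract describes, high confidence with lower accuracy versus lower confidence with higher accuracy, can be made concrete as a gap between mean confidence and accuracy. The sketch below uses made-up toy records rather than the paper's data, and the record format and function names are assumptions introduced for illustration.

```python
def accuracy(records):
    """Fraction of verdicts that match the ground-truth label."""
    return sum(1 for r in records if r["correct"]) / len(records)


def mean_confidence(records):
    """Average self-reported confidence, on a 0-1 scale."""
    return sum(r["confidence"] for r in records) / len(records)


def overconfidence_gap(records):
    """Positive when a model is more confident than it is accurate."""
    return mean_confidence(records) - accuracy(records)


# Made-up toy records for illustration only (not results from the paper).
small_model = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.9, "correct": False},
    {"confidence": 0.8, "correct": False},
    {"confidence": 1.0, "correct": True},
]
large_model = [
    {"confidence": 0.6, "correct": True},
    {"confidence": 0.7, "correct": True},
    {"confidence": 0.6, "correct": True},
    {"confidence": 0.5, "correct": False},
]
gap_small = overconfidence_gap(small_model)
gap_large = overconfidence_gap(large_model)
```

In this toy data the small model is overconfident (positive gap) and the large model is underconfident (negative gap), which is the shape of the paradox. A fuller analysis would bin predictions by confidence, as in expected calibration error, and break results down by language and region.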
cs.SD
[65] PianoVAM: A Multimodal Piano Performance Dataset
Yonghyun Kim,Junhyung Park,Joonhyung Bae,Kirak Kim,Taegyun Kwon,Alexander Lerch,Juhan Nam
Main category: cs.SD
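TL;DR: PianoVAM is a multimodal piano performance dataset comprising video, audio, MIDI, hand landmarks, fingering labels, and rich metadata, intended to support music information retrieval tasks.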
Details
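Motivation: The multimodal nature of music performance has driven growing interest within the MIR community in data beyond audio; PianoVAM aims to help fill this gap.
Contribution: Introduces PianoVAM, a comprehensive piano performance dataset, and describes how its hand landmarks and fingering labels were extracted.
Method: Data were recorded on a Disklavier piano; hand landmarks and fingering labels were extracted with a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm.
Result: Presents benchmark results for audio-only and audio-visual piano transcription and discusses potential applications.
Insight: Multimodal data may benefit music transcription and analysis, while the variability of realistic performance conditions poses new challenges for data quality.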
Abstract: The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
eess.IV
[66] STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery
David Robinson,Animesh Gupta,Rizwan Quershi,Qiushi Fu,Mubarak Shah
Main category: eess.IV
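TL;DR: The paper introduces StrokeVision-Bench, described as the first multimodal video and 2D pose dataset dedicated to stroke rehabilitation assessment, addressing gaps in existing datasets.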
Details
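Motivation: Stroke rehabilitation assessment currently relies largely on subjective observation and coarse scoring systems, which lack sensitivity to subtle motor improvements. Computer vision could enable objective, quantitative assessment, but existing datasets mostly cover activities of daily living and lack clinically structured tasks.
Contribution: Introduces StrokeVision-Bench, described as the first dataset dedicated to stroke patients performing clinically structured block transfer tasks. It comprises 1,000 annotated videos across four clinically meaningful action classes, each provided in two modalities: raw video frames and 2D skeletal keypoints.
Method: Videos of stroke patients performing block transfer tasks were collected, annotated, and grouped into action classes; video action recognition and skeleton-based action classification methods were then benchmarked on them.
Result: The paper establishes performance baselines for this domain to support research on automated stroke rehabilitation assessment.
Insight: The dataset addresses the lack of clinically structured task data and provides a resource for applying computer vision to stroke rehabilitation.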
Abstract: Despite advancements in rehabilitation protocols, clinical assessment of upper extremity (UE) function after stroke largely remains subjective, relying heavily on therapist observation and coarse scoring systems. This subjectivity limits the sensitivity of assessments to detect subtle motor improvements, which are critical for personalized rehabilitation planning. Recent progress in computer vision offers promising avenues for enabling objective, quantitative, and scalable assessment of UE motor function. Among standardized tests, the Box and Block Test (BBT) is widely utilized for measuring gross manual dexterity and tracking stroke recovery, providing a structured setting that lends itself well to computational analysis. However, existing datasets targeting stroke rehabilitation primarily focus on daily living activities and often fail to capture clinically structured assessments such as block transfer tasks. Furthermore, many available datasets include a mixture of healthy and stroke-affected individuals, limiting their specificity and clinical utility. To address these critical gaps, we introduce StrokeVision-Bench, the first-ever dedicated dataset of stroke patients performing clinically structured block transfer tasks. StrokeVision-Bench comprises 1,000 annotated videos categorized into four clinically meaningful action classes, with each sample represented in two modalities: raw video frames and 2D skeletal keypoints. We benchmark several state-of-the-art video action recognition and skeleton-based action classification methods to establish performance baselines for this domain and facilitate future research in automated stroke rehabilitation assessment.
[67] Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis
Ifrat Ikhtear Uddin,Longwei Wang,KC Santosh
Main category: eess.IV
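TL;DR: The paper proposes an expert-guided, explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) to improve both classification performance and interpretability in medical image diagnosis.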
Details
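Motivation: Medical image analysis is hampered by limited expert-annotated data, which hinders model generalization and clinical adoption. Existing methods have yet to balance performance and interpretability.
Contribution: 1. Integrates expert-provided ROIs into training, using Grad-CAM-based spatial attention supervision and a Dice-similarity explanation loss to align model attention with diagnostically relevant regions. 2. Jointly optimizes the explanation loss with a prototypical network objective so that the model focuses on clinically meaningful features under few-shot conditions.
Method: 1. Supervises the model's spatial attention via Grad-CAM. 2. Introduces a Dice-similarity explanation loss that is jointly optimized with the prototypical network loss. 3. Evaluates on the BraTS (MRI) and VinDr-CXR (chest X-ray) datasets.
Result: Accuracy improves from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared with non-guided models. Grad-CAM visualizations confirm that model attention aligns more closely with diagnostic regions.
Insight: Expert-guided attention supervision can bridge the gap between performance and interpretability in few-shot medical image diagnosis, improving model reliability and clinical trustworthiness.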
Abstract: Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
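The joint objective described in this entry can be sketched as a prototypical-network loss plus a Dice-based explanation loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention map is taken as a given input (the Grad-CAM computation is not shown), and the weighted-sum combination, the `weight` parameter, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def dice_loss(attention: Sequence[float], roi_mask: Sequence[float],
              eps: float = 1e-6) -> float:
    """1 - soft Dice overlap between a flattened attention map and ROI mask.

    Both inputs are expected to hold values in [0, 1].
    """
    intersection = sum(a * m for a, m in zip(attention, roi_mask))
    total = sum(attention) + sum(roi_mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def prototypical_loss(query: Sequence[float],
                      prototypes: List[Sequence[float]],
                      label: int) -> float:
    """Cross-entropy over a softmax of negative squared distances to prototypes."""
    logits = [-sum((q - p) ** 2 for q, p in zip(query, proto))
              for proto in prototypes]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[label]


def joint_loss(query: Sequence[float], prototypes: List[Sequence[float]],
               label: int, attention: Sequence[float],
               roi_mask: Sequence[float], weight: float = 1.0) -> float:
    """Assumed combination: classification loss + weight * explanation loss."""
    return (prototypical_loss(query, prototypes, label)
            + weight * dice_loss(attention, roi_mask))
```

When the attention map matches the ROI mask exactly, the explanation term vanishes and only the classification term remains; when the two do not overlap at all, the term approaches 1.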
[68] CardioComposer: Flexible and Compositional Anatomical Structure Generation with Disentangled Geometric Guidance
Karim Kadry,Shoaib Goraya,Ajay Manicka,Abdalla Abdelwahed,Farhad Nezami,Elazer Edelman
Main category: eess.IV
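TL;DR: CardioComposer is a programmable, compositional framework for guiding unconditional diffusion models to generate 3D human anatomy, using interpretable geometric primitives to control the generated structures while preserving anatomical realism.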
Details
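Motivation: Current generative models of anatomy face a trade-off between controllability and anatomical realism, which limits their usefulness for clinical research and medical device design.
Contribution: Proposes a programmable, compositional framework based on geometric moment losses that supports independent control over the size, shape, and position of anatomical structures, as well as the composition of multi-component constraints.
Method: Specific tissues are selected within multi-tissue segmentation maps, and geometric moment losses are applied to them to guide the reverse diffusion process.
Result: The framework is reported to generate anatomically realistic 3D structures while supporting flexible, compositional control.
Insight: Interpretable geometric primitives offer a new way to balance controllability and realism when generating anatomical structures.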
Abstract: Generative models of 3D anatomy, when integrated with biophysical simulators, enable the study of structure-function relationships for clinical research and medical device design. However, current models face a trade-off between controllability and anatomical realism. We propose a programmable and compositional framework for guiding unconditional diffusion models of human anatomy using interpretable ellipsoidal primitives embedded in 3D space. Our method involves the selection of certain tissues within multi-tissue segmentation maps, upon which we apply geometric moment losses to guide the reverse diffusion process. This framework supports the independent control over size, shape, and position, as well as the composition of multi-component constraints during inference.
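Size, position, and shape correspond naturally to the zeroth, first, and second geometric moments of a segmented region. The sketch below computes them for a hard voxel set and compares them with a target primitive. It is an illustration only: the paper applies moment losses during reverse diffusion (presumably on differentiable quantities), whereas the hard voxel set, the unweighted sum of terms, and the function names here are assumptions.

```python
from typing import List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def geometric_moments(voxels: List[Voxel]):
    """Return (volume, centroid, covariance) of a set of occupied voxels."""
    n = len(voxels)
    # First moment: mean coordinate along each axis.
    centroid = [sum(v[d] for v in voxels) / n for d in range(3)]
    # Second central moment: 3x3 covariance of voxel coordinates.
    cov = [[sum((v[i] - centroid[i]) * (v[j] - centroid[j]) for v in voxels) / n
            for j in range(3)] for i in range(3)]
    return n, centroid, cov


def moment_loss(voxels: List[Voxel], target_volume: float,
                target_centroid: Sequence[float],
                target_cov: Sequence[Sequence[float]]) -> float:
    """Squared mismatch in size, position, and shape (assumed unweighted sum)."""
    volume, centroid, cov = geometric_moments(voxels)
    size_term = (volume - target_volume) ** 2
    position_term = sum((c - t) ** 2 for c, t in zip(centroid, target_centroid))
    shape_term = sum((cov[i][j] - target_cov[i][j]) ** 2
                     for i in range(3) for j in range(3))
    return size_term + position_term + shape_term


# A 2x2x2 cube of voxels with corners at coordinates 0 and 1.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
volume, centroid, cov = geometric_moments(cube)
```

Keeping the three terms separate mirrors the independent control over size, position, and shape that the abstract mentions: each term can be switched on or off without touching the others.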
[69] RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts
Lauren H. Cooke,Matthias Jung,Jan M. Brendel,Nora M. Kerkovits,Borek Foldyna,Michael T. Lu,Vineet K. Raghu
Main category: eess.IV
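TL;DR: RoentMod is a synthetic chest X-ray modification model for identifying and correcting shortcut learning in image interpretation models. It generates realistic X-rays with user-specified pathology, and experiments show that training with these images improves model robustness and generalization.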
Details
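Motivation: Chest radiographs (CXRs) are among the most common medical tests. Deep learning models perform strongly on CXR interpretation but are prone to shortcut learning, relying on correlations that are not clinically relevant. RoentMod aims to address this problem.
Contribution: RoentMod is a counterfactual image editing framework that generates realistic CXRs with user-specified pathology; the paper demonstrates its effectiveness in correcting shortcut learning in multi-task and foundation models.
Method: RoentMod combines an open-source medical image generator (RoentGen) with an image-to-image modification model, producing CXRs with synthetic pathology without requiring retraining.
Result: In reader studies, RoentMod images appeared realistic in 93% of cases and correctly incorporated the specified finding in 89-99% of cases. Training with these images improved model discrimination by 3-19% AUC in internal validation and by 1-11% for 5 of 6 tested pathologies in external testing.
Insight: RoentMod offers a broadly applicable tool for medical AI that uses counterfactual interventions to improve model robustness and interpretability, with a strategy that may generalize to other medical imaging tasks.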
Abstract: Chest radiographs (CXRs) are among the most common tests in medicine. Automated image interpretation may reduce radiologists' workload and expand access to diagnostic expertise. Deep learning multi-task and foundation models have shown strong performance for CXR interpretation but are vulnerable to shortcut learning, where models rely on spurious and off-target correlations rather than clinically relevant features to make decisions. We introduce RoentMod, a counterfactual image editing framework that generates anatomically realistic CXRs with user-specified, synthetic pathology while preserving unrelated anatomical features of the original scan. RoentMod combines an open-source medical image generator (RoentGen) with an image-to-image modification model without requiring retraining. In reader studies with board-certified radiologists and radiology residents, RoentMod-produced images appeared realistic in 93% of cases, correctly incorporated the specified finding in 89-99% of cases, and preserved native anatomy comparable to real follow-up CXRs. Using RoentMod, we demonstrate that state-of-the-art multi-task and foundation models frequently exploit off-target pathology as shortcuts, limiting their specificity. Incorporating RoentMod-generated counterfactual images during training mitigated this vulnerability, improving model discrimination across multiple pathologies by 3-19% AUC in internal validation and by 1-11% for 5 out of 6 tested pathologies in external testing. These findings establish RoentMod as a broadly applicable tool for probing and correcting shortcut learning in medical AI. By enabling controlled counterfactual interventions, RoentMod enhances the robustness and interpretability of CXR interpretation models and provides a generalizable strategy for improving foundation models in medical imaging.
cs.RO
[70] Quadrotor Navigation using Reinforcement Learning with Privileged Information
Jonathan Lee,Abhishek Rathod,Kshitij Goel,John Stecklein,Wennie Tabib
Main category: cs.RO
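TL;DR: This paper proposes a reinforcement learning-based quadrotor navigation method that uses differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles, and that performs well in complex environments.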
Details
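Motivation: Existing learning-based methods handle scenes with narrow obstacles but struggle when large obstacles such as walls or terrain block the goal location. The paper therefore proposes a navigation method that exploits privileged information.
Contribution: 1) A reinforcement learning method that uses time-of-arrival (ToA) maps as privileged information; 2) a yaw alignment loss that guides the quadrotor around large obstacles; 3) validation of success rate and performance gains in simulation and in real-world flight tests.
Method: The policy is trained with reinforcement learning in a differentiable simulator, using ToA maps as privileged information and a yaw alignment loss to shape navigation behavior.
Result: The method achieves an 86% success rate in simulation, outperforming baseline strategies by 34%. In real-world deployment it completed 20 flights covering 589 meters without collisions at speeds up to 4 m/s.
Insight: Privileged information such as ToA maps offers a clear advantage in complex navigation tasks, and a yaw alignment loss helps the quadrotor route around large obstacles.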
Abstract: This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.
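To make the two ingredients concrete, the sketch below builds a time-of-arrival map on a small occupancy grid and defines a simple yaw alignment penalty. Both are assumptions for illustration: the abstract does not say how the ToA maps are computed or how the loss is defined, so the breadth-first search, the unit cell size, the `1 - cos` form, and the function names are all hypothetical stand-ins.

```python
import math
from collections import deque
from typing import List, Tuple


def time_of_arrival_map(grid: List[List[int]], goal: Tuple[int, int],
                        speed: float = 1.0) -> List[List[float]]:
    """Travel time from every free cell (0) to the goal via 4-connected BFS.

    Obstacle cells (1) and unreachable cells stay at infinity.
    Assumes unit-sized cells and a free goal cell.
    """
    rows, cols = len(grid), len(grid[0])
    toa = [[math.inf] * cols for _ in range(rows)]
    toa[goal[0]][goal[1]] = 0.0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and toa[nr][nc] == math.inf):
                toa[nr][nc] = toa[r][c] + 1.0 / speed
                queue.append((nr, nc))
    return toa


def yaw_alignment_loss(yaw: float, desired_heading: float) -> float:
    """0 when the yaw matches the desired heading, 2 when facing the opposite way."""
    return 1.0 - math.cos(yaw - desired_heading)


# A wall (1s) separates the start at (2, 0) from the goal at (2, 2).
grid = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]
toa = time_of_arrival_map(grid, goal=(2, 2))
```

In this toy grid the straight-line distance from start to goal is 2 cells, but the ToA value at the start is 6 because the route must go around the wall. That difference is the kind of signal that distinguishes a ToA map from a plain distance-to-goal cue.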
[71] Foundation Models for Autonomous Driving Perception: A Survey Through Core Capabilities
Rajendramayavan Sathyam,Yueqi Li
Main category: cs.RO
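TL;DR: This survey covers foundation models for autonomous driving perception and examines how they address core challenges in generalization, scalability, and robustness to distribution shift. It proposes a new taxonomy organized around four key capabilities and summarizes current approaches.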
Details
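Motivation: Autonomous driving perception is shifting from task-specific deep learning models to versatile foundation models trained on large, diverse datasets. How to integrate these models' capabilities to achieve robust performance in dynamic driving environments remains a major challenge.
Contribution: 1) Proposes a new taxonomy organized around four core capabilities; 2) summarizes current approaches and their design principles; 3) outlines future research directions, particularly challenges in real-time operation and reliability.
Method: Taking a capability-driven approach, the survey groups the core capabilities of autonomous driving perception into four categories: generalized knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning. For each, it reviews state-of-the-art methods in detail.
Result: Through this taxonomy and synthesis, the survey highlights both the potential and the limitations of foundation models for autonomous driving perception, especially shortcomings in real-time operation and reliability.
Insight: The survey emphasizes the value of a capability-driven research framework and calls for more attention to deployment challenges in dynamic environments, such as computational demands and hallucinations.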
Abstract: Foundation models are revolutionizing autonomous driving perception, transitioning the field from narrow, task-specific deep learning models to versatile, general-purpose architectures trained on vast, diverse datasets. This survey examines how these models address critical challenges in autonomous perception, including limitations in generalization, scalability, and robustness to distributional shifts. The survey introduces a novel taxonomy structured around four essential capabilities for robust performance in dynamic driving environments: generalized knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning. For each capability, the survey elucidates its significance and comprehensively reviews cutting-edge approaches. Diverging from traditional method-centric surveys, our unique framework prioritizes conceptual design principles, providing a capability-driven guide for model development and clearer insights into foundational aspects. We conclude by discussing key challenges, particularly those associated with the integration of these capabilities into real-time, scalable systems, and broader deployment challenges related to computational demands and ensuring model reliability against issues like hallucinations and out-of-distribution failures. The survey also outlines crucial future research directions to enable the safe and effective deployment of foundation models in autonomous driving systems.
[72] Good Deep Features to Track: Self-Supervised Feature Extraction and Tracking in Visual Odometry
Sai Puneeth Reddy Gottam,Haoming Zhang,Eivydas Keras
Main category: cs.RO
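TL;DR: This paper proposes a self-supervised learning approach to improve deep feature extraction and tracking, with the goal of improving visual odometry in challenging environments.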
Details
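Motivation: Visual localization often degrades in large-scale, outdoor, and long-term settings because of lighting changes, dynamic scenes, and low-texture areas. Learning-based methods such as SuperPoint and SuperGlue improve feature coverage and robustness but still generalize poorly to out-of-distribution data.
Contribution: The main contribution is to enhance deep feature extraction and tracking through self-supervised learning with task-specific feedback, yielding more stable and informative features and improving generalization and reliability in challenging environments.
Method: A self-supervised learning framework optimizes feature extraction and tracking, with task-specific feedback (for example, motion estimation error) directly guiding feature optimization.
Result: The method is reported to improve the stability of feature extraction and tracking, particularly under lighting changes and in low-texture areas.
Insight: Combining self-supervised learning with task feedback can improve feature generalization, offering a new direction for visual odometry in challenging environments.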
Abstract: Visual-based localization has made significant progress, yet its performance often drops in large-scale, outdoor, and long-term settings due to factors like lighting changes, dynamic scenes, and low-texture areas. These challenges degrade feature extraction and tracking, which are critical for accurate motion estimation. While learning-based methods such as SuperPoint and SuperGlue show improved feature coverage and robustness, they still face generalization issues with out-of-distribution data. We address this by enhancing deep feature extraction and tracking through self-supervised learning with task-specific feedback. Our method promotes stable and informative features, improving generalization and reliability in challenging environments.
[73] TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals
Stefan Podgorski,Sourav Garg,Mehdi Hosseinzadeh,Lachlan Mares,Feras Dayoub,Ian Reid
Main category: cs.RO
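TL;DR: TANGO is an RGB-only visual navigation method that combines global topological path planning with local trajectory control, enabling zero-shot, long-horizon navigation in open-set environments without 3D maps or pre-trained controllers.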
Details
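Motivation: Traditional visual navigation relies on globally consistent 3D maps or learned controllers, which can be computationally expensive and generalize poorly. TANGO addresses these issues through navigation toward object-level topological goals, providing an adaptable solution for open-set environments.
Contribution: 1) An RGB-only, zero-shot navigation pipeline; 2) integration of global topological planning with local trajectory control; 3) an auto-switching mechanism that improves robustness; 4) use of foundation models for open-set applicability.
Method: The method has two parts: 1) global topological path planning determines object-level sub-goals; 2) local metric control generates trajectories from monocular depth and traversability estimation, with an auto-switching mechanism that falls back to a baseline controller when necessary.
Result: In simulated and real-world tests, TANGO outperforms existing methods and shows robustness and adaptability in open-set environments. The source code is publicly available.
Insight: By relying on foundation models without domain-specific fine-tuning, TANGO avoids the cost of building 3D maps or training controllers, offering a new direction for visual navigation in open-set environments.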
Abstract: Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory using monocular depth and traversability estimation, and incorporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: https://github.com/podgorki/TANGO.
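The two-level structure described in this entry (global planning over object-level sub-goals, local control with a fallback) can be sketched as follows. This is not the TANGO implementation: the graph format, the use of Dijkstra's algorithm, the `traversable_fraction` signal, and the switching threshold are all assumptions chosen to illustrate the idea.

```python
import heapq
import math
from typing import Dict, List

# Hypothetical topological graph: object-level nodes with edge costs.
Graph = Dict[str, Dict[str, float]]


def plan_subgoals(graph: Graph, start: str, goal: str) -> List[str]:
    """Dijkstra's shortest path over the topological graph; [] if unreachable."""
    best = {start: 0.0}
    parent: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return []
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


def choose_controller(traversable_fraction: float, threshold: float = 0.3) -> str:
    """Assumed switching rule: fall back when too little of the view is traversable."""
    return "trajectory" if traversable_fraction >= threshold else "fallback"


graph: Graph = {
    "door": {"chair": 2.0, "table": 5.0},
    "chair": {"table": 1.0},
    "table": {},
}
route = plan_subgoals(graph, "door", "table")
```

Here the planner prefers the two-hop route through `chair` (total cost 3.0) over the direct edge (cost 5.0); each node on the route would then serve as a sub-goal for the local controller.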
[74] SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
Michael J. Munje,Chen Tang,Shuijing Liu,Zichao Hu,Yifeng Zhu,Jiaxun Cui,Garrett Warnell,Joydeep Biswas,Peter Stone
Main category: cs.RO
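TL;DR: This paper introduces SocialNav-SUB, a benchmark for evaluating the scene understanding of vision-language models (VLMs) in social robot navigation scenarios. Experiments show that current state-of-the-art VLMs still fall short in understanding complex social scenes.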
Details
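Motivation: Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. VLMs show promise, but their ability to understand complex social navigation scenes (for example, inferring spatial-temporal relations among agents and human intentions) has not been systematically evaluated.
Contribution: Introduces SocialNav-SUB, a unified VQA dataset and benchmark framework for evaluating VLM scene understanding in social robot navigation, filling a gap in existing research.
Method: SocialNav-SUB is built around Visual Question Answering (VQA) tasks and evaluates VLMs along three dimensions: spatial, spatiotemporal, and social reasoning.
Result: The best-performing VLM agrees with human answers at an encouraging rate, but VLMs still underperform a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding.
Insight: VLMs need further improvement before they can be applied to complex social scenes; SocialNav-SUB provides a basis for evaluating and improving them.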
Abstract: Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub.
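One plausible reading of "probability of agreeing with human answers" is the fraction of human annotators who gave the same answer as the model, averaged over questions. The sketch below implements that reading. The abstract does not define the metric, so this formula, the data layout, and the function names are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def agreement_probability(human_answers: Sequence[str], model_answer: str) -> float:
    """Fraction of human annotators whose answer matches the model's answer."""
    counts = Counter(human_answers)
    return counts[model_answer] / len(human_answers)


def mean_agreement(questions: List[Tuple[Sequence[str], str]]) -> float:
    """Average agreement probability over (human_answers, model_answer) pairs."""
    return sum(agreement_probability(humans, model)
               for humans, model in questions) / len(questions)


# Invented toy data: two VQA questions with four human annotations each.
questions = [
    (["left", "left", "right", "left"], "left"),  # model sides with the majority
    (["yes", "no", "no", "no"], "yes"),           # model sides with the minority
]
score = mean_agreement(questions)
```

The same function can score a rule-based baseline by passing its answers in place of the model's, which allows the kind of comparison against baselines that the abstract describes.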