Table of Contents
- cs.CL [Total: 34]
- cs.CV [Total: 115]
- eess.IV [Total: 4]
- cs.LG [Total: 5]
- cs.CY [Total: 1]
- cs.GR [Total: 1]
- cs.SD [Total: 1]
- cs.AI [Total: 10]
- cs.DB [Total: 1]
- cs.RO [Total: 7]
- cs.CR [Total: 3]
- astro-ph.IM [Total: 1]
cs.CL
[1] Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection
Jerry Li,Evangelos Papalexakis
Main category: cs.CL
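TL;DR: Proposes a new method built on an N-Gram frequency tensor: features extracted by tensor decomposition are used to detect hallucinations in LLM-generated text, significantly outperforming traditional baselines.
Details
Motivation: LLMs perform well on natural language tasks, but hallucinations limit the trustworthiness of what they generate. Existing metrics such as ROUGE lack the semantic depth needed, so more effective detection methods are required.
Contribution: A new feature-extraction method based on an N-Gram frequency tensor; combined with tensor decomposition and an MLP classifier, it significantly improves hallucination detection.
Method: Construct an N-Gram frequency tensor that captures co-occurrence patterns, apply tensor decomposition to extract singular values from each mode as features, and train an MLP binary classifier for hallucination detection.
Result: On the HaluEval dataset, the method significantly outperforms traditional baselines and is competitive with state-of-the-art LLM judges.
Insight: The N-Gram tensor captures richer semantic structure than flat overlap metrics and offers a new feature space for hallucination detection, showing that co-occurrence patterns carry useful signal.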
Abstract: Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language, however, a fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. Detecting hallucinations has quickly become an important topic, with various methods such as uncertainty estimation, LLM Judges, retrieval augmented generation (RAG), and consistency checks showing promise. Many of these methods build upon foundational metrics, such as ROUGE, BERTScore, or Perplexity, which often lack the semantic depth necessary to detect hallucinations effectively. In this work, we propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content. We demonstrate this by applying tensor decomposition methods to extract singular values from each mode and use these as input features to train a multi-layer perceptron (MLP) binary classifier for hallucinations. Our method is evaluated on the HaluEval dataset and demonstrates significant improvements over traditional baselines, as well as competitive performance against state-of-the-art LLM judges.
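The feature-extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it builds a sparse trigram frequency tensor, unfolds it along each mode, and keeps only the leading singular value per mode (found by power iteration), whereas the paper extracts singular values from each mode through tensor decomposition and feeds them to an MLP. The function names and the whitespace tokenization are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngram_tensor(tokens, n=3):
    """Sparse N-gram frequency tensor: maps each n-tuple of tokens to its count."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def unfold(tensor, mode):
    """Mode-`mode` unfolding as a sparse matrix {(row, column): count}."""
    matrix = defaultdict(float)
    for index, count in tensor.items():
        row = index[mode]
        column = index[:mode] + index[mode + 1:]
        matrix[(row, column)] += count
    return matrix


def top_singular_value(matrix, iterations=200):
    """Leading singular value of the unfolding via power iteration on A^T A."""
    v = {column: 1.0 for _, column in matrix}
    norm = 0.0
    for _ in range(iterations):
        u = defaultdict(float)
        for (row, column), value in matrix.items():
            u[row] += value * v[column]      # u = A v
        w = defaultdict(float)
        for (row, column), value in matrix.items():
            w[column] += value * u[row]      # w = A^T u
        norm = sqrt(sum(x * x for x in w.values()))
        if norm == 0.0:
            return 0.0
        v = {column: x / norm for column, x in w.items()}
    return sqrt(norm)  # ||A^T A v|| converges to the squared leading singular value


def mode_features(text, n=3):
    """One feature per tensor mode (assumption: whitespace tokens, lowercased)."""
    tensor = ngram_tensor(text.lower().split(), n)
    return [top_singular_value(unfold(tensor, mode)) for mode in range(n)]


features = mode_features("a b c a b c a b c")
print(features)
```

In a fuller pipeline, the per-mode singular values would form the input vector of a binary classifier; the classifier itself is omitted here.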
[2] A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs
Jiacheng Wei,Faguo Wu,Xiao Zhang
Main category: cs.CL
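TL;DR: Proposes SAGE, a lightweight framework that uses dynamic LoRA fine-tuning to let LLMs adapt at inference time.
Details
Motivation: LLMs cannot continuously adapt to and learn from new data during inference, which limits their flexibility in practice. The paper addresses this by decomposing complex reasoning tasks into atomic subtasks and applying adaptive updates through a dynamic fine-tuning framework.
Contribution: (1) SAGE, a trigger-guided dynamic fine-tuning framework; (2) a Trigger module that detects reasoning failures in real time; (3) an anomaly-sample handling mechanism based on streaming clustering and similarity-based merging; (4) LoRA-based dynamic parameter optimization with knowledge retention.
Method: (1) The Trigger module detects reasoning failures in real time using multiple metrics; (2) the Trigger Buffer module clusters anomaly samples in a streaming fashion with HDBSCAN, then applies stability checks and similarity-based merging; (3) the LoRA Store module dynamically optimizes parameter updates and retains knowledge through an adapter pool.
Result: Evaluation shows that SAGE achieves strong accuracy, robustness, and stability on the atomic reasoning subtask through dynamic knowledge updating at test time.
Insight: Combining dynamic fine-tuning with real-time adaptation can improve the flexibility and performance of LLMs at inference time.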
Abstract: Large language models are unable to continuously adapt and learn from new data during reasoning at inference time. To address this limitation, we propose that complex reasoning tasks be decomposed into atomic subtasks and introduce SAGE, a trigger-guided dynamic fine-tuning framework that enables adaptive updates during reasoning at inference time. SAGE consists of three key components: (1) a Trigger module that detects reasoning failures through multiple evaluation metrics in real time; (2) a Trigger Buffer module that clusters anomaly samples using a streaming clustering process with HDBSCAN, followed by stability checks and similarity-based merging; and (3) a Lora Store module that dynamically optimizes parameter updates with an adapter pool for knowledge retention. Evaluation results show that SAGE demonstrates excellent accuracy, robustness, and stability on the atomic reasoning subtask through dynamic knowledge updating during test time.
[3] Talk Isn’t Always Cheap: Understanding Failure Modes in Multi-Agent Debate
Andrea Wynn,Harsh Satija,Gillian Hadfield
Main category: cs.CL
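TL;DR: Examines failure modes in multi-agent debate: when agents differ in capability, debate can reduce accuracy, even when stronger models are in the majority.
Details
Motivation: To expose under-explored failure modes in multi-agent debate, focusing on how diversity in agent capability affects debate dynamics and outcomes.
Contribution: Shows that debate can reduce accuracy, particularly in heterogeneous groups of agents, and that agents are susceptible to flawed peer reasoning.
Method: Through a series of experiments, the study compares debate outcomes in homogeneous and heterogeneous agent groups and analyzes how models change their answers during debate.
Result: Even when stronger models are in the majority, debate can reduce accuracy over time; agents tend to favor agreement over challenging flawed reasoning.
Insight: Naive applications of debate may degrade performance when agents are neither incentivized nor equipped to resist persuasive but incorrect reasoning, so debate mechanisms need more careful design.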
Abstract: While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. The prior work has exclusively focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time – even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivized nor adequately equipped to resist persuasive but incorrect reasoning.
[4] No Translation Needed: Forecasting Quality from Fertility and Metadata
Jessica M. Lundin,Ada Zhang,David Adelani,Cody Carroll
Main category: cs.CL
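TL;DR: Shows that translation quality can be predicted with surprising accuracy without running the translation system, using only a handful of features (token fertility ratios, token counts, and basic linguistic metadata).
Details
Motivation: To test whether translation quality can be predicted from simple features without running translation, which would simplify evaluation and offer a new approach to quality estimation.
Contribution: A translation-free method for predicting translation quality; gradient boosting models reach R² = 0.66 to 0.72 across 203 languages, and the analysis identifies the factors that drive the predictions.
Method: Uses token fertility ratios, token counts, and language metadata (family, script, and region) as features, with gradient boosting models predicting ChrF scores.
Result: On the FLORES-200 benchmark, the models predict ChrF scores for GPT-4o translations with R² = 0.66 for XX→English and R² = 0.72 for English→XX. Feature-importance analysis shows that typological factors dominate for translation into English, while fertility matters more for translation into diverse target languages.
Insight: Translation quality is shaped by both token-level fertility and broader linguistic typology, which offers a new perspective for multilingual evaluation and quality estimation.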
Abstract: We show that translation quality can be predicted with surprising accuracy \textit{without ever running the translation system itself}. Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance ($R^{2}=0.66$ for XX$\rightarrow$English and $R^{2}=0.72$ for English$\rightarrow$XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.
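Token fertility, one of the features named above, is the number of tokens per word. The sketch below shows how such a feature could be computed. It is an illustration only: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, and the exact definition of the "fertility ratio" used in the paper is an assumption here (source fertility divided by target fertility).

```python
def fertility(text, tokenize):
    """Tokens per whitespace-separated word; 0.0 for empty text."""
    words = text.split()
    return len(tokenize(text)) / len(words) if words else 0.0


def toy_tokenize(text, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size character chunks."""
    return [
        word[i:i + max_len]
        for word in text.split()
        for i in range(0, len(word), max_len)
    ]


def fertility_ratio(source_text, target_text, tokenize):
    """Assumed definition: source fertility divided by target fertility."""
    target = fertility(target_text, tokenize)
    return fertility(source_text, tokenize) / target if target else None


short_words = "the cat sat"
long_words = "internationalization works"
print(fertility(short_words, toy_tokenize))  # 3 tokens over 3 words
print(fertility(long_words, toy_tokenize))   # 7 tokens over 2 words
```

With a real tokenizer passed in place of `toy_tokenize`, these values could be combined with token counts and language metadata as inputs to a regression model.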
[5] Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too
Logan Lawrence,Ashton Williamson,Alexander Shelton
Main category: cs.CL
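TL;DR: Proposes a direct-scoring method that uses synthetic summaries as pairwise machine rankings to evaluate natural language generation (NLG). It performs comparably to state-of-the-art pairwise evaluators on several benchmarks.
Details
Motivation: Pairwise-comparison methods evaluate machine-generated text well but cannot assign absolute scores to individual summaries, which matters for use cases that require thresholding. The paper proposes a direct-scoring method to remove this limitation.
Contribution: A direct-scoring method based on synthetic summaries that assigns absolute scores to samples while matching the performance of pairwise comparison.
Method: Synthetic summaries act as pairwise machine rankings at test time, and samples are evaluated by direct scoring.
Result: The method performs comparably to state-of-the-art pairwise evaluators in axis-averaged sample-level correlation on SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05).
Insight: Direct scoring combined with synthetic summaries can handle evaluation scenarios that need absolute scores while retaining evaluation accuracy.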
Abstract: As large-language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For \textit{sample-level} performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (\textbf{+0.03}), TopicalChat (\textbf{-0.03}), and HANNA (\textbf{+0.05}) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.
[6] From Staff Messages to Actionable Insights: A Multi-Stage LLM Classification Framework for Healthcare Analytics
Hajar Sakai,Yi-En Tseng,Mohammadsadegh Mikaeili,Joshua Bosire,Franziska Jovin
Main category: cs.CL
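TL;DR: Proposes a multi-stage LLM classification framework that extracts actionable insights from hospital call-center staff messages, evaluates several types of LLMs, and integrates the outputs into a visualization decision-support tool.
Details
Motivation: Hospital call centers generate large volumes of staff messages. Traditional supervised learning requires annotated data, extensive training, and model tuning, whereas LLMs offer a more computationally efficient approach.
Contribution: A multi-stage LLM framework that identifies staff message topics and classifies messages by reason while incorporating data security measures and HIPAA compliance requirements.
Method: A multi-stage LLM classification pipeline, evaluated with reasoning, general-purpose, and lightweight models.
Result: o3 performs best (78.4% weighted F1-score, 79.2% accuracy), followed closely by gpt-5 (75.3% weighted F1-score, 76.2% accuracy). The processed outputs are integrated into a visualization decision-support tool.
Insight: An LLM framework can process healthcare text efficiently, identify navigator training opportunities, and support improved patient experience while meeting compliance requirements.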
Abstract: Hospital call centers serve as the primary contact point for patients within a hospital system. They also generate substantial volumes of staff messages as navigators process patient requests and communicate with the hospital offices following the established protocol restrictions and guidelines. This continuously accumulated large amount of text data can be mined and processed to retrieve insights; however, traditional supervised learning approaches require annotated data, extensive training, and model tuning. Large Language Models (LLMs) offer a paradigm shift toward more computationally efficient methodologies for healthcare analytics. This paper presents a multi-stage LLM-based framework that identifies staff message topics and classifies messages by their reasons in a multi-class fashion. In the process, multiple LLM types, including reasoning, general-purpose, and lightweight models, were evaluated. The best-performing model was o3, achieving 78.4% weighted F1-score and 79.2% accuracy, followed closely by gpt-5 (75.3% Weighted F1-score and 76.2% accuracy). The proposed methodology incorporates data security measures and HIPAA compliance requirements essential for healthcare environments. The processed LLM outputs are integrated into a visualization decision support tool that transforms the staff messages into actionable insights accessible to healthcare professionals. This approach enables more efficient utilization of the collected staff messaging data, identifies navigator training opportunities, and supports improved patient experience and care quality.
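The two metrics reported above, weighted F1-score and accuracy, can be computed as in the following sketch. The support-weighted averaging shown is the common definition and is assumed to match the paper's; the message labels in the usage lines are hypothetical, not the study's categories.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support in y_true.

    Assumes y_true is non-empty and both lists have equal length.
    """
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    total = 0.0
    for label, count in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += count * f1
    return total / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for illustration only.
y_true = ["billing", "billing", "scheduling", "refill"]
y_pred = ["billing", "scheduling", "scheduling", "refill"]
print(weighted_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

Weighted F1 and accuracy can diverge on imbalanced classes, which is one reason both are reported.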
[7] The Token Tax: Systematic Bias in Multilingual Tokenization
Jessica M. Lundin,Ada Zhang,Nihal Karim,Hamza Louzan,Victor Wei,David Adelani,Cody Carroll
Main category: cs.CL
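TL;DR: Reveals systematic bias in multilingual tokenization: morphologically complex languages are tokenized inefficiently, which inflates compute and lowers accuracy. The paper recommends morphologically aware tokenization and fair pricing.
Details
Motivation: Unequal tokenization puts morphologically complex, low-resource languages at a disadvantage in large language models by raising compute cost and lowering accuracy. Addressing this is necessary for equitable NLP.
Contribution: (1) Shows that fertility (tokens per word) reliably predicts model accuracy; (2) shows that reasoning models outperform non-reasoning peers across high- and low-resource languages; (3) quantifies the effect of token inflation on training cost.
Method: Evaluates 10 LLMs on AfriMMLU (9,000 multiple-choice items, 5 subjects, 16 African languages) and analyzes the relationship between fertility and accuracy.
Result: Higher fertility consistently predicts lower accuracy across all models and subjects. Reasoning models (DeepSeek, o1) outperform non-reasoning models. A doubling in tokens quadruples training cost and time.
Insight: Tokenization bias significantly harms low-resource languages, which motivates morphologically aware tokenization, fair pricing, and stronger multilingual benchmarks.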
Abstract: Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high and low resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation to economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
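The cost claim above (doubling tokens quadruples training cost and time) corresponds to quadratic scaling in token count. A small sketch of that arithmetic follows; the default exponent of 2 is taken from the abstract's statement and is an assumption of this sketch rather than a measured constant, and the function name is hypothetical.

```python
def training_cost_multiplier(token_ratio, exponent=2.0):
    """Relative training cost when the token count grows by `token_ratio`.

    exponent=2.0 encodes "doubling tokens quadruples cost" (assumed quadratic scaling).
    """
    if token_ratio <= 0:
        raise ValueError("token_ratio must be positive")
    return token_ratio ** exponent


# A language whose text needs twice as many tokens as a reference language.
multiplier = training_cost_multiplier(2.0)
print(multiplier)
```

Under the same assumption, a language needing three times the tokens would pay roughly nine times the cost.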
[8] Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG)
Mansi Garg,Lee-Chi Wang,Bhavesh Ghanchi,Sanjana Dumpala,Shreyash Kakde,Yen Chih Chen
Main category: cs.CL
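TL;DR: Presents a biomedical literature question-answering system built on a Retrieval-Augmented Generation (RAG) architecture. By integrating multiple data sources with efficient retrieval and generation, it substantially improves the factual consistency and semantic relevance of answers.
Details
Motivation: Conventional health search engines fall short, and public access to biomedical research lags. The authors use a RAG architecture over diverse sources to provide more accurate, evidence-based medical information.
Contribution: (1) A biomedical literature Q&A system that integrates PubMed articles, curated Q&A datasets, and medical encyclopedias; (2) retrieval based on MiniLM semantic embeddings and FAISS vector search; (3) efficient generation with a Mistral-7B-v0.3 model fine-tuned using QLoRA.
Method: (1) Retrieval module: MiniLM-based semantic embeddings with FAISS vector search; (2) generation module: Mistral-7B-v0.3 fine-tuned with QLoRA; (3) a focused evaluation on breast cancer literature to test domain-aligned retrieval.
Result: Measured with BERTScore (F1), the system shows substantial improvements over baseline models in factual consistency and semantic relevance.
Insight: RAG-enhanced language models can bridge the gap between complex biomedical literature and accessible public health knowledge, and lay groundwork for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.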
Abstract: This work presents a Biomedical Literature Question Answering (Q&A) system based on a Retrieval-Augmented Generation (RAG) architecture, designed to improve access to accurate, evidence-based medical information. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research, the system integrates diverse sources, including PubMed articles, curated Q&A datasets, and medical encyclopedias, to retrieve relevant information and generate concise, context-aware responses. The retrieval pipeline uses MiniLM-based semantic embeddings and FAISS vector search, while answer generation is performed by a fine-tuned Mistral-7B-v0.3 language model optimized using QLoRA for efficient, low-resource training. The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature demonstrating the value of domain-aligned retrieval. Empirical results, measured using BERTScore (F1), show substantial improvements in factual consistency and semantic relevance compared to baseline models. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public health knowledge, paving the way for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.
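The retrieve-then-generate flow described above can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: it ranks toy vectors by brute-force cosine similarity where the system uses MiniLM embeddings with FAISS, and the prompt template and function names are hypothetical. The generation step itself is omitted.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, doc_vectors, k=2):
    """Indices of the k documents most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vectors)),
        key=lambda i: cosine(query, doc_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Hypothetical prompt template that places retrieved passages before the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy 2-D "embeddings" standing in for real sentence embeddings.
passages = ["passage A", "passage B", "passage C"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k([1.0, 0.1], vectors, k=2)
prompt = build_prompt("example question", [passages[i] for i in hits])
print(hits)
```

A vector index replaces the brute-force sort once the corpus is too large to scan on every query.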
[9] Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study
Serge Lionel Nikiema,Jordan Samhi,Micheline Bénédicte Moumoula,Albérick Euraste Djiré,Abdoul Kader Kaboré,Jacques Klein,Tegawendé F. Bissyandé
Main category: cs.CL
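TL;DR: Asks whether large language models truly understand concepts or merely recognize patterns, and proposes bidirectional reasoning as a test of understanding. The authors find that fine-tuning on forward tasks degrades bidirectional reasoning, and propose Contrastive Fine-Tuning (CFT) to address this.
Details
Motivation: The authors want to test whether LLMs have genuine understanding rather than relying on pattern recognition. Bidirectional reasoning is treated as a marker of understanding, yet models trained on forward tasks perform poorly in the reverse direction.
Contribution: Proposes bidirectional reasoning as a criterion for understanding and designs CFT, which achieves bidirectional reasoning without explicit reverse training.
Method: CFT trains on three types of examples (positive examples that preserve semantics, negative examples with different semantics, and forward-direction obfuscation examples) to avoid the cognitive specialization caused by forward-only training.
Result: Experiments show that CFT enables strong reverse performance while maintaining forward-task performance.
Insight: Bidirectional reasoning serves both as a framework for assessing understanding and as a practical training approach for more capable models.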
Abstract: This research addresses a fundamental question in AI: whether large language models truly understand concepts or simply recognize patterns. The authors propose bidirectional reasoning, the ability to apply transformations in both directions without being explicitly trained on the reverse direction, as a test for genuine understanding. They argue that true comprehension should naturally allow reversibility. For example, a model that can change a variable name like userIndex to i should also be able to infer that i represents a user index without reverse training. The researchers tested current language models and discovered what they term cognitive specialization: when models are fine-tuned on forward tasks, their performance on those tasks improves, but their ability to reason bidirectionally becomes significantly worse. To address this issue, they developed Contrastive Fine-Tuning (CFT), which trains models using three types of examples: positive examples that maintain semantic meaning, negative examples with different semantics, and forward-direction obfuscation examples. This approach aims to develop deeper understanding rather than surface-level pattern recognition and allows reverse capabilities to develop naturally without explicit reverse training. Their experiments demonstrated that CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities. The authors conclude that bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems.
[10] Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation
Hongyan Xie,Yitong Yao,Yikun Ban,Zixuan Huang,Deqing Wang,Zhenhe Wu,Haoxiang Su,Chao Wang,Shuangyong Song,Xuelong Li
Main category: cs.CL
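TL;DR: To keep small language models (SLMs) from learning spurious correlations when trained on LLM-generated Chain-of-Thought (CoT) data, the paper proposes Chain-of-Thought Correctness Perception Distillation (CoPeD), which improves reasoning quality through task setting and data utilization.
Details
Motivation: LLMs reason well but are expensive to deploy, so SLMs are commonly fine-tuned on LLM-generated CoT data to copy their abilities. This data may contain noisy rationales that fail to support the answer, which leads SLMs to learn spurious question-answer correlations and degrades reasoning quality.
Contribution: (1) A correctness-aware task setting that encourages the student model to predict answers from correct rationales and to revise incorrect ones. (2) A Correctness-Aware Weighted loss that dynamically adjusts the weight of each training instance so the model focuses on rationales that support the answer.
Method: (1) Task setting: the model predicts answers based on correct rationales and revises rationales when they are incorrect; (2) loss function: a dynamic weighting mechanism adjusts instance weights based on the combined loss of the rationale and the answer.
Result: Experiments show that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.
Insight: Careful task and loss design can reduce the spurious correlations SLMs learn and improve the faithfulness of their reasoning.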
Abstract: Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs’ abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.
[11] Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
Qiyuan Chen,Hongsen Huang,Qian Shao,Jiahe Chen,Jintai Chen,Hongxia Xu,Renjie Hua,Ren Chuan,Jian Wu
Main category: cs.CL
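TL;DR: Proposes Icon², which uses inherent regulation of an LLM's representation space to build self-synthetic preference data efficiently, addressing the distribution mismatch and high computational overhead of conventional methods. Experiments show significant gains in alignment and efficiency on several benchmarks.
Details
Motivation: Conventional preference-dataset construction relies on pre-collected instructions, which often causes distribution mismatch with the target model, and sampling multiple stochastic responses is computationally expensive. The paper explores a new paradigm that uses the LLM's inherent regulation to build preference datasets more efficiently.
Contribution: Icon² encodes human preferences in layer-wise direction vectors and applies bidirectional inherent control to generate response pairs with clear alignment distinctions, improving both alignment performance and computational efficiency.
Method: Extract layer-wise direction vectors that encode preferences, filter self-synthesized instructions by their inherent consistency, and apply bidirectional control during decoding to generate clearly distinguished response pairs.
Result: Llama3-8B and Qwen2-7B achieve an average win-rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.
Insight: The representation space of an LLM can be regulated directly to build preference data efficiently, reducing reliance on pre-collected instructions while improving alignment.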
Abstract: Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs’ representation space for efficient and tailored preference dataset construction, named Icon$^{2}$. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.
[12] Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning
Hong Su
Main category: cs.CL
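TL;DR: Proposes a technique that broadens method reuse in LLMs, removing the usual requirement that questions be highly similar. By separating the question from the solution and guiding the LLM to adapt the solution to new questions, it enables reuse across questions with low or hidden similarity.
Details
Motivation: Existing method reuse in LLMs requires highly similar questions, which limits its applicability. The paper extends method reuse to questions with low similarity or hidden similarity.
Contribution: (1) Separates the question from the solution and guides the LLM to adapt the solution to new questions; (2) extends method reuse to questions that share only partial features or hidden characteristics; (3) shows experimentally that the approach increases the probability of filtering out reusable solutions.
Method: (1) Separate the question and the solution; (2) guide the LLM to adapt the solution to new but related questions; (3) support reuse where questions share only partial features or hidden characteristics.
Result: Experimental verification shows that the approach increases the probability of filtering out reusable solutions and improves the effectiveness of cross-question method reuse.
Insight: Letting the LLM focus on solution transfer rather than question recognition makes it more flexible on low-similarity questions and offers a new direction for method reuse.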
Abstract: Large language models (LLMs) have been widely applied to assist in finding solutions for diverse questions. Prior work has proposed representing a method as a pair of a question and its corresponding solution, enabling method reuse. However, existing approaches typically require the questions to be highly similar. In this paper, we extend the scope of method reuse to address questions with low similarity or with hidden similarities that are not explicitly observable. For questions that are similar in a general-specific sense (i.e., broader or narrower in scope), we propose to first separate the question and solution, rather than directly feeding the pair to the LLM. The LLM is then guided to adapt the solution to new but related questions, allowing it to focus on solution transfer rather than question recognition. Furthermore, we extend this approach to cases where questions only share partial features or hidden characteristics. This enables cross-question method reuse beyond conventional similarity constraints. Experimental verification shows that our scope-extension approach increases the probability of filtering out reusable solutions, thereby improving the effectiveness of cross-question method reuse.
[13] A Survey of the State-of-the-Art in Conversational Question Answering Systems
Manoj Madushanka Perera,Adnan Mahmood,Kasun Eranda Wijethilake,Fahmida Islam,Maryam Tahermazandarani,Quan Z. Sheng
Main category: cs.CL
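TL;DR: A survey of recent progress in Conversational Question Answering (ConvQA) systems, covering core components, machine learning techniques, and the role of large language models, together with key datasets and future research directions.
Details
Motivation: ConvQA systems are increasingly important in Natural Language Processing (NLP), especially in multi-turn settings where coherence and relevance must be maintained. The paper summarizes current progress to guide future research.
Contribution: (1) A detailed analysis of the core components of ConvQA systems (history selection, question understanding, and answer prediction); (2) a review of reinforcement learning, contrastive learning, and transfer learning in ConvQA; (3) an assessment of the role of large language models such as GPT-4 and LLaMA 3; (4) a summary of key datasets and open research questions.
Method: A literature survey that systematically summarizes the ConvQA technical landscape, including the use of machine learning and large language models, and analyzes dataset characteristics.
Result: The paper provides a comprehensive overview of the ConvQA field, covering technical advances and their potential across application domains.
Insight: Large language models play a pivotal role in ConvQA, but coherence in multi-turn conversation and dataset diversity remain priorities for future research.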
Abstract: Conversational Question Answering (ConvQA) systems have emerged as a pivotal area within Natural Language Processing (NLP) by driving advancements that enable machines to engage in dynamic and context-aware conversations. These capabilities are increasingly being applied across various domains, i.e., customer support, education, legal, and healthcare where maintaining a coherent and relevant conversation is essential. Building on recent advancements, this survey provides a comprehensive analysis of the state-of-the-art in ConvQA. This survey begins by examining the core components of ConvQA systems, i.e., history selection, question understanding, and answer prediction, highlighting their interplay in ensuring coherence and relevance in multi-turn conversations. It further investigates the use of advanced machine learning techniques, including but not limited to, reinforcement learning, contrastive learning, and transfer learning to improve ConvQA accuracy and efficiency. The pivotal role of large language models, i.e., RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3, is also explored, thereby showcasing their impact through data scalability and architectural advancements. Additionally, this survey presents a comprehensive analysis of key ConvQA datasets and concludes by outlining open research directions. Overall, this work offers a comprehensive overview of the ConvQA landscape and provides valuable insights to guide future advancements in the field.
[14] Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models
Donya Rooein,Flor Miriam Plaza-del-Arco,Debora Nozza,Dirk Hovy
Main category: cs.CL
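TL;DR: Surveys data availability and quality for Farsi in three subjective tasks (sentiment analysis, emotion analysis, and toxicity detection). Despite overall growth in data, public datasets remain scarce and lack key demographic information, and evaluations show unstable model performance.
Details
Motivation: Farsi is considered a middle-resource language given its large speaker base and volume of digital text, but subjective tasks suffer from insufficient and low-quality data, which hinders NLP progress for the language.
Contribution: (1) Reviews 110 publications on subjective tasks in Farsi; (2) identifies the scarcity of public datasets and their lack of demographic information; (3) finds that model evaluation results are unstable.
Method: Combines a literature review with an analysis of existing datasets, and evaluates prediction models on the few datasets available.
Result: The available data is insufficient in quantity and quality for robust modeling, and results are highly unstable across both datasets and models.
Insight: Data volume alone does not improve a language's prospects in NLP; high-quality, diverse datasets annotated with demographic information are needed.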
Abstract: Given Farsi’s speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings indicate that the volume of data is insufficient to significantly improve a language’s prospects in NLP.
[15] Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification
Fernando Gabriela García,Qiyang Shi,Zilin Feng
Main category: cs.CL
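TL;DR: Proposes VeriFact-CoT, a multi-stage self-verification method that addresses hallucination and missing citations when LLMs generate fact-sensitive content, improving the accuracy and traceability of outputs.
Details
Motivation: LLMs tend to hallucinate and lack credible citation sources when generating complex, fact-sensitive content, which limits their use in high-trust settings such as scientific research, news reporting, and legal consultation.
Contribution: VeriFact-CoT, a multi-stage mechanism of fact verification, reflection, and citation integration that lets LLMs examine and revise their own reasoning, improving the accuracy, trustworthiness, and traceability of generated outputs.
Method: A multi-stage self-verification mechanism with fact verification, reflection, and citation integration steps, through which the LLM critically evaluates and revises its intermediate reasoning and final answers.
Result: The method significantly improves the objective accuracy and trustworthiness of LLM-generated content, making it more reliable for high-fidelity applications.
Insight: Multi-stage self-verification offers a new approach to LLM hallucination and highlights the value of self-reflection and external citations for model trustworthiness.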
Abstract: This research introduces VeriFact-CoT (Verified Factual Chain-of-Thought), a novel method designed to address the pervasive issues of hallucination and the absence of credible citation sources in Large Language Models (LLMs) when generating complex, fact-sensitive content. By incorporating a multi-stage mechanism of ‘fact verification-reflection-citation integration,’ VeriFact-CoT empowers LLMs to critically self-examine and revise their intermediate reasoning steps and final answers. This process significantly enhances the objective accuracy, trustworthiness, and traceability of the generated outputs, making LLMs more reliable for applications demanding high fidelity such as scientific research, news reporting, and legal consultation.
[16] LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization
Luis Felipe Chary,Miguel Arjona Ramirez
Main category: cs.CL
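TL;DR: LatinX is a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker's identity and is aligned with Direct Preference Optimization (DPO). It is trained in three stages and consistently reduces word error rate (WER) while improving speaker similarity.
Details
Motivation: Preserving the source speaker's identity in cross-lingual speech translation is difficult. LatinX aims to improve speech quality and speaker similarity through a multilingual TTS model aligned with DPO.
Contribution: A three-stage training recipe (pre-training, supervised fine-tuning, and DPO alignment) that reduces WER and improves speaker similarity on multilingual tasks, with emphasis on Portuguese.
Method: A 12-layer decoder-only Transformer trained in three stages: pre-training for text-to-audio mapping, supervised fine-tuning for zero-shot voice cloning, and DPO alignment using pairs labeled automatically from WER and speaker-similarity metrics.
Result: LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations indicate stronger perceived speaker similarity than a strong baseline (XTTSv2).
Insight: The results reveal gaps between objective and subjective measures; future directions include balanced preference signals and lower-latency architectures.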
Abstract: We present LatinX, a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker’s identity across languages. LatinX is a 12-layer decoder-only Transformer trained in three stages: (i) pre-training for text-to-audio mapping, (ii) supervised fine-tuning for zero-shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross-lingual analyses and discuss balanced preference signals and lower-latency architectures as future work.
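The DPO stage described above relies on automatically labeled preference pairs. The sketch below illustrates one way such labeling and the standard per-pair DPO loss could look. It is hedged on several points: it ranks candidates by WER only, whereas the paper also uses speaker-similarity metrics; it assumes candidates are text transcripts of synthesized audio; and all function names are hypothetical.

```python
from math import exp, log


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref) if ref else float(bool(hyp))


def preference_pair(reference, candidates):
    """Label the lowest-WER transcript as chosen and the highest-WER one as rejected."""
    ranked = sorted(candidates, key=lambda text: wer(reference, text))
    return ranked[0], ranked[-1]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return log(1.0 + exp(-beta * margin))


chosen, rejected = preference_pair("a b c", ["a x y", "a b c", "a b d"])
print(chosen, "|", rejected)
```

The loss falls below log 2 once the policy prefers the chosen sample more strongly than the reference model does.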
[17] ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula
ZiXuan Zhang,Bowen Hao,Yingjie Li,Hongzhi Yin
Main category: cs.CL
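TL;DR: Proposes ZhiFangDanTai, a framework that combines Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning to improve Traditional Chinese Medicine (TCM) formula generation, addressing the lack of detail and depth in existing models.
Details
Motivation: TCM formulas play a significant role in treating epidemics and complex diseases, but existing models do not produce complete formula compositions or detailed explanations, and existing datasets lack detail, which limits the depth of model outputs.
Contribution: (1) The ZhiFangDanTai framework, combining GraphRAG with LLM fine-tuning; (2) an enhanced instruction dataset; (3) theoretical proofs that integrating GraphRAG with fine-tuning can reduce generalization error and hallucination rates.
Method: Uses GraphRAG to retrieve and synthesize structured TCM knowledge, and fine-tunes the LLM on the enhanced instruction dataset to improve its ability to integrate retrieved information.
Result: On both collected and clinical datasets, ZhiFangDanTai significantly outperforms state-of-the-art models.
Insight: Combining retrieval-augmented generation with LLM fine-tuning improves the quality and depth of TCM formula generation and reduces hallucination.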
Abstract: Traditional Chinese Medicine (TCM) formulas play a significant role in treating epidemics and complex diseases. Existing models for TCM utilize traditional algorithms or deep learning techniques to analyze formula relationships, yet lack comprehensive results, such as complete formula compositions and detailed explanations. Although recent efforts have used TCM instruction datasets to fine-tune Large Language Models (LLMs) for explainable formula generation, existing datasets lack sufficient detail (such as the roles of the formula’s sovereign, minister, assistant, and courier; efficacy; contraindications; and tongue and pulse diagnosis), which limits the depth of model outputs. To address these challenges, we propose ZhiFangDanTai, a framework combining Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. ZhiFangDanTai uses GraphRAG to retrieve and synthesize structured TCM knowledge into concise summaries, while also constructing an enhanced instruction dataset to improve LLMs’ ability to integrate retrieved information. Furthermore, we provide novel theoretical proofs demonstrating that integrating GraphRAG with fine-tuning techniques can reduce generalization error and hallucination rates in the TCM formula task. Experimental results on both collected and clinical datasets demonstrate that ZhiFangDanTai achieves significant improvements over state-of-the-art models. Our model is open-sourced at https://huggingface.co/tczzx6/ZhiFangDanTai1.0.
[18] Let’s Roleplay: Examining LLM Alignment in Collaborative Dialogues
Abhijnan Nath,Carine Graff,Nikhil Krishnaswamy
Main category: cs.CL
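TL;DR: Studies how different alignment methods affect LLM performance in multiturn, multiparty collaborative dialogues and proposes a new roleplay-based evaluation framework. A friction-aware approach outperforms baselines in helping groups reach common ground and correct task outcomes.
Details
Motivation: As LLMs are increasingly treated as “collaborators” in human work, the reliability and alignment of their behavior become critical. Existing alignment methods are typically developed for simplified single-user settings and ignore the dynamics of multiturn, multiparty collaboration.
Contribution: (1) Examines the effect of different alignment methods in multiturn, multiparty collaboration; (2) proposes a roleplay evaluation methodology based on friction agents; (3) develops a new counterfactual evaluation framework that quantifies how friction interventions change the collaboration trajectory and belief alignment.
Method: Uses roleplay in which friction agents intervene in multiturn collaborative dialogues to prompt participants to reflect on their reasoning, together with a counterfactual evaluation framework that analyzes how interventions affect collaboration outcomes.
Result: Experiments show that the friction-aware alignment approach significantly outperforms common alignment baselines in helping the group converge to common ground and in the correctness of task outcomes.
Insight: Alignment methods designed for single-user settings are of limited use in multiturn, multiparty collaboration; friction-aware methods better support reflective group decision-making.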
Abstract: As Large Language Models (LLMs) integrate into diverse workflows, they are increasingly being considered “collaborators” with humans. If such AI collaborators are to be reliable, their behavior over multiturn interactions must be predictable, validated and verified before deployment. Common alignment techniques are typically developed under simplified single-user settings and do not account for the dynamics of long-horizon multiparty interactions. This paper examines how different alignment methods affect LLM agents’ effectiveness as partners in multiturn, multiparty collaborations. We study this question through the lens of friction agents that intervene in group dialogues to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Using a roleplay methodology, we evaluate interventions from differently-trained friction agents in collaborative task conversations. We propose a novel counterfactual evaluation framework that quantifies how friction interventions change the trajectory of group collaboration and belief alignment. Our results show that a friction-aware approach significantly outperforms common alignment baselines in helping both convergence to a common ground, or agreed-upon task-relevant propositions, and correctness of task outcomes.
[19] Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
Zhenqi Jia,Rui Liu,Berrak Sisman,Haizhou Li
Main category: cs.CL
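TL;DR: This paper proposes MFCIG-CSS, a conversational speech synthesis method based on multimodal fine-grained context interaction graphs. By modeling fine-grained interactions between semantics and prosody, it improves the naturalness and prosodic quality of synthesized speech.
Details
Motivation: Existing conversational speech synthesis methods overlook fine-grained, word-level modeling of semantic and prosodic interactions in the multimodal dialogue history (MDH), so the generated prosody is not natural enough.
Contribution: Proposes the MFCIG-CSS system, which builds a semantic interaction graph and a prosody interaction graph to capture word-level semantic and prosodic interactions and their influence on subsequent utterances.
Method: Designs two multimodal fine-grained dialogue interaction graphs (semantic and prosodic), encodes interaction features through these graphs, and uses the features to enhance the prosody of synthesized speech.
Result: On the DailyTalk dataset, MFCIG-CSS outperforms all baseline models in prosodic expressiveness.
Insight: Fine-grained modeling of semantic and prosodic interactions is critical to improving the naturalness of conversational speech synthesis.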
Abstract: Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
[20] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
Hao Liang,Ruitao Wu,Bohan Zeng,Junbo Niu,Wentao Zhang,Bin Dong
Main category: cs.CL
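TL;DR: This paper presents a caption-assisted multimodal reasoning framework that bridges the gap between visual and textual modalities. It took 1st place in the ICML 2025 SeePhys challenge, and its generalization is validated on the MathVerse benchmark.
Details
Motivation: Despite notable progress in text-based reasoning, existing models still perform poorly in multimodal scenarios, so a reasoning method that effectively combines visual and textual information is needed.
Contribution: Proposes a caption-assisted multimodal reasoning framework and applies it to a science task (the SeePhys challenge) and to geometric reasoning (the MathVerse benchmark), demonstrating the method's generality and robustness.
Method: Uses a caption-assisted approach that converts visual information into textual descriptions, so that text-based reasoning models can solve multimodal tasks.
Result: The approach took 1st place in the ICML 2025 SeePhys challenge and also performed well on the MathVerse benchmark.
Insight: Converting visual information into text captions can substantially improve multimodal reasoning, especially on science and geometry tasks.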
Abstract: Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
[21] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
Junteng Liu,Yunji Li,Chi Zhang,Jingyang Li,Aili Chen,Ke Ji,Weiyu Cheng,Zijia Wu,Chengyu Du,Qidi Xu,Jiayuan Song,Zhengmao Zhu,Wenhu Chen,Pengyu Zhao,Junxian He
Main category: cs.CL
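TL;DR: This paper proposes WebExplorer, a systematic data generation approach for training long-horizon web agents. It builds a high-quality dataset through model-based exploration and iterative query evolution, then trains WebExplorer-8B, which supports long-horizon tasks and multi-step reasoning and reaches state-of-the-art performance at its scale on multiple benchmarks.
Details
Motivation: Current open-source web agents either show limited information-seeking ability on complex tasks or lack transparent implementations. The main obstacle is the scarcity of high-quality, challenging data.
Contribution: Proposes WebExplorer, a method that generates a high-quality dataset through model-based exploration and query evolution, and uses it to train WebExplorer-8B, a model that supports long-horizon tasks.
Method: Trains the model on the generated dataset with supervised fine-tuning followed by reinforcement learning. The model supports a 128K context length and multi-turn tool calling.
Result: Across multiple information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale and outperforms larger models on long-horizon tasks.
Insight: High-quality data generation and long-horizon capability are key to improving web agents. Model scale is not the only decisive factor; data construction and training methods matter as well.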
Abstract: The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop an advanced web agent, WebExplorer-8B, through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
[22] Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training
Andrei Baroian,Kasper Notebomer
Main category: cs.CL
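TL;DR: This paper introduces three new layer-wise scaling variants (Framed, Reverse, and Crown) that redistribute FFN widths and attention heads for LLM pre-training. Compared with a uniform-layer baseline, they achieve better performance without a substantial loss of training throughput.
Details
Motivation: Traditional Transformer models use uniform layer sizes, ignoring the different functional roles and compute needs of layers at different depths. The paper explores variants of Layer-Wise Scaling (LWS) to allocate model parameters more efficiently.
Contribution: Proposes three new LWS variants (Framed, Reverse, and Crown) that redistribute parameters across layers via linear interpolation, and presents the first systematic ablation of LWS and its variants, showing better performance under a fixed parameter budget.
Method: Redistributes FFN widths and attention heads in the pre-training stage via two- or three-point linear interpolation. Experiments use a fixed budget of 180M parameters and 5B training tokens.
Result: All proposed variants converge to similar losses and outperform an equal-cost uniform baseline, without a substantial decrease in training throughput.
Insight: Layer-wise scaling may be an effective direction for LLM pre-training, but larger experiments (more tokens and more parameters) are needed to assess its full potential.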
Abstract: Transformer-based language models traditionally use uniform (isotropic) layer sizes, yet they ignore the diverse functional roles that different depths can play and their computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants, on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance compared to an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
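The two- or three-point linear interpolation mentioned in this entry can be illustrated with a minimal sketch. The function name `interpolate_widths`, the anchor format, and the rounding to a multiple are assumptions made for illustration; the example profiles are hypothetical and are not the paper's definitions of Framed, Reverse, or Crown.

```python
def interpolate_widths(num_layers, anchors, multiple=64):
    """Piecewise-linear per-layer widths (illustrative, not the paper's code).

    anchors: list of (relative_depth, width) pairs sorted by depth,
             starting at depth 0.0 and ending at depth 1.0.
    """
    assert anchors[0][0] == 0.0 and anchors[-1][0] == 1.0
    widths = []
    for layer in range(num_layers):
        # Relative depth of this layer in [0, 1].
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        # Find the segment containing t and interpolate inside it.
        for (t0, w0), (t1, w1) in zip(anchors, anchors[1:]):
            if t0 <= t <= t1:
                frac = (t - t0) / (t1 - t0)
                width = w0 + frac * (w1 - w0)
                break
        # Round to a hardware-friendly multiple (an assumption).
        widths.append(int(round(width / multiple)) * multiple)
    return widths


# Two anchors: widths grow linearly with depth.
two_point = interpolate_widths(5, [(0.0, 512), (1.0, 1024)], multiple=128)
# Three anchors: a hypothetical profile that peaks at mid-depth.
three_point = interpolate_widths(
    5, [(0.0, 512), (0.5, 1024), (1.0, 512)], multiple=128
)
```

In a fixed-budget comparison like the one described, the anchor widths would have to be chosen so that the total parameter count matches the uniform baseline; that constraint is left out of this sketch.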
[23] SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion
Mengxue Yang,Chun Yang,Jiaqi Zhu,Jiafan Li,Jingqi Zhang,Yuyang Li,Ying Li
Main category: cs.CL
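TL;DR: SLiNT is a knowledge graph completion framework that combines structural information with semantic context, improving LLM link prediction through structural enhancement and contrastive training.
Details
Motivation: Link prediction in knowledge graphs requires combining structural information with semantic context, but existing large language models make limited use of structural signals, which leads to structural sparsity and semantic ambiguity.
Contribution: Proposes the SLiNT framework, comprising Structure-Guided Neighborhood Enhancement (SGNE), Dynamic Hard Contrastive Learning (DHCL), and Gradient-Decoupled Dual Injection (GDDI), which fuses knowledge graph structure with the generative ability of LLMs.
Method: SGNE enriches sparse entities, DHCL provides fine-grained supervision, and GDDI injects structural information at the token level while preserving the core LLM parameters.
Result: On the WN18RR and FB15k-237 datasets, SLiNT outperforms or matches existing methods.
Insight: Structure-aware representation learning can significantly improve knowledge graph completion, especially in sparse and zero-shot settings.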
Abstract: Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.
[24] Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval
Hao Lin,Peitong Xie,Jingxue Chen,Jie Lin,Qingkun Tang,Qianchun Lu
Main category: cs.CL
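TL;DR: This paper proposes MoLER, a domain-aware RAG method based on MoL-enhanced reinforcement learning. Its two-stage training pipeline (continual pre-training followed by reinforcement learning) significantly improves retrieval performance and achieves state-of-the-art results on multiple benchmark datasets.
Details
Motivation: When optimizing coarse ranking in the retrieval stage, existing RAG systems struggle to balance domain knowledge learning with query enhancement, which leads to suboptimal retrieval performance. MoLER is proposed to address this problem.
Contribution: 1. Proposes MoLER, which combines continual pre-training with reinforcement learning; 2. proposes the MSLF and MMLF strategies to reduce computational overhead; 3. achieves state-of-the-art performance on multiple benchmark datasets.
Method: MoLER uses a two-stage pipeline: 1. continual pre-training with a Mixture of Losses (MoL); 2. reinforcement learning with GRPO to optimize query and passage generation. The MSLF and MMLF strategies provide efficient training and scalable inference.
Result: Experiments show that MoLER significantly outperforms baseline methods on multiple benchmark datasets, achieving state-of-the-art performance.
Insight: MoLER balances domain knowledge with general language capability and addresses the scalability of RAG systems through an efficient training strategy.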
Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhancement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
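GRPO, which this entry uses in its RL phase, scores each sampled output relative to the other outputs drawn for the same prompt. The sketch below shows only that group-relative normalization step, assuming a recall-style scalar reward per sample. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details taken from MoLER.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same query."""
    mu = mean(rewards)
    # Population std is an assumption; some implementations use the sample std.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical recall rewards for four generations sampled for one query.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5])
```

Samples that retrieve better than their group average receive positive advantages and the rest receive negative ones, so no separate value network is needed for this step.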
[25] IntrEx: A Dataset for Modeling Engagement in Educational Conversations
Xingwei Tan,Mahathi Parvatham,Chiara Gambi,Gabriele Pergola
Main category: cs.CL
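TL;DR: The paper introduces IntrEx, the first large dataset annotated for interestingness and expected interestingness in educational conversations, built to study how linguistic features drive learner engagement.
Details
Motivation: Little prior work has examined how linguistic features affect learner interest in educational conversations. IntrEx fills this gap and helps explain how interest evolves over extended dialogues.
Contribution: 1. Releases the IntrEx dataset, which extends TSCC with sequence-level annotations; 2. proposes an RLHF-inspired, comparison-based annotation approach to improve agreement; 3. finds that small LLMs outperform GPT-4o at predicting interestingness; 4. analyzes how linguistic and cognitive factors affect conversational engagement.
Method: 1. Builds IntrEx on top of TSCC and annotates conversation interestingness; 2. uses comparison-based rating inspired by RLHF to improve annotator agreement; 3. fine-tunes LLMs (7B/8B) to predict interestingness; 4. analyzes linguistic features (concreteness, readability) and cognitive features (uptake).
Result: Fine-tuned small LLMs surpass GPT-4o at predicting interestingness, showing the potential of specialised datasets. Concreteness, readability, and uptake are found to influence conversational engagement.
Insight: Modeling engagement in educational conversations requires attention to how interest evolves over time. Specialised datasets can lift small LLMs on specialised tasks, and linguistic concreteness and learner uptake are key drivers.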
Abstract: Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, still little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.
[26] MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML
Haoyu Dong,Pengkun Zhang,Mingzhe Lu,Yanzhen Shen,Guolin Ke
Main category: cs.CL
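TL;DR: MachineLearningLM continues pretraining a language model on large-scale synthetic tabular prediction tasks, substantially improving LLM performance on ML tasks via in-context learning (ICL) while preserving general knowledge.
Details
Motivation: Existing LLMs struggle to learn from many in-context examples on standard ML tasks, so their in-context learning ability needs to be strengthened without harming general capabilities.
Contribution: Proposes the MachineLearningLM framework, which continues pretraining on tasks synthesized from millions of structural causal models (SCMs), substantially improving LLM performance on tasks such as tabular classification without degrading general chat ability.
Method: Distills decision strategies from a random-forest teacher into the LLM and uses a token-efficient serialized prompt to support large-batch inference.
Result: Outperforms strong LLM baselines by about 15% on average on out-of-distribution tabular classification across domains, while achieving 75.4% on MMLU.
Insight: Continued pretraining on synthetic tasks can substantially improve an LLM's ICL ability while keeping it general-purpose, and shows that ICL scales with many-shot demonstrations.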
Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
[27] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security
Yanrui Du,Fenglei Fan,Sendong Zhao,Jiawei Cao,Ting Liu,Bing Qin
Main category: cs.CL
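TL;DR: MoGU_v2 is an improved framework for balancing the usability and security of large language models (LLMs). It achieves better performance by dynamically allocating weights and tightly coupling routers with hidden states.
Details
Motivation: As LLMs become widely used in everyday life, their security has become a key concern, especially their ability to respond harmlessly to malicious instructions. Existing methods improve security but often reduce usability, so a better balance between the two is needed.
Contribution: Proposes the MoGU_v2 framework, which tightens the coupling between routers and hidden states, embeds routers only in layers that encode highly classifiable security features, and uses bidirectional adaptation to improve both the security and the usability of LLMs.
Method: In MoGU_v2, routers dynamically allocate weights and are embedded only in specific layers that encode security features. Backbone modules are activated during router optimization to enable bidirectional adaptation, and a data-mix strategy is used to restore security.
Result: MoGU_v2 shows strong adaptability and stable improvements across LLM families, including mainstream LLMs, on-device LLMs for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability.
Insight: MoGU_v2 offers an efficient way to avoid a forced trade-off between usability and security. It applies to a range of LLM scenarios and can restore security lost through fine-tuning with a simple strategy.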
Abstract: As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs’ security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs’ usability and security, rather than necessitating a trade-off between them. To address this, we propose the MoGU framework, in which the intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising the task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.
[28] Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem
Valentin Quesnel,Damien Sileo
Main category: cs.CL
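TL;DR: The paper proposes using E-prover on the TPTP axiom library to generate high-quality mathematical reasoning data without relying on error-prone large language models (LLMs), and uses three tasks to show that frontier models fall short on deep reasoning.
Details
Motivation: The lack of high-quality, logically sound data is a major bottleneck for improving the mathematical reasoning of LLMs. The paper addresses this by building on automated theorem proving research.
Contribution: Proposes a symbolic data generation framework based on E-prover, produces a large-scale, logically sound mathematical reasoning dataset, and experimentally exposes the weaknesses of frontier models in deep reasoning.
Method: Applies E-prover's saturation capabilities to the TPTP axiom library, filters for “interesting” theorems, and generates three difficulty-controlled tasks: entailment verification, premise selection, and proof reconstruction.
Result: Experiments show that frontier models perform poorly on tasks that require deep, structural reasoning. The framework provides both a tool for diagnosing this gap and a source of training data.
Insight: Purely symbolic data generation sidesteps the error-proneness of LLMs and provides a reliable basis for training and evaluating mathematical reasoning.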
Abstract: The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover’s saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for “interesting” theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available at https://github.com/sileod/reasoning_core and https://hf.co/datasets/reasoning-core/rc1.
[29] A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs
Max Malyi,Jonathan Shek,Alasdair McDonald,Andre Biscaya
Main category: cs.CL
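TL;DR: The paper presents an open-source framework for benchmarking large language models on classifying wind turbine maintenance logs, compares the reliability and efficiency of different models, and recommends a Human-in-the-Loop system as the best way to apply them.
Details
Motivation: The unstructured nature of wind turbine maintenance logs hinders automated analysis and hurts operation and maintenance efficiency. The paper aims to address this challenge with large language models.
Contribution: Presents a reproducible framework for systematically evaluating large language models on classifying complex industrial logs, and releases it as an open-source tool.
Method: Systematically evaluates a range of open-source and proprietary large language models, analyzing their trade-offs in reliability, efficiency, and model calibration.
Result: The results reveal a clear performance hierarchy, with the top models aligning closely with a benchmark standard and producing trustworthy confidence scores. Classification performance depends heavily on the semantic ambiguity of the task.
Insight: No current large language model reaches perfect accuracy on log classification, and calibration varies widely, so a Human-in-the-Loop system is the most effective near-term application.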
Abstract: Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task’s semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.
[30] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Eugene Kwek,Wenpeng Yin
Main category: cs.CL
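TL;DR: COMPACT jointly prunes rare vocabulary and FFN intermediate channels to reduce memory, latency, and serving cost while keeping a standard Transformer architecture.
Details
Motivation: Efficiency in memory, latency, and serving cost is crucial for deploying large language models (LLMs) at the edge and for sustainable inference. COMPACT aims to overcome the limitations of width pruning and depth pruning.
Contribution: COMPACT is the first to jointly prune vocabulary and FFN intermediate channels, using common-token-weighted activations to align importance with the post-pruning token distribution, which yields deployment-friendliness and strong memory savings.
Method: COMPACT prunes rare vocabulary and FFN intermediate channels using common-token-weighted activations. It keeps the standard Transformer architecture and is training-free, so pruning completes quickly.
Result: On the Qwen, LLaMA, and Gemma model families (0.5B-70B), COMPACT achieves state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
Insight: COMPACT combines the merits of depth and width pruning and offers a new direction for efficient LLM deployment, with broad potential in hardware-constrained settings.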
Abstract: Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
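To make the vocabulary-pruning half of this entry concrete, here is a minimal sketch of frequency-based vocabulary pruning on a toy embedding table. It assumes token frequencies are counted on a calibration stream; the function name `prune_vocab`, the tie-breaking rule, and the list-of-lists embedding are illustrative choices, not COMPACT's actual procedure.

```python
from collections import Counter


def prune_vocab(embedding, token_ids, keep):
    """Keep the `keep` most frequent tokens and remap their ids.

    embedding: list of rows, one per original token id.
    token_ids: token ids observed in a (hypothetical) calibration corpus.
    Returns (pruned_embedding, old_id_to_new_id).
    """
    counts = Counter(token_ids)
    # Rank by descending frequency, breaking ties by token id.
    ranked = sorted(range(len(embedding)), key=lambda t: (-counts[t], t))
    kept = sorted(ranked[:keep])
    remap = {old: new for new, old in enumerate(kept)}
    pruned = [embedding[old] for old in kept]
    return pruned, remap


# Toy 4-token vocabulary with 1-dimensional embeddings.
embedding = [[0.0], [1.0], [2.0], [3.0]]
pruned, remap = prune_vocab(embedding, token_ids=[2, 2, 0, 2, 0, 3], keep=2)
```

A real model would need the same row selection applied to the unembedding matrix, plus a tokenizer fallback for text that would have produced a removed token; both are outside this sketch.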
[31] The Majority is not always right: RL training for solution aggregation
Wenting Zhao,Pranjal Aggarwal,Swarnadeep Saha,Asli Celikyilmaz,Jason Weston,Ilia Kulikov
Main category: cs.CL
TL;DR: The paper proposes AggLM, a reinforcement learning-based method for aggregating multiple solutions in reasoning tasks, outperforming traditional majority voting and reward-model ranking.
Details
Motivation: Current methods for aggregating solutions in LLMs rely heavily on simple majority voting or reward models, which may not effectively capture minority-correct answers.
Contribution: Introduces AggLM, a reinforcement learning-trained aggregator model that reviews, reconciles, and synthesizes solutions, balancing easy and hard examples.
Method: Uses reinforcement learning with verifiable rewards to train an aggregator model, focusing on balancing training examples for robustness.
Result: AggLM outperforms rule-based and reward-model baselines across benchmarks and generalizes well to solutions from different models.
Insight: Learning aggregation as an explicit reasoning skill can enhance performance beyond traditional methods, even with fewer tokens.
Abstract: Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
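For reference, the majority-voting baseline that this entry compares against can be sketched in a few lines. This illustrates the baseline only, not AggLM; the answers and the gold label in the example are made up.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled solutions."""
    counts = Counter(answers)
    # most_common(1) returns a one-element list of (answer, count).
    return counts.most_common(1)[0][0]


# Hypothetical final answers from five sampled solutions to one problem.
samples = ["12", "12", "15", "12", "15"]
voted = majority_vote(samples)

# If the correct answer were "15", voting would discard it even though
# two candidate solutions reached it: the minority-but-correct case.
gold = "15"
voting_is_wrong = voted != gold
```

A learned aggregator, as described in the abstract, reads the candidate solutions themselves rather than only counting their final answers, which is what lets it recover such cases.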
[32] Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning
Liang Chen,Xueting Han,Li Shen,Jing Bai,Kam-Fai Wong
Main category: cs.CL
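TL;DR: This study proposes a new method that couples supervised fine-tuning (SFT) and reinforcement learning (RL) more tightly through bilevel optimization, improving the reasoning ability of large language models (LLMs).
Details
Motivation: Conventional two-stage training (SFT followed by RL) is inefficient and limits the synergy between SFT and RL. The method aims to strengthen their interaction through bilevel optimization.
Contribution: Proposes a novel bilevel optimization method in which SFT meta-learns how to guide the RL optimization process, so that the two jointly improve reasoning ability.
Method: In the bilevel framework, the lower level performs RL updates while receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain of joint SFT-RL training.
Result: On five reasoning benchmarks, the method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
Insight: Optimizing the cooperation between training paradigms can significantly improve both the reasoning ability and the training efficiency of LLMs.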
Abstract: Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL’s optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, that is, the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
[33] Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
Yinjie Wang,Ling Yang,Bowen Li,Ye Tian,Ke Shen,Mengdi Wang
Main category: cs.CL
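TL;DR: The paper proposes TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectories into post-training and applies across different architectures.
Details
Motivation: The reasoning performance of current diffusion language models on complex math and coding tasks leaves room for improvement. TraceRL uses trajectory-aware reinforcement learning to improve training stability and reasoning ability.
Contribution: 1. Proposes the TraceRL framework, with a diffusion-based value model that enhances training stability; 2. develops the TraDo model series, in which smaller models outperform larger ones; 3. derives the first long-CoT diffusion language model; 4. open-sources a comprehensive framework for building, training, and deploying diffusion language models.
Method: Uses trajectory-aware reinforcement learning with a diffusion-based value model, and applies curriculum learning to improve the model's reasoning ability.
Result: TraDo-4B-Instruct outperforms 7B-scale AR models on complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. The long-CoT model achieves an 18.1% relative accuracy gain over Qwen2.5-7B-Instruct on MATH500.
Insight: Trajectory-aware reinforcement learning can effectively improve the reasoning ability of diffusion language models, and small models with better architectures and training methods can surpass larger ones.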
Abstract: We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
[34] On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts
Linlu Qiu,Cedegao E. Zhang,Joshua B. Tenenbaum,Yoon Kim,Roger P. Levy
Main category: cs.CL
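TL;DR: The paper evaluates the pragmatic reasoning of language models with a framework based on the game Wavelength. State-of-the-art models approach human-level performance on language comprehension, while on language production CoT and RSA substantially improve performance.
Details
Motivation: As language models become common as conversational agents, it is increasingly important to understand their pragmatic reasoning abilities, that is, their reasoning about communicative goals and norms in context.
Contribution: Proposes an evaluation framework based on the game Wavelength, studies the pragmatic reasoning of language models in both comprehension and production, and explores how an RSA approach can improve it.
Method: 1. Tests pragmatic reasoning with direct and CoT prompting; 2. introduces the Rational Speech Act (RSA) framework to add Bayesian pragmatic reasoning.
Result: State-of-the-art models perform close to humans on language comprehension without any additional method. On language production, CoT outperforms direct prompting, and RSA improves performance further.
Insight: RSA offers a new direction for improving the pragmatic reasoning of language models and helps in understanding how models and humans differ in conceptual representation and social reasoning.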
Abstract: Language use is shaped by pragmatics – i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs’ pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
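The Rational Speech Act recursion mentioned in the abstract can be sketched in a few lines. This is a minimal, generic RSA listener and not the authors' implementation: the `lexicon` table stands in for whatever literal scores an LM would supply, and the names `rsa_listener`, `lexicon`, `prior`, and `alpha` are assumptions made for illustration.

```python
from typing import List


def normalize(weights: List[float]) -> List[float]:
    """Scale non-negative weights to sum to 1 (uniform if all are zero)."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def rsa_listener(lexicon: List[List[float]], prior: List[float],
                 alpha: float = 1.0) -> List[List[float]]:
    """Return the pragmatic listener L1[u][s] = P(state s | utterance u).

    lexicon[u][s] is a hypothetical literal compatibility score in [0, 1]
    between utterance u and state s; prior[s] is the prior over states.
    """
    n_utt, n_states = len(lexicon), len(prior)
    # Literal listener: L0(s | u) is proportional to lexicon[u][s] * prior[s].
    l0 = [normalize([lexicon[u][s] * prior[s] for s in range(n_states)])
          for u in range(n_utt)]
    # Pragmatic speaker: S1(u | s) is proportional to L0(s | u) ** alpha.
    s1 = [normalize([l0[u][s] ** alpha for u in range(n_utt)])
          for s in range(n_states)]
    # Pragmatic listener: L1(s | u) is proportional to S1(u | s) * prior[s].
    return [normalize([s1[s][u] * prior[s] for s in range(n_states)])
            for u in range(n_utt)]


# Toy scalar-implicature example: states are (some-but-not-all, all);
# utterances are ("some", "all"). "some" is literally true in both states.
l1 = rsa_listener(lexicon=[[1.0, 1.0], [0.0, 1.0]], prior=[0.5, 0.5])
```

In the toy example, hearing "some" shifts the listener toward the some-but-not-all state, because a speaker who meant "all" would more likely have said so.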
cs.CV [Back]
[35] Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification
Zhengda Li
Main category: cs.CV
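TL;DR: The paper proposes an Adaptive Knowledge Distillation (AKD) framework that dynamically adjusts the weights of the hard-label and soft-label losses and adaptively reweights teacher outputs through a context-aware module, improving classification performance.
Details
Motivation: Conventional knowledge distillation balances the hard-label and soft-label losses with a fixed weight alpha, but this static choice may not adapt to changes during training and can give suboptimal results. Contribution: 1. A learnable alpha parameter; 2. An alpha computed dynamically from the student-teacher discrepancy; 3. A Context-Aware Module (CAM) that adaptively reweights teacher outputs.
Method: 1. Treat alpha as a learnable parameter; 2. Design a formula that computes alpha dynamically; 3. Introduce a CAM built from an MLP plus attention.
Result: On CIFAR-10, AKD outperforms fixed-weight baselines and converges more stably.
Insight: Dynamically adjusting the loss weights and adding context awareness transfers knowledge more efficiently and improves the student model.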
Abstract: Knowledge distillation (KD) is a widely used technique to transfer knowledge from a large teacher network to a smaller student model. Traditional KD uses a fixed balancing factor alpha as a hyperparameter to combine the hard-label cross-entropy loss with the soft-label distillation loss. However, a static alpha is suboptimal because the optimal trade-off between hard and soft supervision can vary during training. In this work, we propose an Adaptive Knowledge Distillation (AKD) framework. First, we make alpha a learnable parameter that is optimized automatically during training. Then we introduce a formula that computes alpha dynamically from the gap between the student and the teacher, and further introduce a Context-Aware Module (CAM) using MLP + Attention to adaptively reweight class-wise teacher outputs. Experiments on CIFAR-10 with ResNet-50 as teacher and ResNet-18 as student demonstrate that our approach achieves superior accuracy compared to fixed-weight KD baselines, and yields more stable convergence.
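For reference, the fixed-alpha baseline that the abstract contrasts against might look like the sketch below. It follows the common temperature-scaled formulation of KD; which term alpha multiplies, the temperature value, and the function names are assumptions, and the paper's adaptive alpha formula is not reproduced here because the abstract does not state it.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: List[float], teacher_logits: List[float],
            label: int, alpha: float = 0.5, temperature: float = 4.0) -> float:
    """alpha * hard cross-entropy + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student against the true class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


# With alpha fixed, the balance between the two terms never changes;
# an adaptive scheme would recompute alpha as training progresses.
loss = kd_loss([1.0, 2.0, 0.5], [0.5, 3.0, 0.2], label=1, alpha=0.5)
```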
[36] A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD
Yunfei Guo,Tao Zhang,Wu Huang,Yao Song
Main category: cs.CV
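TL;DR: This paper presents a multimodal dataset generation scheme based on the Video2EEG-SPGN-Diffusion framework, combining the SEED-VD dataset with a diffusion model to generate EEG signals aligned with video stimuli, and releases a new dataset.
Details
Motivation: The work targets the challenge of video-EEG alignment in multimodal research and provides data support for emotion analysis and brain-computer interfaces. Contribution: 1) The open-source Video2EEG-SPGN-Diffusion framework; 2) a new dataset of over 1000 aligned video-EEG samples; 3) an engineering pipeline for data alignment.
Method: Combines a self-play graph network (SPGN) with a diffusion model to generate personalized EEG signals.
Result: Produces a dataset of 62-channel EEG signals sampled at 200 Hz with emotion labels, supporting multimodal research.
Insight: The framework offers a new data generation tool for emotion analysis and brain-computer interfaces, and fills a gap in research on multimodal data alignment.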
Abstract: This paper introduces an open-source framework, Video2EEG-SPGN-Diffusion, that leverages the SEED-VD dataset to generate a multimodal dataset of EEG signals conditioned on video stimuli. Additionally, we disclose an engineering pipeline for aligning video and EEG data pairs, facilitating the training of multimodal large models with EEG alignment capabilities. Personalized EEG signals are generated using a self-play graph network (SPGN) integrated with a diffusion model. As a major contribution, we release a new dataset comprising over 1000 samples of SEED-VD video stimuli paired with generated 62-channel EEG signals at 200 Hz and emotion labels, enabling video-EEG alignment and advancing multimodal research. This framework offers novel tools for emotion analysis, data augmentation, and brain-computer interface applications, with substantial research and engineering significance.
[37] RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness
Junghyun Park,Tuan Anh Nguyen,Dugki Min
Main category: cs.CV
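TL;DR: RT-VLM is a framework that combines a vision language model with multimodal clues, using a synthetic dataset and two-stage inference to make real-world object recognition more robust.
Details
Motivation: Real-world object recognition models degrade sharply under domain shifts such as changes in low-level image statistics, pose changes, occlusion, and inter-class confusion; the authors address this with multimodal clues and a self-correction mechanism. Contribution: 1) A pipeline for generating a synthetic dataset annotated with four clues (bounding boxes, class names, object-level captions, and a scene-level caption); 2) a two-stage Re-Thinking inference scheme; 3) better results than baselines on several domain-shift benchmarks.
Method: 1) A synthetic dataset generation pipeline annotates the four clues; 2) Llama 3.2 11B Vision Instruct is tuned with parameter-efficient supervised tuning; 3) at inference, a two-stage Re-Thinking scheme first generates the four clues and then iteratively corrects them.
Result: RT-VLM outperforms existing baselines across several domain-shift benchmarks, demonstrating robustness and transferability.
Insight: Combining structured multimodal clues with an explicit self-correction loop is an effective route to more reliable and transferable visual understanding.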
Abstract: Real-world deployments often expose modern object recognition models to domain shifts that precipitate a severe drop in accuracy. Such shifts encompass (i) variations in low-level image statistics, (ii) changes in object pose and viewpoint, (iii) partial occlusion, and (iv) visual confusion across adjacent classes. To mitigate this degradation, we introduce the Re-Thinking Vision Language Model (RT-VLM) framework. The foundation of this framework is a unique synthetic dataset generation pipeline that produces images annotated with “4-Clues”: precise bounding boxes, class names, detailed object-level captions, and a comprehensive context-level caption for the entire scene. We then perform parameter-efficient supervised tuning of Llama 3.2 11B Vision Instruct on this resource. At inference time, a two-stage Re-Thinking scheme is executed: the model first emits its own four clues, then re-examines these responses as evidence and iteratively corrects them. Across robustness benchmarks that isolate individual domain shifts, RT-VLM consistently surpasses strong baselines. These findings indicate that the integration of structured multimodal evidence with an explicit self-critique loop constitutes a promising route toward reliable and transferable visual understanding.
[38] A Real-Time, Vision-Based System for Badminton Smash Speed Estimation on Mobile Devices
Diwen Huang
Main category: cs.CV
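TL;DR: This paper presents a real-time, smartphone-based vision system for low-cost measurement of badminton smash speed. It detects and tracks the shuttlecock with a YOLOv5 model and a Kalman filter, and computes speed with a kinematic method that uses spatiotemporal scaling.
Details
Motivation: Conventional sports performance analysis is expensive and complex, which keeps it out of reach for amateur and recreational players. The paper aims to close this gap with smartphone technology, giving badminton players an affordable analysis tool. Contribution: 1) Shuttlecock detection and tracking based on YOLOv5 and a Kalman filter; 2) kinematic speed estimation with spatiotemporal scaling; 3) a user-friendly mobile application that lowers the barrier to high-level performance analysis.
Method: The system detects the shuttlecock with YOLOv5, tracks its trajectory with a Kalman filter, and computes speed from the video using kinematics and spatiotemporal scaling. Everything is packaged into a mobile application.
Result: The method measures badminton smash speed in real time and is reported to be effective in both performance and ease of use.
Insight: Combining smartphones with computer vision can substantially reduce the cost and complexity of high-level sports analysis and make it more widely available.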
Abstract: Performance metrics in sports, such as shot speed and angle, provide crucial feedback for athlete development. However, the technology to capture these metrics has historically been expensive, complex, and largely inaccessible to amateur and recreational players. This paper addresses this gap in the context of badminton, one of the world’s most popular sports, by introducing a novel, cost-effective, and user-friendly system for measuring smash speed using ubiquitous smartphone technology. Our approach leverages a custom-trained YOLOv5 model for shuttlecock detection, combined with a Kalman filter for robust trajectory tracking. By implementing a video-based kinematic speed estimation method with spatiotemporal scaling, the system automatically calculates the shuttlecock’s velocity from a standard video recording. The entire process is packaged into an intuitive mobile application, democratizing access to high-level performance analytics and empowering players at all levels to analyze and improve their game.
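The abstract does not spell out the speed computation, so the following is only a sketch of what video-based kinematic estimation with spatiotemporal scaling could look like: pixel displacement is converted to metres with a spatial scale, frame gaps are converted to seconds with the frame rate, and the peak segment speed is reported. The function name, the `metres_per_pixel` calibration value, and the choice of peak speed are assumptions; the tracked positions are assumed to be already smoothed (for example by a Kalman filter).

```python
import math
from typing import List, Tuple


def smash_speed_kmh(track: List[Tuple[int, float, float]], fps: float,
                    metres_per_pixel: float) -> float:
    """Estimate peak shuttlecock speed in km/h from tracked positions.

    track holds (frame_index, x_px, y_px) tuples in temporal order.
    metres_per_pixel is a hypothetical spatial scale, e.g. derived from
    known court dimensions visible in the frame.
    """
    peak_m_per_s = 0.0
    for (f0, x0, y0), (f1, x1, y1) in zip(track, track[1:]):
        dt = (f1 - f0) / fps  # temporal scaling: frames to seconds
        if dt <= 0:
            continue  # skip duplicate or out-of-order detections
        # Spatial scaling: pixels to metres.
        dist_m = math.hypot(x1 - x0, y1 - y0) * metres_per_pixel
        peak_m_per_s = max(peak_m_per_s, dist_m / dt)
    return peak_m_per_s * 3.6  # m/s to km/h


# 50 px in one frame at 60 fps with 1 cm per pixel is 0.5 m in 1/60 s.
speed = smash_speed_kmh([(0, 100.0, 100.0), (1, 130.0, 140.0)],
                        fps=60.0, metres_per_pixel=0.01)
```

Note that a single fixed scale ignores motion toward or away from the camera, so a real system would need some handling of perspective.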
[39] Anticipatory Fall Detection in Humans with Hybrid Directed Graph Neural Networks and Long Short-Term Memory
Younggeol Cho,Gokhan Solak,Olivia Nocentini,Marta Lorenzini,Andrea Fortuna,Arash Ajoudani
Main category: cs.CV
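TL;DR: The paper proposes a hybrid model combining a Dynamic Graph Neural Network (DGNN) with a Long Short-Term Memory (LSTM) network to anticipate human falls, improving accuracy by decoupling motion prediction from gait classification.
Details
Motivation: Fall detection and prevention are critical in assistive robotic systems, yet existing research focuses mainly on falls that have already occurred; predicting falls beforehand and analyzing the transient state between stability and an impending fall remain underexplored. Contribution: Proposes a new hybrid model (DGNN + LSTM), shows that decoupling prediction from classification is superior, and enables monitoring of the transient gait state.
Method: Real-time skeletal features serve as input; the DGNN classifies gait as stable, transient, or fall, and the LSTM predicts subsequent motion so that falls are detected early.
Result: On the OUMVLP-Pose and URFD datasets, the model achieves better prediction error and recognition accuracy than a DGNN-only model and methods from the literature.
Insight: Decoupling prediction and classification improves performance, and transient-state monitoring opens a new direction for improving advanced assistance systems.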
Abstract: Detecting and preventing falls in humans is a critical component of assistive robotic systems. While significant progress has been made in detecting falls, the prediction of falls before they happen and the analysis of the transient state between stability and an impending fall remain unexplored. In this paper, we propose an anticipatory fall detection method that utilizes a hybrid model combining Dynamic Graph Neural Networks (DGNN) with Long Short-Term Memory (LSTM) networks and decouples the motion prediction and gait classification tasks to anticipate falls with high accuracy. Our approach employs real-time skeletal features extracted from video sequences as input for the proposed model. The DGNN acts as a classifier, distinguishing between three gait states: stable, transient, and fall. The LSTM-based network then predicts human movement in subsequent time steps, enabling early detection of falls. The proposed model was trained and validated using the OUMVLP-Pose and URFD datasets, demonstrating superior performance in terms of prediction error and recognition accuracy compared to models relying solely on DGNN and models from literature. The results indicate that decoupling prediction and classification improves performance compared to addressing the unified problem using only the DGNN. Furthermore, our method allows for the monitoring of the transient state, offering valuable insights that could enhance the functionality of advanced assistance systems.
[40] Comparative Evaluation of Hard and Soft Clustering for Precise Brain Tumor Segmentation in MR Imaging
Dibya Jyoti Bora,Mrinal Kanti Mishra
Main category: cs.CV
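TL;DR: The paper compares hard clustering (K-Means) and soft clustering (FCM) for brain tumor segmentation in MRI. K-Means is faster but less accurate, while FCM is more accurate at a higher computational cost.
Details
Motivation: Brain tumor segmentation is a key challenge in medical image analysis, and accurate boundary delineation is critical for clinical decisions. The study compares hard and soft clustering to inform the choice of segmentation method. Contribution: A comprehensive comparison of K-Means and FCM that exposes the trade-off between speed and accuracy and offers guidance for practical use.
Method: On the BraTS2020 dataset, with Gaussian filtering and CLAHE as pre-processing, the study compares the segmentation performance of K-Means (hard clustering) and FCM (soft clustering) using DSC and processing time.
Result: K-Means averages 0.3 s per image with a DSC of 0.43; FCM averages 1.3 s per image with a DSC of 0.67.
Insight: Hard clustering suits settings that need a fast response, soft clustering suits settings that demand accuracy, and the choice should be weighed against the actual requirements.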
Abstract: Segmentation of brain tumors from Magnetic Resonance Imaging (MRI) remains a pivotal challenge in medical image analysis due to the heterogeneous nature of tumor morphology and intensity distributions. Accurate delineation of tumor boundaries is critical for clinical decision-making, radiotherapy planning, and longitudinal disease monitoring. In this study, we perform a comprehensive comparative analysis of two major clustering paradigms applied in MRI tumor segmentation: hard clustering, exemplified by the K-Means algorithm, and soft clustering, represented by Fuzzy C-Means (FCM). While K-Means assigns each pixel strictly to a single cluster, FCM introduces partial memberships, meaning each pixel can belong to multiple clusters with varying degrees of association. Experimental validation was performed using the BraTS2020 dataset, incorporating pre-processing through Gaussian filtering and Contrast Limited Adaptive Histogram Equalization (CLAHE). Evaluation metrics included the Dice Similarity Coefficient (DSC) and processing time, which collectively demonstrated that K-Means achieved superior speed with an average runtime of 0.3s per image, whereas FCM attained higher segmentation accuracy with an average DSC of 0.67 compared to 0.43 for K-Means, albeit at a higher computational cost (1.3s per image). These results highlight the inherent trade-off between computational efficiency and boundary precision.
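The contrast between strict assignment and partial membership, and the Dice metric used for evaluation, can be illustrated on scalar pixel intensities. This is a minimal sketch using the standard FCM membership formula with fuzzifier `m`; the function names and toy values are assumptions, and the study's actual pipeline (pre-processing, centre updates, cluster-to-tumor mapping) is not reproduced.

```python
from typing import List, Sequence


def kmeans_assign(value: float, centres: Sequence[float]) -> int:
    """Hard clustering: the pixel belongs only to its nearest centre."""
    return min(range(len(centres)), key=lambda k: abs(value - centres[k]))


def fcm_memberships(value: float, centres: Sequence[float],
                    m: float = 2.0) -> List[float]:
    """Soft clustering: u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1))."""
    dists = [abs(value - c) for c in centres]
    if any(d == 0 for d in dists):
        # A pixel sitting exactly on a centre gets full membership there.
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice Similarity Coefficient for flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if size == 0 else 2.0 * inter / size


centres = [0.0, 1.0]               # toy intensity centres
hard = kmeans_assign(0.4, centres)  # a single cluster index
soft = fcm_memberships(0.4, centres)  # degrees of membership summing to 1
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

The soft memberships carry information about boundary pixels that the hard label discards, which is one plausible reason for the accuracy gap reported in the abstract.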
[41] Handling imbalance and few-sample size in ML based Onion disease classification
Abhijeet Manoj Pal,Rajbabu Velmurugan
Main category: cs.CV
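TL;DR: The paper proposes a deep learning model for multi-class classification of onion crop diseases and pests. By integrating attention modules and a comprehensive data augmentation pipeline, it addresses class imbalance and small sample sizes, reaching 96.90% overall accuracy and a 0.96 F1 score on a real-world field image dataset.
Details
Motivation: Current disease and pest classification methods are mostly binary, which falls short in practice, especially where the specific type of disease or pest must be identified precisely. Contribution: 1) A robust deep learning model for multi-class classification; 2) attention modules and a comprehensive data augmentation pipeline that address class imbalance and small sample sizes; 3) better results than other methods on a real-world dataset.
Method: 1) Start from a pre-trained Convolutional Neural Network (CNN); 2) integrate attention modules; 3) apply a comprehensive data augmentation pipeline.
Result: The model achieves 96.90% overall accuracy and a 0.96 F1 score on a real-world field image dataset.
Insight: Attention modules and data augmentation are very effective against class imbalance and small sample sizes, particularly in complex agricultural settings.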
Abstract: Accurate classification of pests and diseases plays a vital role in precision agriculture, enabling efficient identification, targeted interventions, and preventing their further spread. However, current methods primarily focus on binary classification, which limits their practical applications, especially in scenarios where accurately identifying the specific type of disease or pest is essential. We propose a robust deep learning based model for multi-class classification of onion crop diseases and pests. We enhance a pre-trained Convolutional Neural Network (CNN) model by integrating attention-based modules and employing a comprehensive data augmentation pipeline to mitigate class imbalance. The proposed model gives 96.90% overall accuracy and a 0.96 F1 score on a real-world field image dataset. This model gives better results than other approaches using the same datasets.
[42] Delta Velocity Rectified Flow for Text-to-Image Editing
Gaspard Beaudouin,Minghan Li,Jaeyeon Kim,Sunghoon Yoon,Mengyu Wang
Main category: cs.CV
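TL;DR: This paper proposes Delta Velocity Rectified Flow (DVRF), an inversion-free text-to-image editing framework that reduces over-smoothing by explicitly modeling the discrepancy between the source and target velocity fields.
Details
Motivation: Existing distillation sampling methods suffer from over-smoothing and lack explicit path awareness, which hurts the quality of text-to-image editing. Contribution: Proposes the DVRF framework, introduces a time-dependent shift term to improve alignment with the target trajectory, and establishes theoretical connections to Delta Denoising Score and FlowEdit.
Method: A distillation-based method that explicitly models the velocity field discrepancy and introduces a time-dependent shift term that pushes noisy latents toward the target trajectory.
Result: Experiments show that DVRF outperforms existing methods in editing quality, fidelity, and controllability, without architectural modifications.
Insight: DVRF provides a theoretical framework for velocity-field optimization and reveals its connection to score-based diffusion optimization.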
Abstract: We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. Code is available at https://github.com/gaspardbd/DeltaVelocityRectifiedFlow.
[43] Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Diagnosis
Zahid Ullah,Minki Hong,Tahir Mahmood,Jihie Kim
Main category: cs.CV
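TL;DR: The paper systematically integrates attention modules into five classic CNN architectures (such as VGG16 and ResNet18), markedly improving the accuracy and generalization of medical image diagnosis, with particularly strong results on brain tumor MRI and histopathology datasets.
Details
Motivation: Conventional CNNs struggle to capture the fine-grained, complex features that accurate diagnosis depends on in medical image analysis, so attention mechanisms are needed to sharpen feature focus. Contribution: 1. A method for systematically integrating attention modules into five widely used CNN architectures; 2. evidence that attention improves classification accuracy and feature localization; 3. a practical framework for developing clinically applicable deep learning decision support systems.
Method: Embeds either a Squeeze-and-Excitation block or a hybrid Convolutional Block Attention Module (CBAM) in each of the five CNN architectures to adaptively recalibrate channel and spatial feature representations.
Result: Attention-augmented CNNs outperform the baseline models on all metrics; EfficientNetB5 with hybrid attention performs best on both datasets.
Insight: Attention mechanisms not only raise classification performance but also improve feature localization, which helps models generalize across imaging modalities.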
Abstract: Deep learning has become a powerful tool for medical image analysis; however, conventional Convolutional Neural Networks (CNNs) often fail to capture the fine-grained and complex features critical for accurate diagnosis. To address this limitation, we systematically integrate attention mechanisms into five widely adopted CNN architectures, namely, VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5, to enhance their ability to focus on salient regions and improve discriminative performance. Specifically, each baseline model is augmented with either a Squeeze and Excitation block or a hybrid Convolutional Block Attention Module, allowing adaptive recalibration of channel and spatial feature representations. The proposed models are evaluated on two distinct medical imaging datasets, a brain tumor MRI dataset comprising multiple tumor subtypes, and a Products of Conception histopathological dataset containing four tissue categories. Experimental results demonstrate that attention augmented CNNs consistently outperform baseline architectures across all metrics. In particular, EfficientNetB5 with hybrid attention achieves the highest overall performance, delivering substantial gains on both datasets. Beyond improved classification accuracy, attention mechanisms enhance feature localization, leading to better generalization across heterogeneous imaging modalities. This work contributes a systematic comparative framework for embedding attention modules in diverse CNN architectures and rigorously assesses their impact across multiple medical imaging tasks. The findings provide practical insights for the development of robust, interpretable, and clinically applicable deep learning based decision support systems.
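The channel recalibration performed by a Squeeze-and-Excitation block can be sketched without any framework. This is a minimal illustration of the generic SE computation (global average pool, two small fully connected layers, sigmoid gate), with biases omitted for brevity; the weights `w1` and `w2` and the nested-list tensor layout are assumptions for illustration, not the paper's configuration.

```python
import math
from typing import List

Tensor3 = List[List[List[float]]]  # feature map laid out as [C][H][W]


def se_block(feature_map: Tensor3, w1: List[List[float]],
             w2: List[List[float]]) -> Tensor3:
    """Rescale each channel by a learned gate in (0, 1).

    w1 has shape [C // r][C] (reduction layer) and w2 has shape [C][C // r]
    (expansion layer), where r is the reduction ratio.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields per-channel gates.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]


# Two 1x1 channels; zero reduction weights give a gate of sigmoid(0) = 0.5.
out = se_block([[[2.0]], [[4.0]]], w1=[[0.0, 0.0]], w2=[[1.0], [1.0]])
```

CBAM extends this idea with an additional spatial attention map, which is why the abstract describes it as recalibrating both channel and spatial representations.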
[44] Vision-Based Object Detection for UAV Solar Panel Inspection Using an Enhanced Defects Dataset
Ashen Rodrigo,Isuru Munasinghe,Asanka Perera
Main category: cs.CV
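TL;DR: The paper evaluates five state-of-the-art object detection models (YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer) on solar panel defect detection and builds a dedicated dataset in COCO format.
Details
Motivation: Detecting defects and contamination on solar panels is critical to the efficiency of photovoltaic systems, but the trade-off between accuracy and efficiency in existing methods needs further study. Contribution: A dedicated dataset for solar panel defect detection and a comprehensive evaluation of five state-of-the-art detection models.
Method: Trains and evaluates the five models (YOLOv3, Faster R-CNN, and others) on a custom COCO-format dataset, comparing them on mAP, precision, recall, and inference speed.
Result: The results show a trade-off between accuracy and computational efficiency across the models, which provides a basis for choosing among them in practice.
Insight: Different models suit different scenarios: Swin Transformer excels in accuracy, while YOLOv3 is better in speed.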
Abstract: Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. This study presents a comprehensive evaluation of five state-of-the-art object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer, for identifying physical and electrical defects as well as surface contaminants such as dust, dirt, and bird droppings on solar panels. A custom dataset, annotated in the COCO format and specifically designed for solar panel defect and contamination detection, was developed alongside a user interface to train and evaluate the models. The performance of each model is assessed and compared based on mean Average Precision (mAP), precision, recall, and inference speed. The results demonstrate the trade-offs between detection accuracy and computational efficiency, highlighting the relative strengths and limitations of each model. These findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios. The dataset will be publicly available at https://github.com/IsuruMunasinghe98/solar-panel-inspection-dataset.
[45] Unsupervised Instance Segmentation with Superpixels
Cuong Manh Hoang
Main category: cs.CV
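TL;DR: The paper proposes an unsupervised instance segmentation framework that needs no human annotations and markedly improves segmentation performance by combining superpixels with multi-stage refinement.
Details
Motivation: Current instance segmentation models depend on large amounts of costly human annotation. The author aims for an unsupervised method that removes this cost while retaining high performance. Contribution: A new unsupervised instance segmentation framework that combines superpixels with multi-stage refinement, including coarse mask generation, mask filtering, and self-training.
Method: 1. Apply the MultiCut algorithm to self-supervised features to generate coarse masks. 2. Use a mask filter to keep high-quality coarse masks. 3. Train the segmentation network with a superpixel-guided mask loss (hard loss and soft loss). 4. Add a self-training stage with an adaptive loss to refine the predicted masks.
Result: Experiments on public datasets show that the framework outperforms existing unsupervised methods on instance segmentation and object detection.
Insight: By combining low-level image cues (superpixels) with high-level semantics (self-supervised features), unsupervised methods can also achieve high-quality instance segmentation.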
Abstract: Instance segmentation is essential for numerous computer vision applications, including robotics, human-computer interaction, and autonomous driving. Currently, popular models bring impressive performance in instance segmentation by training with a large number of human annotations, which are costly to collect. For this reason, we present a new framework that efficiently and effectively segments objects without the need for human annotations. Firstly, a MultiCut algorithm is applied to self-supervised features for coarse mask segmentation. Then, a mask filter is employed to obtain high-quality coarse masks. To train the segmentation network, we compute a novel superpixel-guided mask loss, comprising hard loss and soft loss, with high-quality coarse masks and superpixels segmented from low-level image features. Lastly, a self-training process with a new adaptive loss is proposed to improve the quality of predicted masks. We conduct experiments on public datasets in instance segmentation and object detection to demonstrate the effectiveness of the proposed framework. The results show that the proposed framework outperforms previous state-of-the-art methods.
[46] Augmented Structure Preserving Neural Networks for cell biomechanics
Juan Olalla-Pombo,Alberto Badías,Miguel Ángel Sanz-Gómez,José María Benítez,Francisco Javier Montáns
Main category: cs.CV
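TL;DR: This paper presents a new approach that combines Structure Preserving Neural Networks (SPNN) with other machine learning tools (such as Artificial Neural Networks) to study complex phenomena in cell biomechanics, including cell migration and prediction of mitosis events.
Details
Motivation: Cell biomechanics spans complex phenomena from embryogenesis to tumor growth, but existing methods do not adequately capture the interactions between cells and their influence on collective behavior. The paper aims to improve prediction by combining mechanical-system analysis with modeling of environmental factors. Contribution: 1. A new model combining SPNN and Artificial Neural Networks to predict cell migration trajectories. 2. A neural-network model for predicting mitosis events. 3. Validation of the model's accuracy on simulated and real data.
Method: 1. The SPNN models cell motion as a mechanical system. 2. Artificial Neural Networks, together with computer vision techniques, capture environmental factors. 3. A roll-out policy predicts complete cell trajectories. 4. The mitosis event prediction model uses the same features.
Result: The model predicts simulated and real cell migration cases with high accuracy, and the mitosis event prediction model also performs well.
Insight: Combining mechanical-system modeling with analysis of environmental factors can markedly improve prediction of cell biomechanics phenomena and offers a new tool for understanding collective cell behavior.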
Abstract: Cell biomechanics involve a great number of complex phenomena that are fundamental to the evolution of life itself and other associated processes, ranging from the very early stages of embryogenesis to the maintenance of damaged structures or the growth of tumors. Given the importance of such phenomena, increasing research has been dedicated to their understanding, but the many interactions between them and their influence on the decisions of cells as a collective network or cluster remain unclear. We present a new approach that combines Structure Preserving Neural Networks, which study cell movements as a purely mechanical system, with other Machine Learning tools (Artificial Neural Networks), which allow taking into consideration environmental factors that can be directly deduced from an experiment with Computer Vision techniques. This new model, tested on simulated and real cell migration cases, predicts complete cell trajectories following a roll-out policy with a high level of accuracy. This work also includes a mitosis event prediction model based on neural network architectures, which makes use of the same observed features.
[47] Advanced Brain Tumor Segmentation Using EMCAD: Efficient Multi-scale Convolutional Attention Decoding
GodsGift Uzor,Tania-Amanda Nkoyo Fredrick Eneye,Chukwuebuka Ijezue
Main category: cs.CV
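TL;DR: This paper applies EMCAD, an efficient multi-scale convolutional attention decoder, to balance performance and computational efficiency in brain tumor segmentation, particularly when computational resources are limited.
Details
Motivation: Brain tumor segmentation is a key pre-processing step in medical image analysis, but existing decoding mechanisms usually carry a high computational cost, which EMCAD is designed to address. Contribution: The main contribution is the EMCAD decoder, which is intended to reduce computational cost substantially while maintaining performance.
Method: EMCAD combines multi-scale convolution with attention mechanisms to optimize decoding.
Result: On the BraTS2020 dataset, the best Dice score was 0.31 and the mean Dice score during training was 0.285 ± 0.015; performance was stable with no sign of overfitting.
Insight: The EMCAD design suggests that fusing multi-scale convolution with attention can balance performance and computational efficiency, especially in resource-constrained settings.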
Abstract: Brain tumor segmentation is a critical pre-processing step in the medical image analysis pipeline that involves precise delineation of tumor regions from healthy brain tissue in medical imaging data, particularly MRI scans. An efficient and effective decoding mechanism is crucial in brain tumor segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, EMCAD, a new efficient multi-scale convolutional attention decoder, was utilized to optimize both performance and computational efficiency for brain tumor segmentation on the BraTS2020 dataset, which consists of MRI scans from 369 brain tumor patients. In preliminary results, the model achieved a best Dice score of 0.31 and maintained a stable mean Dice score of 0.285 ± 0.015 throughout the training process, which is moderate. The initial model maintained consistent performance across the validation set without showing signs of over-fitting.
[48] Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN’s
Iftekhar Haider Chowdhury,Zaed Ikbal Syed,Ahmed Faizul Haque Dhrubo,Mohammad Abdul Qayum
Main category: cs.CV
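TL;DR: The paper proposes Differential Sensitivity Fusion Pruning, a new single-shot filter pruning framework that assesses filter importance by fusing several sensitivity metrics, giving an efficient and deterministic pruning procedure.
Details
Motivation: Deep Convolutional Neural Networks (DCNNs) incur large computational and memory overhead in practical deployment, so efficient pruning methods are needed to reduce model complexity. Contribution: 1. The Differential Sensitivity Fusion Pruning framework, which fuses gradient-based sensitivity, first-order Taylor expansion, and KL divergence to assess filter importance; 2. an exponential scaling mechanism that emphasizes filters whose importance is inconsistent across metrics; 3. an efficient, deterministic method that needs only a single forward-backward pass.
Method: 1. Compute a differential sensitivity score for each filter by fusing the discrepancies among the metrics; 2. apply the exponential scaling mechanism; 3. score and prune in a single forward-backward pass.
Result: At pruning rates of 50% to 70%, the method reduces floating-point operations by over 80% and retains up to 98.23% of baseline accuracy at 70% pruning.
Insight: The method offers a new approach to efficient DCNN compression and deployment on mobile devices, balancing compression against performance.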
Abstract: Deep Convolutional Neural Networks have achieved state-of-the-art performance across various computer vision tasks; however, their practical deployment is limited by computational and memory overhead. This paper introduces Differential Sensitivity Fusion Pruning, a novel single-shot filter pruning framework that focuses on evaluating the stability and redundancy of filter importance scores across multiple criteria. Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient-based sensitivity, first-order Taylor expansion, and KL divergence of activation distributions. An exponential scaling mechanism is applied to emphasize filters with inconsistent importance across metrics, identifying candidates that are structurally unstable or less critical to the model performance. Unlike iterative or reinforcement learning based pruning strategies, Differential Sensitivity Fusion Pruning is efficient and deterministic, requiring only a single forward-backward pass for scoring and pruning. Extensive experiments across pruning rates from 50 to 70 percent demonstrate that Differential Sensitivity Fusion Pruning significantly reduces model complexity, achieving over 80 percent reduction in floating-point operations (FLOPs) while maintaining high accuracy. For instance, at 70 percent pruning, our approach retains up to 98.23 percent of baseline accuracy, surpassing traditional heuristics in both compression and generalization. The proposed method presents an effective solution for scalable and adaptive Deep Convolutional Neural Network compression, paving the way for efficient deployment on edge and mobile platforms.
[49] Veriserum: A dual-plane fluoroscopic dataset with knee implant phantoms for deep learning in medical imaging
Jinhao Wang,Florian Vogl,Pascal Schütz,Saša Ćuković,William R. Taylor
Main category: cs.CV
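TL;DR: Veriserum is an open-source dual-plane fluoroscopic dataset of approximately 110,000 X-ray images of knee implants, built to support deep learning registration in medical imaging. It covers poses from a range of daily activities, provides automatically and manually registered ground-truth poses, and aims to advance applications such as 2D/3D image registration, segmentation, and 3D reconstruction.
Details
Motivation: The lack of high-quality, publicly available datasets limits the development and evaluation of deep learning algorithms in medical imaging, particularly for dual-plane fluoroscopic analysis of knee implants. Contribution: The Veriserum dataset provides large-scale X-ray images of knee implants for training deep learning models, together with rich pose annotations and calibration tools.
Method: Dual-plane fluoroscopy was used to capture images of 10 knee implant pair combinations across 1,600 trials and a range of daily-activity poses, with ground-truth poses registered automatically and manually.
Result: The dataset contains approximately 110,000 images, 200 of which have manually registered poses for benchmarking, and supports research on a variety of medical imaging tasks.
Insight: Veriserum fills a dataset gap in deep learning for medical imaging, provides a reproducible benchmark for algorithm development, and promotes the use of dual-plane fluoroscopy.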
Abstract: Veriserum is an open-source dataset designed to support the training of deep learning registration for dual-plane fluoroscopic analysis. It comprises approximately 110,000 X-ray images of 10 knee implant pair combinations (2 femur and 5 tibia implants) captured during 1,600 trials, incorporating poses associated with daily activities such as level gait and ramp descent. Each image is annotated with an automatically registered ground-truth pose, while 200 images include manually registered poses for benchmarking. Key features of Veriserum include dual-plane images and calibration tools. The dataset aims to support the development of applications such as 2D/3D image registration, image segmentation, X-ray distortion correction, and 3D reconstruction. Freely accessible, Veriserum aims to advance computer vision and medical imaging research by providing a reproducible benchmark for algorithm development and evaluation. The Veriserum dataset used in this study is publicly available via https://movement.ethz.ch/data-repository/veriserum.html, with the data stored at ETH Zürich Research Collections: https://doi.org/10.3929/ethz-b-000701146.
[50] OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation
Ahad Jawaid,Yu Xiang
Main category: cs.CV
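TL;DR: OpenEgo is a large-scale multimodal egocentric dataset focused on dexterous manipulation. It consolidates six public datasets and provides standardized hand-pose annotations and intention-aligned action primitives.
Details
Motivation: Existing egocentric video datasets usually lack either fine-grained, temporally localized action descriptions or dexterous hand annotations, which limits imitation learning research. Contribution: Introduces the OpenEgo dataset, which totals 1107 hours of multimodal data covering 290 manipulation tasks in 600+ environments, with unified hand-pose layouts and timestamped action primitives.
Method: Consolidates several public datasets, unifies the hand-pose annotations, and designs intention-aligned action primitives to support language-conditioned imitation-learning policies.
Result: Validates the dataset's utility by training language-conditioned imitation-learning policies that predict dexterous hand trajectories.
Insight: OpenEgo lowers the barrier to learning dexterous manipulation from egocentric video and supports reproducible research in vision-language-action learning.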
Abstract: Egocentric human videos provide scalable demonstrations for imitation learning, but existing corpora often lack either fine-grained, temporally localized action descriptions or dexterous hand annotations. We introduce OpenEgo, a multimodal egocentric manipulation dataset with standardized hand-pose annotations and intention-aligned action primitives. OpenEgo totals 1107 hours across six public datasets, covering 290 manipulation tasks in 600+ environments. We unify hand-pose layouts and provide descriptive, timestamped action primitives. To validate its utility, we train language-conditioned imitation-learning policies to predict dexterous hand trajectories. OpenEgo is designed to lower the barrier to learning dexterous manipulation from egocentric video and to support reproducible research in vision-language-action learning. All resources and instructions will be released at www.openegocentric.com.
[51] Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting
Sen Wang,Kunyi Li,Siyun Liang,Elena Alegret,Jing Ma,Nassir Navab,Stefano Gasperini
Main category: cs.CV
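TL;DR: The paper proposes VALA, which resolves inconsistent language features in 3D Gaussian Splatting through visibility-aware language aggregation and a streaming weighted geometric median that merges multi-view features, improving open-vocabulary segmentation.
Details
Motivation: Existing methods for distilling open-vocabulary language features into 3D Gaussian Splatting have two problems: background Gaussians that contribute little to a rendered pixel receive the same feature as the foreground ones, and view-specific noise in language embeddings causes multi-view inconsistencies. Contribution: Proposes VALA, which addresses both problems with visibility-aware language aggregation and a streaming weighted geometric median for merging multi-view features.
Method: VALA computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. A streaming weighted geometric median merges the multi-view features.
Result: VALA consistently outperforms existing methods on open-vocabulary localization and segmentation.
Insight: Visibility-aware feature aggregation and a noise-robust strategy for merging multi-view features are key to consistent and robust language embeddings for 3D scenes.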
Abstract: Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
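To illustrate why a geometric median resists view-specific noise better than a weighted mean, here is a batch Weiszfeld iteration in Euclidean space. It is a simplified stand-in rather than the paper's method: the abstract describes a streaming variant in cosine space whose details are not given, and the function name, the use of per-view weights as visibility contributions, and the iteration settings are assumptions.

```python
import math
from typing import List


def weighted_geometric_median(points: List[List[float]], weights: List[float],
                              iters: int = 100, eps: float = 1e-9) -> List[float]:
    """Approximate the weighted geometric median with Weiszfeld updates.

    points are per-view feature vectors and weights are hypothetical
    per-view contributions (for example, how visible the Gaussian is).
    """
    dim = len(points[0])
    total_w = sum(weights)
    # Start from the weighted mean, then iteratively reweight by 1 / distance.
    est = [sum(w * p[d] for p, w in zip(points, weights)) / total_w
           for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p, w in zip(points, weights):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, est)))
            coef = w / max(dist, eps)  # clamp to avoid division by zero
            den += coef
            for d in range(dim):
                num[d] += coef * p[d]
        est = [n / den for n in num]
    return est


# Three consistent views and one noisy outlier view.
views = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [10.0, 10.0]]
median = weighted_geometric_median(views, [1.0, 1.0, 1.0, 1.0])
mean = [sum(v[d] for v in views) / len(views) for d in range(2)]
```

The mean is pulled a quarter of the way toward the outlier, while the median stays with the consistent views.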
[52] DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation
Haitao Tian,Pierre Payeur
Main category: cs.CV
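TL;DR: DuoCLR is a dual-surrogate contrastive learning framework for pre-training skeleton-based human action segmentation models. It improves performance through multi-scale representations and cross-sequence variations.
Details
Motivation: Prior work is mostly designed for action recognition and overlooks the multi-scale representations that action segmentation requires. DuoCLR addresses this gap by using contrastive learning to improve segmentation. Contribution: 1. Proposes the ‘Shuffle and Warp’ data augmentation strategy; 2. designs two surrogate tasks, CPC and ROR; 3. demonstrates DuoCLR's advantage on action segmentation.
Method: ‘Shuffle and Warp’ generates diverse multi-action permutations, which support dual-surrogate contrastive learning with CPC (Cross Permutation Contrasting) and ROR (Relative Order Reasoning).
Result: DuoCLR significantly outperforms existing methods on an untrimmed dataset, in both multi-class and multi-label action segmentation.
Insight: Dual surrogate tasks that combine multi-scale representations with contrastive learning improve action segmentation, and the diversity of the augmentation strategy is key.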
Abstract: In this paper, a contrastive representation learning framework is proposed to enhance human action segmentation via pre-training using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that build upon isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, ‘Shuffle and Warp’, which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pre-trained on a trimmed skeleton dataset and evaluated on an untrimmed dataset where it demonstrates a significant boost over state-of-the-art comparatives in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
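A rough sketch of what a ‘Shuffle and Warp’ style augmentation could look like: trimmed single-action clips are permuted, each clip is temporally resampled, and the results are concatenated into one multi-action sequence with per-frame labels. The abstract does not specify the warping function or its ranges, so the nearest-index resampling, the scale bounds, and all names below are assumptions.

```python
import random


def warp(frames, new_len):
    """Resample a clip to new_len frames by nearest-index lookup (assumed warp)."""
    n = len(frames)
    return [frames[(i * n) // new_len] for i in range(new_len)]


def shuffle_and_warp(clips, rng, min_scale=0.5, max_scale=1.5):
    """clips: list of (label, frames). Returns (frames, per-frame labels, permutation)."""
    order = list(range(len(clips)))
    rng.shuffle(order)
    frames, labels = [], []
    for idx in order:
        label, clip = clips[idx]
        new_len = max(1, round(len(clip) * rng.uniform(min_scale, max_scale)))
        frames.extend(warp(clip, new_len))
        labels.extend([label] * new_len)
    return frames, labels, order


# Frames are placeholder strings here; real inputs would be skeleton poses.
clips = [
    ("wave", ["w0", "w1", "w2", "w3"]),
    ("sit", ["s0", "s1", "s2"]),
    ("jump", ["j0", "j1", "j2", "j3", "j4"]),
]
frames, labels, order = shuffle_and_warp(clips, random.Random(0))
```

Calling `shuffle_and_warp` twice on the same clips yields two different permutations of the same actions, which is the kind of pair that tasks such as CPC and ROR would compare.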
[53] RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation
Yihong Leng,Siming Zheng,Jinwei Chen,Bo Li,Jiaojiao Li,Peng-Tao Jiang
Main category: cs.CV
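TL;DR: The paper proposes RED, a robust event-guided motion deblurring method that uses modality-specific disentangled representations and a random perturbation strategy to improve robustness to incomplete event streams.
Details
Motivation: Existing methods overlook the incompleteness of event streams, which stems from the trade-off between sensitivity and noise introduced by the DVS thresholding mechanism. This incompleteness compromises the integrity of motion priors and limits event-guided deblurring. Contribution: 1) Proposes the RED network with modality-specific disentangled representation; 2) designs a Robustness-Oriented Perturbation Strategy (RPS); 3) proposes a disentangled OmniAttention module to model multi-dimensional correlations.
Method: 1) RPS improves robustness by training with random masking; 2) OmniAttention explicitly models intra-motion, inter-motion, and cross-modality correlations; 3) interactive modules enhance motion-sensitive areas and inject semantic context.
Result: On synthetic and real-world datasets, RED achieves state-of-the-art accuracy and robustness.
Insight: Event-stream incompleteness is a key challenge; perturbation training and disentangled representations improve a model's adaptability to incomplete events.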
Abstract: Event cameras provide sparse yet high-temporal-resolution motion information, demonstrating great potential for motion deblurring. Existing methods focus on cross-modal interaction, overlooking the inherent incompleteness of event streams, which arises from the trade-off between sensitivity and noise introduced by the thresholding mechanism of Dynamic Vision Sensors (DVS). Such degradation compromises the integrity of motion priors and limits the effectiveness of event-guided deblurring. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that applies random masking to events, which exposes RED to incomplete patterns and thereby fosters robustness against various unknown scenario conditions. Next, a disentangled OmniAttention is presented to explicitly model intra-motion, inter-motion, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are designed to enhance motion-sensitive areas in blurry images and inject semantic context into incomplete event representations. Extensive experiments on synthetic and real-world datasets demonstrate that RED consistently achieves state-of-the-art performance in both accuracy and robustness.
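The random masking behind RPS can be pictured with a small sketch. It assumes events are `(x, y, t, polarity)` tuples and that masking drops each event independently with a fixed probability; the abstract does not say whether RED masks individual events, spatial regions, or time windows, so this granularity and the names used are assumptions.

```python
import random


def mask_events(events, drop_prob, rng):
    """Drop each event independently with probability drop_prob (assumed scheme)."""
    return [event for event in events if rng.random() >= drop_prob]


# Toy event stream: (x, y, timestamp in microseconds, polarity).
events = [(x, 10, 1000 + 5 * x, 1 if x % 2 == 0 else -1) for x in range(100)]
rng = random.Random(42)
incomplete = mask_events(events, drop_prob=0.3, rng=rng)
```

Training on `incomplete` rather than `events` exposes the model to the kind of missing-event patterns that an overly conservative DVS threshold would produce.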
[54] Sensitivity-Aware Post-Training Quantization for Deep Neural Networks
Zekang Zheng,Haokun Li,Yaofo Chen,Mingkui Tan,Qing Du
Main category: cs.CV
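TL;DR: The paper proposes an efficient post-training quantization (PTQ) method based on parameter sensitivity. It quantizes high-sensitivity parameters first and uses the still-unquantized low-sensitivity parameters to compensate for quantization error, substantially speeding up quantization while preserving accuracy.
Details
Motivation: Existing PTQ methods update parameters iteratively, which incurs high computational complexity and limits their use in resource-constrained edge computing and real-time inference. The paper aims to design an efficient quantization method with low accuracy loss, guided by parameter sensitivity analysis. Contribution: 1) A sensitivity-based quantization priority strategy; 2) a row-parallel quantization framework with a globally shared inverse Hessian update mechanism that greatly reduces computational complexity; 3) a 20-200x quantization speedup on ResNet-50 and YOLOv5s with accuracy loss below 0.3%.
Method: 1) Parameter sensitivity analysis prioritizes quantization of high-sensitivity parameters; 2) unquantized low-sensitivity parameters compensate for the error; 3) the column-wise clustering of parameter sensitivity is exploited to build a row-parallel quantization framework with a shared inverse Hessian update mechanism.
Result: Experiments on ResNet-50 and YOLOv5s show a 20-200x quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%.
Insight: Parameter sensitivity analysis combined with a parallel quantization framework balances efficiency against accuracy loss, offering a practical solution for edge computing and real-time scenarios.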
Abstract: Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method’s efficacy in balancing efficiency and accuracy.
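The idea of quantizing sensitive weights first and letting the remaining weights absorb the error can be sketched with the classic Optimal Brain Surgeon update. This is a generic illustration, not the paper's algorithm: the sensitivity score, the function names, and the small inverse Hessian are assumptions, and the row-parallel framework with a shared inverse Hessian is not reproduced.

```python
def quantize(x, step):
    """Round x to the nearest multiple of step (uniform quantizer)."""
    return round(x / step) * step


def sensitivity_ordered_quantize(weights, h_inv, step):
    """Quantize one row of weights, most sensitive weight first.

    h_inv is the inverse Hessian of a layer-wise reconstruction loss
    (symmetric positive definite), given as a list of lists.
    """
    w = list(weights)
    h = [row[:] for row in h_inv]
    remaining = set(range(len(w)))

    def score(i):
        # OBS-style saliency, used here as a stand-in sensitivity score.
        return (w[i] - quantize(w[i], step)) ** 2 / h[i][i]

    while remaining:
        q = max(remaining, key=score)
        target = quantize(w[q], step)
        err = (w[q] - target) / h[q][q]
        remaining.remove(q)
        # Shift the still-unquantized weights to absorb the quantization error.
        for j in remaining:
            w[j] -= err * h[j][q]
        w[q] = target
        # Remove q from the inverse Hessian (rank-one downdate).
        col = [h[i][q] for i in range(len(w))]
        pivot = h[q][q]
        for i in remaining:
            for j in remaining:
                h[i][j] -= col[i] * col[j] / pivot
    return w


h_inv = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]]
quantized = sensitivity_ordered_quantize([0.3, -0.7, 1.2], h_inv, step=0.5)
```

With a diagonal inverse Hessian the compensation term vanishes and the routine reduces to plain round-to-nearest; the off-diagonal entries are what let later weights offset earlier rounding errors.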
[55] Reconstruction and Reenactment Separated Method for Realistic Gaussian Head
Zhiling Ye,Cong Zhou,Xiubao Zhang,Haifeng Shen,Weihong Deng,Quan Lu
Main category: cs.CV
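TL;DR: The paper proposes a framework that separates reconstruction from reenactment. It generates a controllable Gaussian head avatar from a single portrait image, renders at a high frame rate (90 FPS at 512x512), and outperforms existing methods in experiments.
Details
Motivation: Current 3D Gaussian head generation methods generalize poorly from a single input image and struggle to reconstruct high-frequency texture, and they trade driving efficiency against reconstruction quality. The paper aims to resolve these issues through a separated design. Contribution: 1. Proposes a framework that separates reconstruction from reenactment; 2. develops a large-scale one-shot Gaussian head generator built upon WebSSL; 3. improves generalization and high-frequency texture reconstruction through two-stage training.
Method: A large-scale one-shot generator built upon WebSSL is trained in two stages to improve generalization and high-frequency detail. Reconstruction and reenactment are separate modules, so driving efficiency does not depend on the parameter scale of the reconstruction module.
Result: Achieves real-time rendering at 90 FPS at 512x512 resolution; performance improves as the reconstruction module's parameter scale grows, and experiments show the method outperforms existing techniques.
Insight: The separated design lets reconstruction and reenactment be optimized independently, improving quality while keeping efficiency, and gives a scalable approach to Gaussian head generation.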
Abstract: In this paper, we explore a reconstruction and reenactment separated framework for 3D Gaussian heads, which requires only a single portrait image as input to generate a controllable avatar. Specifically, we developed a large-scale one-shot Gaussian head generator built upon WebSSL and employed a two-stage training approach that significantly enhances the capabilities of generalization and high-frequency texture reconstruction. During inference, an ultra-lightweight Gaussian avatar driven by control signals enables high frame-rate rendering, achieving 90 FPS at a resolution of 512x512. We further demonstrate that the proposed framework follows the scaling law, whereby increasing the parameter scale of the reconstruction module leads to improved performance. Moreover, thanks to the separation design, driving efficiency remains unaffected. Finally, extensive quantitative and qualitative experiments validate that our approach outperforms current state-of-the-art methods.
[56] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
Jungin Park,Jiyoung Lee,Kwanghoon Sohn
Main category: cs.CV
TL;DR: Paper proposes VideoGraph, a language-guided recursive spatiotemporal graph model for video summarization, emphasizing semantic relationships between objects and frames.
Details
Motivation: Existing methods focus on temporal modeling but overlook fine-grained visual entities and semantic relationships. Language guidance is crucial for complex video understanding. Contribution: Introduces VideoGraph, a novel framework combining spatial (objects) and temporal (frames) graphs with language-guided semantic relationships.
Method: Recursive spatiotemporal graph networks connect objects and frames as nodes. Language queries enhance semantic node representations.
Result: Achieves state-of-the-art performance on benchmarks for generic and query-focused video summarization.
Insight: Semantic understanding via language-guidance improves summarization, and recursive refinement enhances keyframe classification.
Abstract: Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at https://github.com/park-jungin/videograph.
[57] Patch-level Kernel Alignment for Self-Supervised Dense Representation Learning
Juan Yeo,Ijun Jang,Taesup Kim
Main category: cs.CV
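TL;DR: The paper proposes Patch-level Kernel Alignment (PaKA), a simple yet effective alignment objective for self-supervised dense representation learning. It improves dense-task performance by aligning feature distributions between a teacher and a student model.
Details
Motivation: Existing self-supervised methods focus on global representations and fall short of the localized semantics that dense prediction tasks need, so a method is needed to transfer semantic knowledge into the dense feature space. Contribution: Proposes PaKA, which aligns the dense feature distributions of teacher and student models and captures statistical dependencies; also studies augmentation strategies for dense representation learning.
Method: PaKA uses a kernel alignment objective to match the structural relationships of dense patches between teacher and student; dedicated augmentation strategies are designed for dense representation learning.
Result: Achieves SOTA performance on multiple dense vision benchmarks.
Insight: Dense representation learning needs dedicated alignment objectives and augmentation strategies; by capturing statistical dependencies, PaKA improves the modeling of localized semantics.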
Abstract: Dense representations are essential for vision tasks that require spatial precision and fine-grained detail. While most self-supervised representation learning methods focus on global representations that summarize the image as a whole, such approaches often fall short in capturing the localized semantics necessary for dense prediction tasks. To overcome these limitations, we propose a framework that builds on pretrained representations through additional self-supervised learning, aiming to transfer existing semantic knowledge into the dense feature space. Our method aligns the distributions of dense features between a teacher and a student model. Specifically, we introduce Patch-level Kernel Alignment (PaKA), a simple yet effective alignment objective that captures statistical dependencies, thereby matching the structural relationships of dense patches across the two models. In addition, we investigate augmentation strategies specifically designed for dense representation learning. Our framework achieves state-of-the-art results across a variety of dense vision benchmarks, demonstrating the effectiveness of our approach.
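One widely used way to compare the patch-to-patch structure of two models is linear centered kernel alignment (CKA). The sketch below computes it for teacher and student patch features. Whether PaKA uses this exact form, a different kernel, or a different normalization is not stated in the abstract, so treat the formula and all names here as assumptions.

```python
import math


def gram(features):
    """Linear-kernel Gram matrix over patches: K[i][j] = <f_i, f_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]


def center(k):
    """Double-center a square kernel matrix."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    col_mean = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - col_mean[j] + total for j in range(n)]
            for i in range(n)]


def frobenius_inner(a, b):
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def kernel_alignment(teacher, student):
    """Linear CKA between two sets of patch features (same patch count).

    Assumes neither feature set is constant across patches.
    """
    kt = center(gram(teacher))
    ks = center(gram(student))
    return frobenius_inner(kt, ks) / math.sqrt(
        frobenius_inner(kt, kt) * frobenius_inner(ks, ks))


# Four patches; teacher and student may use different feature widths.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
student = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
score = kernel_alignment(teacher, student)
```

A training loss could then be taken as `1 - kernel_alignment(teacher, student)`. Because only Gram matrices are compared, the measure ignores feature scale and does not require the two models to share an embedding width.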
[58] SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning
Hanzhen Wang,Jiaming Xu,Jiayi Pan,Yongkang Zhou,Guohao Dai
Main category: cs.CV
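TL;DR: SpecPrune-VLA combines local information from the current action with global information from past actions in a two-level pruning scheme with heuristic control. It accelerates VLA models with a clear speedup and negligible loss in success rate.
Details
Motivation: Existing pruning methods for VLA models rely only on local information from the current action and ignore global context from prior actions, which lowers success rates and limits speedup. Contribution: Proposes SpecPrune-VLA, a training-free method that combines static and dynamic pruning with a lightweight action-aware controller, delivering a speedup while keeping success rates high.
Method: 1) Static pruning: uses global history and local context to reduce visual tokens; 2) dynamic pruning: prunes tokens per layer according to layer-specific importance; 3) action-aware controller: adjusts pruning aggressiveness according to action granularity.
Result: On LIBERO, SpecPrune-VLA is 1.46x faster than OpenVLA-OFT on an A800 and 1.57x faster on an RTX 3090, with negligible success-rate loss.
Insight: A pruning strategy that combines historical and local information retains key tokens more reliably and improves efficiency, especially for fine-grained actions.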
Abstract: Pruning accelerates compute-bound models by reducing computation. Recently applied to Vision-Language-Action (VLA) models, existing methods prune tokens using only local info from current action, ignoring global context from prior actions, causing >20% success rate drop and limited speedup. We observe high similarity across consecutive actions and propose leveraging both local (current) and global (past) info for smarter token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) Static pruning at action level: uses global history and local context to reduce visual tokens per action; (2) Dynamic pruning at layer level: prunes tokens per layer based on layer-specific importance; (3) Lightweight action-aware controller: classifies actions as coarse/fine-grained (by speed), adjusting pruning aggressiveness since fine-grained actions are pruning-sensitive. Experiments on LIBERO show SpecPrune-VLA achieves 1.46 times speedup on NVIDIA A800 and 1.57 times on NVIDIA GeForce RTX 3090 vs. OpenVLA-OFT, with negligible success rate loss.
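The interplay between token selection and the action-aware controller can be sketched as follows. The abstract says only that local and global information are combined and that fine-grained (slow) actions are pruned less aggressively; taking the maximum of the two scores, the speed threshold, and the keep ratios below are hypothetical choices for illustration.

```python
def keep_ratio_for(speed, speed_threshold=0.05, fine_ratio=0.8, coarse_ratio=0.4):
    """Slow (fine-grained) actions keep more tokens; all values are hypothetical."""
    return fine_ratio if speed < speed_threshold else coarse_ratio


def select_tokens(local_scores, global_scores, keep_ratio):
    """Keep the top-scoring visual tokens under a combined importance score."""
    combined = [max(l, g) for l, g in zip(local_scores, global_scores)]
    k = max(1, round(keep_ratio * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:k])


local_scores = [0.9, 0.1, 0.2, 0.05, 0.3]   # importance from the current action
global_scores = [0.1, 0.8, 0.1, 0.05, 0.2]  # importance carried over from past actions
coarse_kept = select_tokens(local_scores, global_scores, keep_ratio_for(speed=0.20))
fine_kept = select_tokens(local_scores, global_scores, keep_ratio_for(speed=0.01))
```

Token 1 looks unimportant to the current action but survives because past actions relied on it, which is the failure case that purely local pruning would miss.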
[59] SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
Kien Nguyen,Anh Tran,Cuong Pham
Main category: cs.CV
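TL;DR: SuMa is a subspace mapping method for text-to-image diffusion models that erases narrow concepts (such as copyrighted content or celebrities) efficiently and robustly, addressing the inability of existing methods to deliver both robustness and image quality.
Details
Motivation: As text-to-image diffusion models spread, their potential misuse (for example, generating harmful or infringing content) raises concern. Existing concept erasure methods struggle to achieve both robustness (fully removing the target concept) and effectiveness (maintaining image quality), especially for narrow concepts such as copyrighted characters or celebrities. Contribution: 1. Proposes SuMa, which robustly erases narrow concepts through subspace mapping; 2. designs a mapping between a target subspace and a reference subspace that balances erasure against image quality; 3. validates the method on four tasks.
Method: 1. Derive a subspace representing the target concept; 2. map it to a reference subspace that minimizes the distance between the two, neutralizing the concept; 3. preserve the semantics of non-target concepts to maintain image quality.
Result: SuMa performs well on subclass, celebrity, artistic style, and instance erasure. Its image quality is comparable to methods that focus on effectiveness, and its erasure is on par with methods that focus on robustness.
Insight: Erasing narrow concepts requires fine-grained subspace manipulation. SuMa's mapping mechanism overcomes the limits of existing methods on narrow concepts and offers a new approach to copyright and legal concerns.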
Abstract: The rapid growth of text-to-image diffusion models has raised concerns about their potential misuse in generating harmful or unauthorized content. To address these issues, several Concept Erasure methods have been proposed. However, most of them fail to achieve both robustness, i.e., the ability to robustly remove the target concept, and effectiveness, i.e., maintaining image quality. While a few recent techniques successfully achieve these goals for NSFW concepts, none could handle narrow concepts such as copyrighted characters or celebrities. Erasing these narrow concepts is critical in addressing copyright and legal concerns. However, erasing them is challenging due to their close distances to non-target neighboring concepts, requiring finer-grained manipulation. In this paper, we introduce Subspace Mapping (SuMa), a novel method specifically designed to achieve both robustness and effectiveness in erasing these narrow concepts. SuMa first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping ensures the target concept is robustly erased while preserving image quality. We conduct extensive experiments with SuMa across four tasks: subclass erasure, celebrity erasure, artistic style erasure, and instance erasure, and compare the results with current state-of-the-art methods. Our method achieves image quality comparable to approaches focused on effectiveness, while also yielding results that are on par with methods targeting completeness.
[60] Self-supervised Learning for Hyperspectral Images of Trees
Moqsadur Rahman,Saurav Kumar,Santosh S. Palmate,M. Shahriar Hossain
Main category: cs.CV
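TL;DR: The paper studies self-supervised learning on hyperspectral images of trees and proposes an embedding-space representation tied to vegetation properties that improves downstream task performance.
Details
Motivation: Hyperspectral imagery matters for precision agriculture, but the lack of labeled data limits its use. The paper aims to address this with self-supervised learning that extracts meaningful tree vegetation features directly from hyperspectral images. Contribution: The main contribution is a self-supervised method that builds an embedding space related to vegetation properties; this representation outperforms the direct use of hyperspectral vegetation properties in downstream machine learning tasks.
Method: A self-supervised framework in which a neural network learns embeddings of hyperspectral images, focusing on the vegetation properties of trees.
Result: Experiments show that the constructed tree representation outperforms the conventional representation based on hyperspectral vegetation properties in downstream tasks.
Insight: The paper shows the potential of self-supervised learning for hyperspectral image analysis, particularly when labels are scarce, as a more efficient route to feature extraction for precision agriculture.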
Abstract: Aerial remote sensing using multispectral and RGB imagers has provided a critical impetus to precision agriculture. Analysis of the hyperspectral images with limited or no labels is challenging. This paper focuses on self-supervised learning to create neural network embeddings reflecting vegetation properties of trees from aerial hyperspectral images of crop fields. Experimental results demonstrate that a constructed tree representation, using a vegetation property-related embedding space, performs better in downstream machine learning tasks compared to the direct use of hyperspectral vegetation properties as tree representations.
[61] Evaluating YOLO Architectures: Implications for Real-Time Vehicle Detection in Urban Environments of Bangladesh
Ha Meem Hossain,Pritam Nath,Mahitun Nesa Mahi,Imtiaz Uddin,Ishrat Jahan Eiste,Syed Nasibur Rahman Ratul,Md Naim Uddin Mozumdar,Asif Mohammed Saad
Main category: cs.CV
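TL;DR: The study evaluates six YOLO architectures for vehicle detection in Bangladeshi urban environments. YOLOv11x scores highest, but medium-sized models offer the better trade-off between accuracy and speed. The study also exposes the difficulty of detecting rare vehicle classes.
Details
Motivation: Existing vehicle detection systems perform poorly in Bangladesh's distinctive road environments, so models tuned to local vehicle types are needed to support autonomous driving technology. Contribution: 1. Builds a dataset covering 29 local vehicle classes; 2. evaluates several YOLO models in the local setting; 3. finds that medium-sized models balance accuracy and speed.
Method: A high-resolution image dataset is manually annotated in YOLO format and used to train six YOLO models, which are evaluated on mAP, recall, and F1-score.
Result: YOLOv11x performs best (63.7% mAP@0.5), but medium models such as YOLOv8m offer a better balance of accuracy and speed. Detection of rare vehicle classes is poor.
Insight: Dataset imbalance and insufficient samples strongly hurt detection of rare vehicles, and visually similar vehicles are easily confused; targeted optimization is needed to make the models practical.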
Abstract: Vehicle detection systems trained on non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh’s unique road environments, creating critical gaps in autonomous driving technology for developing regions. This study evaluates six YOLO model variants on a custom dataset featuring 29 distinct vehicle classes, including region-specific vehicles such as “Desi Nosimon”, “Leguna”, “Battery Rickshaw”, and “CNG”. The dataset comprises high-resolution images (1920x1080) captured across various Bangladeshi roads using mobile phone cameras and manually annotated using LabelImg with YOLO format bounding boxes. Performance evaluation revealed YOLOv11x as the top performer, achieving 63.7% mAP@0.5, 43.8% mAP@0.5:0.95, 61.4% recall, and 61.6% F1-score, though requiring 45.8 milliseconds per image for inference. Medium variants (YOLOv8m, YOLOv11m) struck an optimal balance, delivering robust detection performance with mAP@0.5 values of 62.5% and 61.8% respectively, while maintaining moderate inference times around 14-15 milliseconds. The study identified significant detection challenges for rare vehicle classes, with Construction Vehicles and Desi Nosimons showing near-zero accuracy due to dataset imbalances and insufficient training samples. Confusion matrices revealed frequent misclassifications between visually similar vehicles, particularly Mini Trucks versus Mini Covered Vans. This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology advancement for developing regions where conventional generic-trained models fail to perform adequately.
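The reported numbers can be cross-checked with a little arithmetic. The sketch below converts per-image latency to throughput and backs out the precision implied by the reported recall and F1-score. The second step assumes the aggregate F1 is the harmonic mean of aggregate precision and recall, which the abstract does not confirm; if F1 was macro-averaged per class, the identity need not hold.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-image inference latency in milliseconds."""
    return 1000.0 / latency_ms


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    return f1 * recall / (2.0 * recall - f1)


fps_large = frames_per_second(45.8)    # YOLOv11x latency from the abstract
fps_medium = frames_per_second(14.5)   # midpoint of the reported 14-15 ms range
precision = implied_precision(0.616, 0.614)
```

Under these assumptions YOLOv11x runs at roughly 22 images per second against roughly 69 for the medium variants, and its implied precision is about 61.8%, close to its recall.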
[62] EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image Generation
Guandong Li,Zhaobin Chu
Main category: cs.CV
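TL;DR: EditIDv2 is a tuning-free solution for character editing in complex narrative scenes with long text inputs. By improving the ID feature integration module and applying data lubrication, it achieves multi-level editing that preserves identity consistency and semantic depth in highly complex settings.
Details
Motivation: Existing character editing methods perform poorly on long text with multiple semantic layers, suffering degraded editing ability, semantic bias, and loss of identity consistency. Contribution: Proposes EditIDv2, which decomposes PerceiverAttention, introduces an ID loss with dynamic training, and adds an offline fusion strategy, achieving high-quality editing and identity consistency with only a small amount of data lubrication.
Method: Decomposes PerceiverAttention, trains the ID loss jointly and dynamically with the diffusion model, and applies an offline fusion strategy to optimize the ID feature integration module.
Result: Performs well in the IBench evaluation and meets the demands of long prompts and high-quality image generation.
Insight: A limited amount of data lubrication is enough to support deep semantic editing in complex narrative settings, and the design of the ID feature integration module is critical to editability.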
Abstract: We propose EditIDv2, a tuning-free solution specifically designed for high-complexity narrative scenes and long text inputs. Existing character editing methods perform well under simple prompts, but often suffer from degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when faced with long text narratives containing multiple semantic layers, temporal logic, and complex contextual relationships. In EditID, we analyzed the impact of the ID integration module on editability. In EditIDv2, we further explore and address the influence of the ID feature integration module. The core of EditIDv2 is to discuss the issue of editability injection under minimal data lubrication. Through a sophisticated decomposition of PerceiverAttention, the introduction of ID loss and joint dynamic training with the diffusion model, as well as an offline fusion strategy for the integration module, we achieve deep, multi-level semantic editing while maintaining identity consistency in complex narrative environments using only a small amount of data lubrication. This meets the demands of long prompts and high-quality image generation, and achieves excellent results in the IBench evaluation.
[63] OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation
Xiaomeng Zhu,Changwei Wang,Haozhe Wang,Xinyu Liu,Fangzhen Lin
Main category: cs.CV
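TL;DR: The paper proposes OOTSM, a method that decouples scene graph anticipation into two steps (visual capture and text-only prediction) to make long-term prediction more robust.
Details
Motivation: Existing scene graph anticipation methods rely mainly on visual cues and struggle to integrate commonsense knowledge, which limits the robustness of long-term prediction. Contribution: Proposes OOTSM, which decouples the task into two steps (visual capture plus text prediction) and uses an LLM to predict object changes and relations, substantially improving long-term prediction.
Method: Two steps: 1) a scene graph capturing model converts the video into a sequence of scene graphs; 2) a text-only model predicts the scene graphs of future frames (LSGA). OOTSM uses an LLM in two stages, first predicting object changes and then relations.
Result: On the Action Genome benchmark, short-term mean-Recall improves by 3.4% and long-term mean-Recall by 21.9%.
Insight: Explicit use of commonsense knowledge and language models substantially improves scene graph anticipation, especially over long horizons.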
Abstract: A scene graph is a structured representation of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications such as intelligent surveillance and human-machine collaboration. Existing SGA approaches primarily leverage visual cues, often struggling to integrate valuable commonsense knowledge, thereby limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task in two steps: first a scene graph capturing model is used to convert a video clip into a sequence of scene graphs, then a pure text-based model is used to predict scene graphs in future frames. Our focus in this work is on the second step, which we call Linguistic Scene Graph Anticipation (LSGA) and believe should have independent interest beyond the use in SGA discussed here. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where a Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at https://github.com/ZhuXMMM/OOTSM.
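The mean-Recall@K metric quoted above averages recall over predicate classes rather than over triplets, so rare relations count as much as common ones. Below is a simplified single-frame sketch; the full Action Genome protocol (per-frame averaging, constraint settings) is not reproduced, and the function name and toy triplets are assumptions.

```python
from collections import defaultdict


def mean_recall_at_k(ground_truth, ranked_predictions, k):
    """Average per-predicate recall over the top-k predicted triplets.

    Triplets are (subject, predicate, object) tuples; ground_truth must be non-empty.
    """
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


ground_truth = [
    ("person", "holding", "cup"),
    ("person", "holding", "phone"),
    ("person", "sitting_on", "chair"),
]
ranked_predictions = [
    ("person", "holding", "cup"),
    ("person", "sitting_on", "chair"),
    ("person", "holding", "book"),
    ("person", "holding", "phone"),
]
mr_at_2 = mean_recall_at_k(ground_truth, ranked_predictions, k=2)
mr_at_4 = mean_recall_at_k(ground_truth, ranked_predictions, k=4)
```

At k=2 the top predictions recover half of the "holding" triplets and all of the "sitting_on" triplets, so mean recall is 0.75 even though plain recall over all three triplets would be 2/3.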
[64] WIPUNet: A Physics-inspired Network with Weighted Inductive Biases for Image Denoising
Wasikul Islam
Main category: cs.CV
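TL;DR: The paper proposes WIPUNet, a physics-inspired network that uses weighted inductive biases to make image denoising more robust under heavy noise.
Details
Motivation: "Pileup" noise in high-energy particle physics resembles the image denoising problem, and the author aims to use physical principles such as conservation, locality, and isolation to make models more robust. Contribution: (i) Translates pileup-mitigation principles into modular inductive biases; (ii) integrates them into UNet; (iii) demonstrates robustness gains at high noise without relying on heavy SOTA machinery.
Method: Introduces a hierarchy of PU-inspired denoisers: a residual CNN with conservation constraints, its Gaussian-noise variants, and WIPUNet (a weighted-inductive-bias UNet).
Result: On CIFAR-10 and BSD500, WIPUNet clearly outperforms baseline models at high noise levels.
Insight: Physics-inspired inductive biases can stabilize models under heavy noise without relying on complex data-driven methods.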
Abstract: In high-energy particle physics, collider measurements are contaminated by “pileup”, overlapping soft interactions that obscure the hard-scatter signal of interest. Dedicated subtraction strategies exploit physical priors such as conservation, locality, and isolation. Inspired by this analogy, we investigate how such principles can inform image denoising by embedding physics-guided inductive biases into neural architectures. This paper is a proof of concept: rather than targeting state-of-the-art (SOTA) benchmarks, we ask whether physics-inspired priors improve robustness under strong corruption. We introduce a hierarchy of PU-inspired denoisers: a residual CNN with conservation constraints, its Gaussian-noise variants, and the Weighted Inductive Pileup-physics-inspired U-Network for Denoising (WIPUNet), which integrates these ideas into a UNet backbone. On CIFAR-10 with Gaussian noise at $\sigma \in \{15, 25, 50, 75, 100\}$, PU-inspired CNNs are competitive with standard baselines, while WIPUNet shows a widening margin at higher noise. Complementary BSD500 experiments show the same trend, suggesting physics-inspired priors provide stability where purely data-driven models degrade. Our contributions are: (i) translating pileup-mitigation principles into modular inductive biases; (ii) integrating them into UNet; and (iii) demonstrating robustness gains at high noise without relying on heavy SOTA machinery.
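To give a feel for the corruption levels in this evaluation, the sketch below adds Gaussian noise at each listed sigma and reports the PSNR of the noisy input before any denoising. It assumes pixel intensities on a 0-255 scale, which the sigma values suggest but the abstract does not state; the flat list of pixels and the function names are illustrative only.

```python
import math
import random


def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the assumed 0-255 range."""
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def psnr(clean, noisy, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((c - n) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)
clean = [float(i % 256) for i in range(4096)]  # synthetic ramp standing in for an image
input_psnr = {
    sigma: psnr(clean, add_gaussian_noise(clean, sigma, rng))
    for sigma in (15, 25, 50, 75, 100)
}
```

Ignoring clipping, the noisy input's PSNR is about 20·log10(255/sigma), so it falls from roughly 24.6 dB at sigma = 15 to roughly 8.1 dB at sigma = 100. That is the regime in which the abstract reports a widening margin for WIPUNet.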
[65] Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance
Weijie Shen,Xinrui Wang,Yuanqi Nie,Apiradee Boonmee
Main category: cs.CV
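TL;DR: The paper proposes CAMVR, a framework that uses dynamic memory and adaptive visual guidance to strengthen multi-turn visual-textual reasoning in LVLMs, addressing context loss and fragmented reasoning.
Details
Motivation: Current LVLMs and LLMs show weak contextual understanding and fragmented visual reasoning in multi-turn interaction, so a stronger multimodal reasoning framework is needed. Contribution: 1. Proposes a dynamic read-write memory unit (VCMU) that stores multimodal context; 2. introduces an adaptive visual focus mechanism (AVFG) that dynamically adjusts the attended image regions; 3. adds a multi-level reasoning strategy that keeps generated responses coherent.
Method: Combines the VCMU and AVFG mechanisms, using historical multimodal context to guide reasoning dynamically.
Result: Achieves SOTA performance on the VisDial, A-OKVQA, and MTIF datasets.
Insight: Dynamic memory and adaptive visual guidance improve the coherence and accuracy of multi-turn visual-textual reasoning.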
Abstract: Current Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU’s context to dynamically adjust the visual encoder’s attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.
[66] Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization
Jingwei Peng,Zhixuan Qiu,Boyu Jin,Surasakdi Siripong
Main category: cs.CV
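TL;DR: The paper proposes LVLM-VAR, a framework that applies pre-trained Vision-Language Large Models (LVLMs) to video action recognition. A Video-to-Semantic-Tokens (VST) module converts raw video sequences into discrete, semantically and temporally consistent semantic action tokens, which are combined with natural language instructions and processed by a LoRA-fine-tuned LVLM, improving both accuracy and interpretability.
Details
Motivation: Traditional human action recognition methods are limited in deep semantic understanding, complex contextual information, and fine-grained distinction. Contribution: Proposes the LVLM-VAR framework, which pioneers the application of pre-trained LVLMs to video action recognition, designs the Video-to-Semantic-Tokens (VST) module, and improves accuracy and interpretability.
Method: The VST module converts video sequences into semantic action tokens, which are combined with natural language instructions and passed to a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for classification and semantic reasoning.
Result: Achieves leading or highly competitive performance on benchmarks such as NTU RGB+D and NTU RGB+D 120 (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), and improves interpretability through natural language explanations.
Insight: Vision-language large models can substantially improve video action recognition, while semantic tokens and natural language explanations make the model more interpretable.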
Abstract: Human action recognition often struggles with deep semantic understanding, complex contextual information, and fine-grained distinction, limitations that traditional methods frequently encounter when dealing with diverse video data. Inspired by the remarkable capabilities of large language models, this paper introduces LVLM-VAR, a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition, emphasizing enhanced accuracy and interpretability. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent “semantic action tokens,” effectively crafting an “action narrative” that is comprehensible to an LVLM. These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for robust action classification and semantic reasoning. LVLM-VAR not only achieves state-of-the-art or highly competitive performance on challenging benchmarks such as NTU RGB+D and NTU RGB+D 120, demonstrating significant improvements (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), but also substantially boosts model interpretability by generating natural language explanations for its predictions.
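For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update that such fine-tuning applies to a frozen weight matrix: W_eff = W + (alpha / r) · B · A. It illustrates the general technique only; the abstract does not give the rank, scaling factor, or which layers of the LVLM are adapted, so every value and name here is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA versus full fine-tuning of one matrix."""
    return {"lora": r * (d_in + d_out), "full": d_in * d_out}


w = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # frozen 2 x 3 weight
a = [[1.0, 2.0, 3.0]]                   # rank-1 down-projection (1 x 3)
b = [[1.0], [0.0]]                      # rank-1 up-projection (2 x 1)
w_eff = lora_effective_weight(w, a, b, alpha=2.0)
counts = trainable_params(d_out=4096, d_in=4096, r=8)
```

Only `a` and `b` would receive gradients; `w` stays frozen, which is why adapting a 13B-parameter model this way is far cheaper than full fine-tuning.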
[67] JRN-Geo: A Joint Perception Network based on RGB and Normal images for Cross-view Geo-localization
Hongyu Zhou,Yunzhou Zhang,Tingsong Huang,Fawei Ge,Man Qi,Xichen Zhang,Yizhong Zhang
Main category: cs.CV
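TL;DR: JRN-Geo is a joint perception network based on RGB and normal images for cross-view geo-localization. It deeply fuses semantic and geometric structural information and adds a 3D geographic augmentation technique, making the model more robust to viewpoint changes.
Details
Motivation: In cross-view geo-localization, viewpoint differences and appearance variations are large, yet existing methods rely mainly on semantic features from RGB images and neglect spatial structural information. Contribution: 1. Proposes the JRN-Geo network, which combines semantic information from RGB images with geometric structural information from normal images. 2. Designs a Difference-Aware Fusion Module (DAFM) and a Joint-Constrained Interaction Aggregation (JCIA) strategy. 3. Introduces a 3D geographic augmentation technique that generates viewpoint variation samples.
Method: A dual-branch feature extraction framework uses DAFM and JCIA for deep fusion, and 3D geographic augmentation strengthens feature learning.
Result: Experiments on the University-1652 and SUES-200 datasets validate the model's robustness and show SOTA performance.
Insight: Fusing geometric structural information (normal images) with semantic information (RGB images) improves cross-view geo-localization, especially under complex viewpoint changes.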
Abstract: Cross-view geo-localization plays a critical role in Unmanned Aerial Vehicle (UAV) localization and navigation. However, significant challenges arise from the drastic viewpoint differences and appearance variations between images. Existing methods predominantly rely on semantic features from RGB images, often neglecting the importance of spatial structural information in capturing viewpoint-invariant features. To address this issue, we incorporate geometric structural information from normal images and introduce a Joint perception network to integrate RGB and Normal images (JRN-Geo). Our approach utilizes a dual-branch feature extraction framework, leveraging a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy to enable deep fusion and joint-constrained semantic and structural information representation. Furthermore, we propose a 3D geographic augmentation technique to generate potential viewpoint variation samples, enhancing the network’s ability to learn viewpoint-invariant features. Extensive experiments on the University-1652 and SUES-200 datasets validate the robustness of our method against complex viewpoint variations, achieving state-of-the-art performance.
[68] Knowledge-Augmented Vision Language Models for Underwater Bioacoustic Spectrogram Analysis
Ragib Amin Nihal,Benjamin Yen,Takeshi Ashizawa,Kazuhiro Nakadai
Main category: cs.CV
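TL;DR: The paper proposes a framework that combines a Vision Language Model (VLM) with a Large Language Model (LLM) to analyze underwater bioacoustic spectrograms without manual annotation or model retraining.
Details
Motivation: Analyzing marine mammal vocalizations depends on interpreting bioacoustic spectrograms, but existing VLMs are not trained on these domain-specific visualizations, which limits their use. Contribution: The main contribution is a method that needs no manual annotation or retraining: a VLM interprets the spectrograms and an LLM validates the output, allowing adaptation to and analysis of acoustic data.
Method: The VLM's visual interpretation of spectrograms is combined with LLM-based validation to build domain knowledge, which is then used to analyze meaningful patterns in the spectrograms.
Result: The results indicate that the method can extract meaningful patterns from spectrograms and adapt to acoustic data.
Insight: The study points to the potential of VLMs for domain-specific visualization tasks, particularly when paired with LLMs to improve adaptability.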
Abstract: Marine mammal vocalization analysis depends on interpreting bioacoustic spectrograms. Vision Language Models (VLMs) are not trained on these domain-specific visualizations. We investigate whether VLMs can extract meaningful patterns from spectrograms visually. Our framework integrates VLM interpretation with LLM-based validation to build domain knowledge. This enables adaptation to acoustic data without manual annotation or model retraining.
[69] LiDAR-BIND-T: Improving SLAM with Temporally Consistent Cross-Modal LiDAR Reconstruction
Niels Balemans,Ali Anwar,Jan Steckel,Siegfried Mercelis
Main category: cs.CV
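TL;DR: The paper extends the LiDAR-BIND framework with temporal consistency mechanisms, yielding LiDAR-BIND-T, which markedly improves SLAM robustness and performance.
Details
Motivation: LiDAR-BIND lacks an explicit temporal consistency mechanism in multi-modal sensor fusion, which limits SLAM performance. The paper aims to improve the spatio-temporal coherence of cross-modal LiDAR reconstruction through temporal consistency modules. Contribution: Three core contributions: (i) temporal embedding similarity that aligns consecutive latents, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground-truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. The model architecture is also updated to better preserve spatial structure.
Method: Temporal embedding similarity, a motion-aligned transformation loss, and a windowed temporal fusion module, together with an architecture that better preserves spatial structure, enforce temporal consistency in cross-modal LiDAR reconstruction.
Result: Experiments show improved temporal and spatial coherence in radar/sonar-to-LiDAR translation, lower absolute trajectory error, and better occupancy map accuracy in SLAM. New evaluation metrics based on FVMD and a correlation-peak distance are proposed.
Insight: Explicit temporal consistency mechanisms are essential for multi-modal SLAM. LiDAR-BIND-T keeps the plug-and-play modular design while markedly improving temporal stability.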
Abstract: This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latents, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We propose different metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric providing practical temporal quality indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
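Contribution (i) above, aligning consecutive latents, can be illustrated with a small sketch. It assumes the alignment is measured by cosine similarity between neighbouring latent vectors and averaged over the sequence; the abstract does not give the exact loss, so the function names and the toy latents below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def temporal_similarity_loss(latents):
    """Mean of (1 - cosine similarity) over consecutive latent pairs.

    Assumption: this is one plausible form of a temporal embedding
    similarity term, not the loss defined in the paper.
    """
    pairs = list(zip(latents[:-1], latents[1:]))
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy latent sequences (hypothetical): one drifts slowly, one flips each frame.
smooth = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

A sequence whose latents change gradually gets a loss near zero, while a sequence that jumps between orthogonal latents gets a loss of one, which is the behaviour a temporal consistency term is meant to penalize.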
[70] Multi-LVI-SAM: A Robust LiDAR-Visual-Inertial Odometry for Multiple Fisheye Cameras
Xinyu Zhang,Kai Huang,Junqiao Zhao,Zihan Yuan,Tiantian Feng
Main category: cs.CV
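TL;DR: Multi-LVI-SAM is a LiDAR-visual-inertial odometry framework for multiple fisheye cameras. A panoramic visual feature model unifies the multi-camera observations and improves the accuracy and robustness of state estimation.
Details
Motivation: Existing approaches handle multi-camera data inefficiently and inconsistently, which hurts state estimation accuracy and complicates system design. Contribution: A panoramic visual feature model and an extrinsic compensation method, which address the unified representation of multi-camera observations and the triangulation inconsistency.
Method: The panoramic visual feature model is tightly coupled with a LiDAR-inertial system in a factor graph, serving as a global geometric optimization framework that consolidates multi-view constraints.
Result: Experiments on public datasets show higher accuracy and robustness than existing multi-camera LiDAR-visual-inertial systems.
Insight: The panoramic model simplifies multi-camera system design, and extrinsic compensation significantly improves feature consistency and triangulation accuracy.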
Abstract: We propose a multi-camera LiDAR-visual-inertial odometry framework, Multi-LVI-SAM, which fuses data from multiple fisheye cameras, LiDAR and inertial sensors for highly accurate and robust state estimation. To enable efficient and consistent integration of visual information from multiple fisheye cameras, we introduce a panoramic visual feature model that unifies multi-camera observations into a single representation. The panoramic model serves as a global geometric optimization framework that consolidates multi-view constraints, enabling seamless loop closure and global pose optimization, while simplifying system design by avoiding redundant handling of individual cameras. To address the triangulation inconsistency caused by the misalignment between each camera’s frame and the panoramic model’s frame, we propose an extrinsic compensation method. This method improves feature consistency across views and significantly reduces triangulation and optimization errors, leading to more accurate pose estimation. We integrate the panoramic visual feature model into a tightly coupled LiDAR-visual-inertial system based on a factor graph. Extensive experiments on public datasets demonstrate that the panoramic visual feature model enhances the quality and consistency of multi-camera constraints, resulting in higher accuracy and robustness than existing multi-camera LiDAR-visual-inertial systems.
[71] Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation
Tianhao Guo,Bingjie Lu,Feng Wang,Zhengyang Lu
Main category: cs.CV
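TL;DR: The paper proposes a distance-adaptive super-resolution method built on a variational framework. It uses depth information to adjust the reconstruction strategy dynamically, addressing the limits of the spatial-invariance assumption in conventional super-resolution.
Details
Motivation: Conventional super-resolution assumes a spatially invariant degradation model, whereas real imaging systems exhibit complex distance-dependent effects (atmospheric scattering, depth-of-field variations, and so on) that call for adaptive reconstruction informed by geometric scene understanding. Contribution: (1) The first theoretically grounded variational framework that models super-resolution as a spatially varying inverse problem. (2) A neural architecture with depth-conditional convolution kernels that dynamically adjusts smoothness constraints. (3) State-of-the-art performance on multiple datasets.
Method: (1) Distance-dependent degradation is modeled with a pseudodifferential operator. (2) Cascaded residual blocks with depth-conditional convolution kernels implement discrete gradient flow dynamics. (3) Spectral constraints derived from atmospheric scattering theory prevent noise amplification in far-field regions.
Result: On KITTI outdoor scenes, 2× and 4× super-resolution reach PSNR/SSIM of 36.89/0.9516 and 30.54/0.8721, outperforming existing methods by 0.44 dB and 0.36 dB respectively.
Insight: Depth-aware adaptive super-resolution markedly improves reconstruction in depth-variant scenes while staying competitive on traditional benchmarks, and it provides theoretical grounding for research on complex degradation.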
Abstract: Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2× and 4× scales on KITTI outdoor scenes, outperforming existing methods by 0.44 dB and 0.36 dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
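To put the reported PSNR gains in context, the sketch below computes PSNR from its standard definition and converts a dB gain into the implied reduction in mean squared error. It is a generic illustration, not the paper's evaluation code; the 8-bit peak value of 255 is an assumption, and the function names are hypothetical.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists.

    Assumes 8-bit intensities (max_value=255) unless stated otherwise.
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition a 0.44 dB gain corresponds to an MSE ratio of about 0.90, that is, roughly a tenth less squared error than the baseline at the same peak value.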
[72] InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios
Leo Ho,Yinghao Huang,Dafei Qin,Mingyi Shi,Wangpok Tse,Wei Liu,Junichi Yamagishi,Taku Komura
Main category: cs.CV
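TL;DR: The paper introduces InterAct, a large-scale multi-modal dataset that captures dynamic, expressive, and semantically consistent interactions between two people in daily scenarios. The authors also present a simple yet effective diffusion-based method that generates interactive facial expressions and body motions from speech input.
Details
Motivation: Existing work usually considers only one person, or only the conversational gestures of two people, and assumes that body orientation and position stay fixed. Real interactions are more complex and involve dynamic changes over long durations and large spaces. InterAct is proposed to fill this gap. Contribution: (1) The InterAct dataset, with 241 long, multi-modal interaction sequences. (2) A diffusion-based method for generating interactive facial expressions and body motions. (3) Hierarchical motion regression and a novel fine-tuning mechanism.
Method: A diffusion model generates interactive motions and expressions from speech input. Body motion is regressed hierarchically, and a new fine-tuning mechanism improves lip accuracy.
Result: InterAct contains diverse, complex individual motions and long-term interaction patterns. The proposed method is effective for interactive motion generation, with a notable gain in lip accuracy.
Insight: Future research on multi-person interaction needs richer data, and InterAct provides a foundation for it. Diffusion models show promise for generating dynamic interactive motion.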
Abstract: We address the problem of accurate capture of interactive behaviors between two people in daily scenarios. Most previous works either only consider one person or solely focus on conversational gestures of two people, assuming the body orientation and/or position of each actor are constant or barely change over each interaction. In contrast, we propose to simultaneously model two people’s activities, and target objective-driven, dynamic, and semantically consistent interactions which often span longer duration and cover bigger space. To this end, we capture a new multi-modal dataset dubbed InterAct, which is composed of 241 motion sequences where two people perform a realistic and coherent scenario for one minute or longer over a complete interaction. For each sequence, two actors are assigned different roles and emotion labels, and collaborate to finish one task or conduct a common interaction activity. The audio, body motions, and facial expressions of both persons are captured. InterAct contains diverse and complex motions of individuals and interesting and relatively long-term interaction patterns barely seen before. We also demonstrate a simple yet effective diffusion-based method that estimates interactive facial expressions and body motions of two people from speech inputs. Our method regresses the body motions in a hierarchical manner, and we also propose a novel fine-tuning mechanism to improve the lip accuracy of facial expressions. To facilitate further research, the data and code are made available at https://hku-cg.github.io/interact/ .
[73] Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation
Bingrui Zhao,Lin Yuanbo Wu,Xiangtian Fan,Deyin Liu,Lu Zhang,Ruyi He,Jialie Shen,Ximing Li
Main category: cs.CV
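TL;DR: PARSE-VOS is a training-free hierarchical reasoning framework that uses large language models (LLMs) to perform coarse-to-fine referring video object segmentation.
Details
Motivation: Current methods struggle with complex language descriptions. PARSE-VOS aims to solve the alignment of dynamic video with static text through hierarchical reasoning. Contribution: The first LLM-based training-free framework for this task, which segments video objects through semantic parsing and hierarchical reasoning.
Method: The framework has three steps: (1) the language query is parsed into semantic commands; (2) a spatio-temporal grounding module generates candidate trajectories; (3) a hierarchical identification module selects the target object through two-stage reasoning.
Result: State-of-the-art performance on three benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
Insight: Hierarchical reasoning, from coarse-grained motion to fine-grained pose, is key to aligning complex language with dynamic video.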
Abstract: Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances with inconsistent motion and poses. However, current methods often rely on a holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE-VOS, a novel, training-free framework powered by Large Language Models (LLMs), for hierarchical, coarse-to-fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio-temporal grounding module that generates all candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
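The two-stage identification described above, coarse motion reasoning followed by conditional pose verification, can be sketched as plain control flow. In this sketch the two scoring callables stand in for the LLM-based reasoning steps, and the ambiguity `margin` is a hypothetical parameter; neither is taken from the paper.

```python
def select_target(candidates, motion_score, pose_score, margin=0.2):
    """Pick one candidate trajectory with coarse-to-fine reasoning.

    motion_score and pose_score are placeholder callables (assumptions)
    that return a higher value for a better match to the query.
    """
    if not candidates:
        raise ValueError("no candidate trajectories")
    ranked = sorted(candidates, key=motion_score, reverse=True)
    best = ranked[0]
    # Stage 1: keep every candidate whose motion score is close to the best.
    ambiguous = [c for c in ranked if motion_score(best) - motion_score(c) < margin]
    if len(ambiguous) == 1:
        return best
    # Stage 2: triggered only when motion alone cannot separate candidates.
    return max(ambiguous, key=pose_score)


# Hypothetical scores for three candidate trajectories.
motion = {"a": 0.9, "b": 0.85, "c": 0.2}
pose = {"a": 0.3, "b": 0.8, "c": 0.9}
chosen = select_target(["a", "b", "c"], motion.get, pose.get)
```

Here candidates "a" and "b" are too close on motion to separate, so the pose stage decides between them, while "c" is already excluded by the coarse stage even though its pose score is highest.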
[74] PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters
Zijian Chen,Wenjie Hua,Jinhao Li,Lirong Deng,Fan Du,Tingzhu Chen,Guangtao Zhai
Main category: cs.CV
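TL;DR: The paper introduces PictOBI-20k, a dataset for evaluating large multimodal models (LMMs) on the visual decipherment of pictographic oracle bone characters. Experiments show that current LMMs perform only modestly on this task.
Details
Motivation: Deciphering oracle bone characters (OBCs) is key to understanding humanity's early modes of production, but existing methods are constrained by sporadic archaeological excavation and a limited corpus. The strong visual perception of LMMs could improve visual decipherment of OBCs. Contribution: (1) The PictOBI-20k dataset, with 20k OBC and real-object images forming over 15k multiple-choice questions. (2) Evidence of the preliminary ability of LMMs on visual OBC decipherment, and of its limits.
Method: PictOBI-20k is constructed, subjective annotations are collected to study the consistency between humans and LMMs in visual reasoning, and LMMs are evaluated with multiple-choice experiments.
Result: Current LMMs have preliminary visual decipherment ability but are mostly limited by language priors and fail to use visual information effectively.
Insight: LMM performance on visual decipherment depends on better visual attention. Future work can use this dataset to improve the visual attention mechanisms of LMMs.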
Abstract: Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity’s early modes of production. Current decipherment methodologies of OBC are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment tasks of pictographic OBCs. It includes 20k meticulously collected OBC and real object images, forming over 15k multi-choice questions. We also conduct subjective annotations to investigate the consistency of the reference point between humans and LMMs in visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills, but they do not use visual information effectively and are mostly limited by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.
[75] Posterior shape models revisited: Improving 3D reconstructions from partial data using target specific models
Jonathan Aellen,Florian Burkhardt,Thomas Vetter,Marcel Lüthi
Main category: cs.CV
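TL;DR: The paper examines why pose alignment between training data and the target shape matters when reconstructing partial shapes in medical imaging, and proposes an efficient way to adapt an existing model to a specific target.
Details
Motivation: In medical imaging, partial shape reconstruction is often biased because the training data and the target shape are not aligned in pose, especially when only a small part of the shape is observed. The paper sets out to address this. Contribution: An efficient pose alignment method that significantly improves reconstruction accuracy and predicted variance while keeping the computational efficiency of linear models.
Method: A preprocessing step adjusts the existing model without access to the original training data. It recovers the aligned model exactly for translations and approximately for small rotations.
Result: Partial shape reconstruction becomes more accurate, and the method fits plug-and-play reconstruction pipelines.
Insight: Pose alignment is a key factor in partial shape reconstruction quality, and a simple preprocessing step can significantly improve existing models.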
Abstract: In medical imaging, point distribution models are often used to reconstruct and complete partial shapes using a statistical model of the full shape. A commonly overlooked, but crucial factor in this reconstruction process, is the pose of the training data relative to the partial target shape. A difference in pose alignment of the training and target shape leads to biased solutions, particularly when observing small parts of a shape. In this paper, we demonstrate the importance of pose alignment for partial shape reconstructions and propose an efficient method to adjust an existing model to a specific target. Our method preserves the computational efficiency of linear models while significantly improving reconstruction accuracy and predicted variance. It exactly recovers the intended aligned model for translations, and provides a good approximation for small rotations, all without access to the original training data. Hence, existing shape models in reconstruction pipelines can be adapted by a simple preprocessing step, making our approach widely applicable in plug-and-play scenarios.
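The pose bias described above can be seen in a toy one-dimensional case. The sketch assumes a linear Gaussian shape model with a single component, s = mean + alpha * component + noise, with a standard normal prior on alpha and isotropic noise variance `sigma2`; it shows only the standard posterior for such a model, not the paper's adjustment method, and all values are made up for illustration.

```python
def posterior_coefficient(mean, component, observed, sigma2):
    """Posterior mean and variance of alpha given partial observations.

    observed maps point index -> observed coordinate. Assumes a
    rank-1 model with prior alpha ~ N(0, 1) and noise variance sigma2.
    """
    num = sum(component[i] * (x - mean[i]) for i, x in observed.items())
    den = sum(component[i] ** 2 for i in observed) + sigma2
    return num / den, sigma2 / den


def reconstruct(mean, component, alpha):
    """Full shape predicted from the coefficient alpha."""
    return [m + alpha * q for m, q in zip(mean, component)]


mean = [0.0, 1.0, 2.0, 3.0]
component = [-1.5, -0.5, 0.5, 1.5]   # hypothetical mode: stretch about the centre
aligned = {0: 0.0, 1: 1.0}           # partial target in the model's pose
shifted = {0: 0.5, 1: 1.5}           # same partial target, translated by 0.5

alpha_aligned, variance = posterior_coefficient(mean, component, aligned, 0.01)
alpha_shifted, _ = posterior_coefficient(mean, component, shifted, 0.01)
```

The aligned observation yields a coefficient of zero, so the mean shape is recovered, whereas the translated copy of the same observation yields a coefficient near -0.4: the model explains the pose offset as a shape change, which is the kind of bias the abstract attributes to misaligned training and target poses.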
[76] 3DPillars: Pillar-based two-stage 3D object detection
Jongyoun Noh,Junghyup Lee,Hyekang Park,Bumsub Ham
Main category: cs.CV
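TL;DR: The paper proposes 3DPillars, a two-stage 3D object detection framework based on pillar pseudo images that narrows the performance gap between PointPillars and state-of-the-art methods while retaining efficiency.
Details
Motivation: PointPillars is efficient, but its pseudo image representation cannot preserve precise 3D structure and makes a two-stage detection pipeline hard to adopt, so it lags behind other state-of-the-art methods. The authors aim to improve detection performance by addressing both issues. Contribution: (1) 3DPillars, a new CNN architecture that learns 3D voxel-based features efficiently with 2D convolutions. (2) An RoI head with a sparse scene context feature module, which supports a two-stage pipeline and makes full use of scene context.
Method: (1) A separable voxel feature module extracts 3D features. (2) A sparse scene context feature module aggregates multi-scale features to refine proposals in the two-stage pipeline.
Result: Experiments on the KITTI and Waymo Open datasets confirm the effectiveness and efficiency of the method, with a good trade-off between speed and accuracy.
Insight: Treating 3D features as a stack of pseudo images lets 2D convolutions handle 3D tasks efficiently, and sparse scene features further improve two-stage detection.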
Abstract: PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals that typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gaps between PointPillars and state-of-the-art methods, while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively, and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise in terms of speed and accuracy.
[77] Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets
Phongsakon Mark Konrad,Andrei-Alexandru Popa,Yaser Sabzehmeidani,Liang Zhong,Elisa A. Liehn,Serkan Ayvaz
Main category: cs.CV
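TL;DR: The paper studies how deep learning segmentation models for arterial structures perform on a limited dataset of cardiovascular histopathology images. The models prove highly sensitive to data splits, exposing the limits of standard benchmarking in low-data clinical settings.
Details
Motivation: Cardiovascular disease research and diagnosis require accurate segmentation of carotid artery structures, but deep learning model development in this domain is constrained by the scarcity of annotated data. The paper evaluates how existing methods perform under low-data conditions. Contribution: A systematic evaluation of several deep learning segmentation models (including U-Net, DeepLabV3+, SegFormer, and SAM variants) on a limited dataset, revealing the limits of standard benchmarking.
Method: Hyperparameters are optimized with Bayesian search, several models are tested on cardiovascular histopathology images, and their sensitivity to data splits is analyzed.
Result: Model performance is highly sensitive to data splits. Small differences are driven mainly by statistical noise rather than algorithmic superiority, and performance rankings do not necessarily reflect clinical utility.
Insight: In low-data medical settings, standard benchmarks may be unreliable, and model rankings should be interpreted with caution.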
Abstract: Accurate segmentation of carotid artery structures in histopathological images is vital for advancing cardiovascular disease research and diagnosis. However, deep learning model development in this domain is constrained by the scarcity of annotated cardiovascular histopathological data. This study presents a systematic evaluation of state-of-the-art deep learning segmentation models, including convolutional neural networks (U-Net, DeepLabV3+), a Vision Transformer (SegFormer), and recent foundation models (SAM, MedSAM, MedSAM+UNet), on a limited dataset of cardiovascular histology images. Despite employing an extensive hyperparameter optimization strategy with Bayesian search, our findings reveal that model performance is highly sensitive to data splits, with minor differences driven more by statistical noise than by true algorithmic superiority. This instability exposes the limitations of standard benchmarking practices in low-data clinical settings and challenges the assumption that performance rankings reflect meaningful clinical utility.
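The point about split sensitivity can be made concrete with a small check that compares two models across several data splits rather than on a single one. The per-split scores below are made-up numbers for illustration only, not results from the paper, and the two-standard-error rule is a rough heuristic chosen for this sketch.

```python
import math
import statistics


def compare_across_splits(scores_a, scores_b):
    """Paired comparison of two models evaluated on the same splits.

    Returns the mean difference, its standard error, and whether the
    difference exceeds two standard errors (a rough heuristic).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, std_err, abs(mean_diff) > 2.0 * std_err


# Hypothetical Dice scores on five random splits (illustrative only).
model_a = [0.80, 0.74, 0.79, 0.71, 0.77]
model_b = [0.78, 0.77, 0.75, 0.74, 0.76]
mean_diff, std_err, conclusive = compare_across_splits(model_a, model_b)
```

With these illustrative numbers the average gap between the two models is far smaller than its standard error, so a ranking drawn from any single split would mostly reflect which split happened to be chosen.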
[78] BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
Yujie Li,Wenjia Xu,Yuanben Zhang,Zhiwei Wei,Mugen Peng
Main category: cs.CV
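TL;DR: The paper proposes BTCChat, a multimodal large language model for bi-temporal change captioning of remote sensing images. A Change Extraction module and a Prompt Augmentation mechanism help the model capture temporal features and semantic changes, and it achieves state-of-the-art results in experiments.
Details
Motivation: Bi-temporal satellite imagery is essential for urban development monitoring and disaster assessment, but existing methods process image pairs by simple concatenation and cannot adequately model temporal correlations and semantic changes, which limits overall performance. Contribution: The BTCChat model, which understands bi-temporal change and supports both change captioning and single-image interpretation; a Change Extraction module that captures temporal features and spatial semantic changes; and a Prompt Augmentation mechanism that strengthens attention to spatial detail.
Method: BTCChat includes a Change Extraction module that extracts spatio-temporal features from bi-temporal image pairs, and a Prompt Augmentation mechanism that adds contextual clues to the prompt.
Result: BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
Insight: Precise modeling of temporal features and semantic changes within multimodal large language models matters for remote sensing image analysis, and the proposed modules improve performance accordingly.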
Abstract: Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model’s attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
[79] A Fine-Grained Attention and Geometric Correspondence Model for Musculoskeletal Risk Classification in Athletes Using Multimodal Visual and Skeletal Features
Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Tamanna Shermin,Md Rafiqul Islam,Mukhtar Hussain,Sami Azam
Main category: cs.CV
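TL;DR: The paper proposes ViSK-GAT, a multimodal deep learning framework that combines visual and skeletal features to classify musculoskeletal risk in athletes, and it clearly outperforms existing methods.
Details
Motivation: Musculoskeletal disorders pose significant risks to athletes, and existing methods rely on a single type of data, so they cannot assess risk reliably in complex environments. Contribution: The ViSK-GAT framework, which fuses visual and skeletal coordinate features and introduces a fine-grained attention module and a geometric correspondence module to improve classification.
Method: A Residual Block is combined with a Lightweight Transformer Block, and the FGAM and MGCM modules refine and align features across modalities.
Result: Validation and test accuracies reach 93.55% and 93.89% respectively, clearly better than nine popular transfer learning models.
Insight: Multimodal feature fusion and geometric alignment can markedly improve risk classification and are valuable for early intervention.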
Abstract: Musculoskeletal disorders pose significant risks to athletes, and assessing risk early is important for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research proposes ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework designed to classify musculoskeletal risk using visual and skeletal coordinate-based features. In addition, a custom multimodal dataset is constructed by combining visual data and skeletal coordinates for risk assessment. Each sample is labeled into eight risk categories based on the Rapid Entire Body Assessment system. ViSK-GAT combines a Residual Block with a Lightweight Transformer Block to learn spatial and temporal dependencies jointly. It incorporates two novel modules: the Fine-Grained Attention Module (FGAM), which enables precise inter-modal feature refinement through cross-attention between visual and skeletal inputs, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal coherence by aligning image features with coordinate-based representations. ViSK-GAT achieved strong performance with validation and test accuracies of 93.55% and 93.89%, respectively; a precision of 93.86%; an F1 score of 93.85%; and Cohen’s Kappa and Matthews Correlation Coefficient of 93%. The regression results also indicated a low Root Mean Square Error of the predicted probability distribution of 0.1205 and a corresponding Mean Absolute Error of 0.0156. Compared to nine popular transfer learning backbones, ViSK-GAT consistently outperformed previous methods. The ViSK-GAT model advances artificial intelligence implementation and application, transforming musculoskeletal risk classification and enabling impactful early interventions in sports.
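Among the metrics listed above, Cohen's Kappa is less familiar than accuracy. The sketch below computes it from its standard definition, observed agreement corrected for the agreement expected by chance; it is a generic illustration with toy labels, not the paper's evaluation code.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's Kappa for two label sequences of equal length.

    Undefined (division by zero) when chance agreement equals 1,
    for example when both sequences contain a single class.
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

Perfect agreement gives a value of 1, and predictions that agree with the labels only as often as chance would give 0, which is why Kappa is often reported alongside accuracy for multi-class problems such as the eight risk categories described here.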
[80] Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models
Ruiqi Shen,Haotian Wu,Wenjing Zhang,Jiangjing Hu,Deniz Gunduz
Main category: cs.CV
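TL;DR: The paper proposes a CLIP-based semantic compression method that compresses CLIP feature embeddings into very few bits while preserving semantic information across tasks, which greatly reduces the bit rate.
Details
Motivation: Emerging applications prioritize semantic preservation over pixel-level reconstruction, and conventional image compression struggles to stay robust across tasks and data distributions, so a new semantic compression paradigm is needed. Contribution: A CLIP-based semantic compression method that preserves semantic integrity at extremely low bit rates and shows zero-shot robustness.
Method: Instead of compressing images for pixel reconstruction, the method compresses the feature embeddings of the contrastively pretrained CLIP model.
Result: On benchmark datasets the average bit rate is about 2-3 × 10^(-3) bits per pixel, less than 5% of what mainstream methods need, and the method is robust across tasks and data distributions.
Insight: Semantic compression outperforms conventional pixel-level compression, especially across tasks and data distributions, which demonstrates the potential of foundation models.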
Abstract: Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2-3 × 10^(-3) bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
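The reported rate is easier to picture as a bit budget per image. The short calculation below converts bits per pixel into total bits and derives the baseline rate implied by the "less than 5%" statement. The 224 × 224 image size is an assumption chosen for illustration; the abstract does not state the resolution used.

```python
def bits_for_image(bits_per_pixel, width, height):
    """Total bits needed for one image at a given rate."""
    return bits_per_pixel * width * height


def implied_baseline_bpp(bits_per_pixel, fraction=0.05):
    """Baseline rate if bits_per_pixel is the given fraction of it."""
    return bits_per_pixel / fraction


# Assumed 224 x 224 image (hypothetical resolution).
low = bits_for_image(2e-3, 224, 224)
high = bits_for_image(3e-3, 224, 224)
```

At the assumed resolution the reported range works out to roughly 100 to 150 bits per image, and being under 5% of the mainstream rate implies those baselines need more than about 0.04 to 0.06 bits per pixel.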
[81] AttriPrompt: Dynamic Prompt Composition Learning for CLIP
Qiqi Zhan,Shiwei Li,Qingjie Liu,Yunhong Wang
Main category: cs.CV
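TL;DR: AttriPrompt is a dynamic prompt composition learning framework that uses intermediate-layer features of CLIP's vision encoder to enrich textual semantic representations, enabling fine-grained alignment and content-aware adaptive prompts.
Details
Motivation: Current deep text prompting methods have two main problems: they rely too heavily on contrastive learning objectives and neglect fine-grained feature optimization, and their static prompts cannot adapt to different input categories. AttriPrompt addresses both with dynamic composition learning and hierarchical visual information. Contribution: The AttriPrompt framework, comprising an Attribute Retrieval module, Dual-stream Contrastive Learning, and a Self-Regularization mechanism, which significantly improves fine-grained alignment and cross-domain knowledge transfer.
Method: An Attribute Retrieval module clusters visual features, Dual-stream Contrastive Learning achieves fine-grained alignment, and a Self-Regularization mechanism prevents overfitting.
Result: AttriPrompt outperforms existing methods on three benchmarks, with gains of up to 7.37% in the base-to-novel setting.
Insight: Composing prompts dynamically from intermediate-layer features of CLIP's vision encoder effectively enriches textual semantic representations and makes vision-language pre-trained models more practical.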
Abstract: The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; and (2) static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP’s vision encoder. We designed an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt’s superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
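The retrieval step described above, where aggregated visual features pull semantically similar prompts from a pool, can be sketched as a nearest-neighbour lookup. The sketch assumes each prompt in the pool has a key vector and that similarity is cosine similarity with top-k selection; the abstract does not specify either, and the pool contents are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_prompts(query, prompt_pool, k=2):
    """Names of the k prompts whose keys best match the query feature.

    prompt_pool maps a prompt name to its key vector (an assumption
    about how the pool is indexed).
    """
    ranked = sorted(prompt_pool, key=lambda name: cosine(query, prompt_pool[name]), reverse=True)
    return ranked[:k]


# Hypothetical pool of three attribute prompts with toy key vectors.
pool = {
    "striped": [1.0, 0.0, 0.0],
    "furry": [0.0, 1.0, 0.0],
    "metallic": [0.0, 0.0, 1.0],
}
selected = retrieve_prompts([0.9, 0.4, 0.0], pool, k=2)
```

Because the selection depends on the visual feature of each input, different images retrieve different prompts, which is what distinguishes this design from the static prompts the abstract criticizes.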
[82] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Feng Wang,Zihao Yu
Main category: cs.CV
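TL;DR: The paper proposes Coefficients-Preserving Sampling (CPS), which removes the noise that SDE sampling introduces in Flow Matching models and thereby improves reinforcement learning for generative tasks.
Details
Motivation: When reinforcement learning is applied to Flow Matching models, SDE sampling introduces pronounced noise that harms reward learning and model convergence. Contribution: The CPS method, which reduces noise by preserving coefficients, improves the sampling process of Flow Matching models, and makes reinforcement learning more stable and faster to converge.
Method: The sampling process is reformulated with inspiration from DDIM, avoiding the noise caused by SDE sampling.
Result: Experiments show that CPS eliminates the noise artifacts and makes reward modeling more accurate, so that optimizers such as Flow-GRPO and Dance-GRPO converge faster and more stably.
Insight: The noise stems from excess stochasticity injected during inference, and a coefficient-preserving design reduces it effectively and improves generation quality.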
Abstract: Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
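One way to read "coefficients-preserving" is that a stochastic step should inject fresh noise without raising the total noise level of the sample. The sketch below shows a DDIM-style step with that property, assuming the linear interpolation x_t = (1 - t) * x_0 + t * eps that is common in Flow Matching. It is an illustration under that assumption and may differ from the update rule actually derived in the paper; the function names and the `eta` parameter are hypothetical.

```python
import math
import random


def noise_coefficients(t_next, eta):
    """Split the noise level t_next between predicted and fresh noise.

    The squared coefficients always sum to t_next**2, so the total
    noise level does not depend on eta.
    """
    return t_next * math.sqrt(1.0 - eta ** 2), t_next * eta


def ddim_style_step(x0_hat, eps_hat, t_next, eta, rng):
    """One stochastic step toward time t_next (sketch, see lead-in)."""
    keep, fresh = noise_coefficients(t_next, eta)
    return [
        (1.0 - t_next) * x0 + keep * eps + fresh * rng.gauss(0.0, 1.0)
        for x0, eps in zip(x0_hat, eps_hat)
    ]
```

With `eta` set to 0 the step is deterministic, and as `eta` grows more of the predicted noise is swapped for fresh noise while the sum of squared noise coefficients stays fixed, so stochasticity is added for exploration without extra noise accumulating in the sample.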
[83] Spatial-Aware Self-Supervision for Medical 3D Imaging with Multi-Granularity Observable Tasks
Yiqin Zhang,Meiling Chen,Zhengjie Zhang
Main category: cs.CV
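TL;DR: The paper proposes a self-supervised learning method for medical 3D imaging that captures spatially relevant semantics through multi-granularity observable tasks, improving interpretability while keeping performance on par with current methods.
Details
Motivation: Most existing self-supervised methods for medical 3D imaging derive from the 2D vision domain and do not show intuitively how the model learns 3D spatial knowledge, which leaves medical interpretability lacking. Contribution: A method with three sub-tasks that captures the spatial semantics of 3D medical imaging through multi-granularity spatial relationship modeling, with task designs that follow observable principles.
Method: Three sub-tasks exploit the extra dimension of 3D imaging for richer semantics, and multi-granularity spatial relationship modeling keeps training stable.
Result: Experiments show performance on par with current methods while making the self-supervised learning process intuitive to inspect.
Insight: Observable tasks and spatial relationship modeling can improve the interpretability of self-supervised learning for 3D medical imaging without sacrificing performance.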
Abstract: The application of self-supervised techniques has become increasingly prevalent within medical visualization tasks, primarily due to its capacity to mitigate the data scarcity prevalent in the healthcare sector. The majority of current works are influenced by designs originating in the generic 2D visual domain, which lack the intuitive demonstration of the model’s learning process regarding 3D spatial knowledge. Consequently, these methods often fall short in terms of medical interpretability. We propose a method consisting of three sub-tasks to capture the spatially relevant semantics in medical 3D imaging. Their design adheres to observable principles to ensure interpretability, and minimize the performance loss caused thereby as much as possible. By leveraging the enhanced semantic depth offered by the extra dimension in 3D imaging, this approach incorporates multi-granularity spatial relationship modeling to maintain training stability. Experimental findings suggest that our approach is capable of delivering performance that is on par with current methodologies, while facilitating an intuitive understanding of the self-supervised learning process.
[84] OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization
Ye Wang,Zili Yi,Yibo Zhang,Peng Zheng,Xuping Xie,Jiang Lin,Yilin Wang,Rui Ma
Main category: cs.CV
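TL;DR: OmniStyle2 takes a new approach to artistic style transfer: it builds the large-scale DST-100K dataset through destylization and trains a simple feed-forward model that surpasses existing methods.
Details
Motivation: Artistic style transfer lacks supervision from ground-truth data, and conventional methods depend on synthetic data or hand-crafted design, which makes content and style accuracy hard to guarantee. The paper addresses this by generating a high-quality dataset through destylization. Contribution: (1) The destylization task and the DST-100K dataset. (2) DST, a text-guided destylization model, and DST-Filter, an evaluation model. (3) The OmniStyle2 model, which achieves superior performance.
Method: (1) The text-guided DST model recovers style-free content from artworks. (2) DST-Filter applies multi-stage evaluation with Chain-of-Thought reasoning to keep only high-quality pairs. (3) A feed-forward model is trained on top of FLUX.1-dev.
Result: OmniStyle2 surpasses existing methods on both qualitative and quantitative benchmarks, validating data generation through destylization.
Insight: Data generation is central to artistic style transfer, and destylization offers a reliable supervision paradigm for tasks that lack ground-truth data.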
Abstract: OmniStyle2 introduces a novel approach to artistic style transfer by reframing it as a data problem. Our key insight is destylization, reversing style transfer by removing stylistic elements from artworks to recover natural, style-free counterparts. This yields DST-100K, a large-scale dataset that provides authentic supervision signals by aligning real artistic styles with their underlying content. To build DST-100K, we develop (1) DST, a text-guided destylization model that reconstructs style-free content, and (2) DST-Filter, a multi-stage evaluation model that employs Chain-of-Thought reasoning to automatically discard low-quality pairs while ensuring content fidelity and style accuracy. Leveraging DST-100K, we train OmniStyle2, a simple feed-forward model based on FLUX.1-dev. Despite its simplicity, OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks. Our results demonstrate that scalable data generation via destylization provides a reliable supervision paradigm, overcoming the fundamental challenge posed by the lack of ground-truth data in artistic style transfer.
[85] ConstStyle: Robust Domain Generalization with Unified Style Transformation
Nam Duong Tran,Nam Nguyen Phuong,Hieu H. Pham,Phi Le Nguyen,My T. Thai
Main category: cs.CV
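TL;DR: ConstStyle proposes a unified style transformation that maps training and test data onto a single unified domain to reduce the impact of domain shift, markedly improving the robustness of domain generalization.
Details
Motivation: Deep neural networks lose performance when the data distribution changes, and existing domain generalization methods struggle when training domains are limited or the gap between domains is large, so a more robust method is needed. Contribution: Proposes ConstStyle, which captures domain-invariant features through a unified style transformation, supported by theoretical analysis, and bridges the gap between training and test domains.
Method: Maps all samples onto a unified domain that is optimized for the training domains; at test time, unseen-domain samples are projected onto the same domain before prediction.
Result: ConstStyle outperforms existing methods across diverse scenarios; when training domains are limited, it improves accuracy by up to 19.82% over the next best approach.
Insight: Aligning training and test data within a unified domain effectively reduces the impact of domain shift, even with large domain gaps or few training domains.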
Abstract: Deep neural networks often suffer performance drops when test data distribution differs from training data. Domain Generalization (DG) aims to address this by focusing on domain-invariant features or augmenting data for greater diversity. However, these methods often struggle with limited training domains or significant gaps between seen (training) and unseen (test) domains. To enhance DG robustness, we hypothesize that it is essential for the model to be trained on data from domains that closely resemble unseen test domains, an inherently difficult task due to the absence of prior knowledge about the unseen domains. Accordingly, we propose ConstStyle, a novel approach that leverages a unified domain to capture domain-invariant features and bridge the domain gap with theoretical analysis. During training, all samples are mapped onto this unified domain, optimized for seen domains. During testing, unseen domain samples are projected similarly before predictions. By aligning both training and testing data within this unified domain, ConstStyle effectively reduces the impact of domain shifts, even with large domain gaps or few seen domains. Extensive experiments demonstrate that ConstStyle consistently outperforms existing methods across diverse scenarios. Notably, when only a limited number of seen domains are available, ConstStyle can boost accuracy by up to 19.82% compared to the next best approach.
[86] S-LAM3D: Segmentation-Guided Monocular 3D Object Detection via Feature Space Fusion
Diana-Alexandra Sas,Florin Oniga
Main category: cs.CV
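TL;DR: This paper proposes S-LAM3D, which injects precomputed semantic segmentation information into the feature space to guide monocular 3D object detection without expanding the detection model or jointly learning the priors. It performs well on the KITTI benchmark, particularly for small objects (pedestrians and cyclists).
Details
Motivation: Monocular 3D object detection is an ill-posed problem because its input is a single 2D image with no depth information. Existing methods mainly rely on CNNs or Transformers for feature extraction and overlook the potential gains from segmentation information. Contribution: Proposes a decoupled strategy that fuses segmentation priors directly into the feature space to guide detection, without expanding the model or jointly learning the priors, and improves detection of small objects.
Method: Uses precomputed segmentation information (such as semantic segmentation) as a prior and fuses it with RGB features directly in the feature space, avoiding additional prediction branches.
Result: On the KITTI benchmark, the method outperforms the equivalent RGB-only baseline on small objects (pedestrians and cyclists), demonstrating the value of segmentation guidance.
Insight: A better understanding of the input data can offset the need for additional sensors or training data; segmentation information provides an effective prior for monocular 3D detection.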
Abstract: Monocular 3D Object Detection represents a challenging Computer Vision task due to the nature of the input used, which is a single 2D image, lacking in any depth cues and placing the depth estimation problem as an ill-posed one. Existing solutions leverage the information extracted from the input by using Convolutional Neural Networks or Transformer architectures as feature extraction backbones, followed by specific detection heads for 3D parameters prediction. In this paper, we introduce a decoupled strategy based on injecting precomputed segmentation information priors and fusing them directly into the feature space for guiding the detection, without expanding the detection model or jointly learning the priors. The focus is on evaluating the impact of additional segmentation information on existing detection pipelines without adding additional prediction branches. The proposed method is evaluated on the KITTI 3D Object Detection Benchmark, outperforming the equivalent architecture that relies only on RGB image features for small objects in the scene: pedestrians and cyclists, and proving that understanding the input data can balance the need for additional sensors or training data.
[87] Motion Aware ViT-based Framework for Monocular 6-DoF Spacecraft Pose Estimation
Jose Sosa,Dan Pineau,Arunkumar Rathinam,Abdelrahman Shabayek,Djamila Aouada
Main category: cs.CV
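TL;DR: This paper proposes a monocular 6-DoF spacecraft pose estimation framework based on a Vision Transformer (ViT) and optical flow. It captures dynamics with motion-aware heatmaps and optical flow, recovers the 6-DoF pose with a PnP solver, and outperforms single-image baselines on multiple datasets.
Details
Motivation: Existing spacecraft pose estimation methods usually localize keypoints in static single images and fail to exploit the temporal information inherent to space operations. This work aims to improve pose accuracy by integrating motion dynamics. Contribution: 1. Transfers a deep learning framework from human pose estimation to spacecraft pose estimation; 2. Combines a ViT encoder with a pre-trained optical flow model to capture motion dynamics; 3. Demonstrates the method's performance advantage on multiple datasets.
Method: 1. Extracts image features with a ViT encoder; 2. Obtains motion cues from a pre-trained optical flow model; 3. Localizes 2D keypoints jointly from motion-aware heatmaps and optical flow; 4. Recovers the 6-DoF pose with a PnP solver.
Result: Outperforms single-image baselines on the SPADES-RGB dataset and shows good generalization on the SPARK-2024 dataset.
Insight: 1. Motion information is crucial for spacecraft pose estimation; 2. Combining a ViT with optical flow makes effective use of temporal information; 3. The method generalizes well in cross-dataset tests.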
Abstract: Monocular 6-DoF pose estimation plays an important role in multiple spacecraft missions. Most existing pose estimation approaches rely on single images with static keypoint localisation, failing to exploit valuable temporal information inherent to space operations. In this work, we adapt a deep learning framework from human pose estimation to the spacecraft pose estimation domain that integrates motion-aware heatmaps and optical flow to capture motion dynamics. Our approach combines image features from a Vision Transformer (ViT) encoder with motion cues from a pre-trained optical flow model to localise 2D keypoints. Using the estimates, a Perspective-n-Point (PnP) solver recovers 6-DoF poses from known 2D-3D correspondences. We train and evaluate our method on the SPADES-RGB dataset and further assess its generalisation on real and synthetic data from the SPARK-2024 dataset. Overall, our approach demonstrates improved performance over single-image baselines in both 2D keypoint localisation and 6-DoF pose estimation. Furthermore, it shows promising generalisation capabilities when testing on different data distributions.
[88] BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users
Wanyin Cheng,Zanxi Ruan
Main category: cs.CV
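TL;DR: BLaVe-CoT is a VQA framework designed for blind and low vision users that handles visual blur and question ambiguity by reasoning about answer consistency.
Details
Motivation: Existing VQA systems assume a single answer and a single region, but photos taken by blind and low vision users are often blurry and their questions ambiguous, which leads to multiple valid answers. Contribution: Proposes the BLaVe-CoT framework, which generates candidate answers with a LoRA-tuned BLIP-2 model, grounds them spatially with PolyFormer, and assesses answer consistency with a chain-of-thought reasoning module.
Method: Generates diverse answers with BLIP-2, grounds each answer spatially with PolyFormer, and evaluates consistency with a chain-of-thought reasoning module.
Result: Outperforms previous methods on the VQA-AnswerTherapy benchmark and is more robust to ambiguity and visual noise.
Insight: VQA systems need to adapt to human uncertainty in order to provide inclusive support for blind and low vision users.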
Abstract: Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions, which is a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.
[89] DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-temporal Fusion
Mengmeng Liu,Michael Ying Yang,Jiuming Liu,Yunpeng Zhang,Jiangtao Li,Sander Oude Elberink,George Vosselman,Hao Cheng
Main category: cs.CV
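TL;DR: DVLO4D proposes a novel visual-LiDAR odometry framework that improves accuracy and robustness through sparse spatial-temporal fusion. Its main components are Sparse Query Fusion, a Temporal Interaction and Update module, and a Temporal Clip Training strategy.
Details
Motivation: Traditional methods fall short in sensor alignment, use of temporal information, and manual parameter tuning; DVLO4D aims to address these problems. Contribution: Proposes three innovations (Sparse Query Fusion, a Temporal Interaction and Update module, and a Temporal Clip Training strategy) that substantially improve odometry performance.
Method: Uses sparse LiDAR queries for multi-modal fusion and combines temporal prediction with a loss aggregated across multiple frames to reduce accumulated error.
Result: Performs strongly on the KITTI and Argoverse datasets with an inference time of 82 ms, which suits real-time deployment.
Insight: Sparse spatial-temporal fusion and global optimization play an important role in improving odometry accuracy and robustness.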
Abstract: Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally-predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model’s robustness against accumulative errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing the scale drift over long sequences. Extensive experiments on the KITTI and Argoverse Odometry datasets demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, giving it potential for real-time deployment.
[90] Analysis of Blood Report Images Using General Purpose Vision-Language Models
Nadia Bakhsheshi,Hamid Beigy
Main category: cs.CV
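TL;DR: The paper explores the potential of general-purpose vision-language models (VLMs) for automatically analyzing blood report images and evaluates three VLMs; the results suggest they can support patient-facing tools for preliminary analysis.
Details
Motivation: Reliable analysis of blood reports matters for health knowledge, but individuals often struggle to interpret them, which leads to anxiety and overlooked issues. The study aims to address this with VLMs. Contribution: The main contribution is an evaluation of three VLMs (Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick) that supports their practicality for analyzing blood report images.
Method: The study uses a dataset of 100 blood report images, prompts each model with clinically relevant questions adapted to each report, and uses Sentence-BERT to evaluate how closely the models' answers agree.
Result: General-purpose VLMs can provide clear interpretations directly from images, which may improve health literacy and lower the barrier to understanding complex medical information.
Insight: The study lays a foundation for reliable and accessible AI-assisted healthcare applications, but the results should be treated with caution because the dataset is small.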
Abstract: The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, determining their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and reduce the limitations to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.
[91] TinyDef-DETR: An Enhanced DETR Detector for UAV Power Line Defect Detection
Jiaming Cui
Main category: cs.CV
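TL;DR: TinyDef-DETR is a DETR-based object detection framework designed for detecting small power line defects in UAV imagery. Lossless downsampling, boundary-aware feature extraction, and a multi-scale attention module substantially improve small-object detection.
Details
Motivation: UAV power line defect detection must cope with small targets and complex backgrounds. Conventional detectors are limited by detail loss from downsampling, weak boundary sensitivity, and insufficient fusion of global and local information. Contribution: 1) A lossless space-to-depth downsampling module; 2) An edge-enhanced convolution that improves feature extraction; 3) A cross-stage dual-domain multi-scale attention module that combines global and local information; 4) A Focaler-Wise-SIoU loss that improves small-object localization.
Method: Optimizes the DETR framework by combining detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and a difficulty-adaptive regression loss (Focaler-Wise-SIoU).
Result: Improves precision and recall substantially on the CSG-ADCD dataset, especially on small-object subsets, with only a modest increase in computational overhead; validation on VisDrone confirms generalization.
Insight: Detail preservation, boundary enhancement, and global-local information fusion are key to small-object detection, and the DETR framework can be adapted to specific tasks through module-level optimization.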
Abstract: Automated inspection of transmission lines using UAVs is hindered by the difficulty of detecting small and ambiguous defects against complex backgrounds. Conventional detectors often suffer from detail loss due to strided downsampling, weak boundary sensitivity in lightweight backbones, and insufficient integration of global context with local cues. To address these challenges, we propose TinyDef-DETR, a DETR-based framework designed for small-defect detection. The method introduces a stride-free space-to-depth module for lossless downsampling, an edge-enhanced convolution for boundary-aware feature extraction, a cross-stage dual-domain multi-scale attention module to jointly capture global and local information, and a Focaler-Wise-SIoU regression loss to improve localization of small objects. Experiments conducted on the CSG-ADCD dataset demonstrate that TinyDef-DETR achieves substantial improvements in both precision and recall compared to competitive baselines, with particularly notable gains on small-object subsets, while incurring only modest computational overhead. Further validation on the VisDrone benchmark confirms the generalization capability of the proposed approach. Overall, the results indicate that integrating detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and difficulty-adaptive regression provides a practical and efficient solution for UAV-based small-defect inspection in power grids.
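To make the idea of "lossless" space-to-depth downsampling concrete, here is a minimal pure-Python sketch. It is not the TinyDef-DETR module itself: the function name `space_to_depth`, the single-channel nested-list input, and the default block size of 2 are illustrative assumptions. The point it shows is that spatial resolution is halved by moving each 2×2 neighbourhood into the channel dimension, so no pixel value is discarded (unlike strided convolution or pooling).

```python
def space_to_depth(image, block=2):
    """Rearrange each block x block patch of a 2D grid into a channel vector.

    `image` is a list of equal-length rows (a single-channel H x W grid).
    Returns an (H/block) x (W/block) grid whose cells are lists of
    block*block values. Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    output = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            # Collect the patch in row-major order; nothing is dropped.
            row.append([image[top + dy][left + dx]
                        for dy in range(block)
                        for dx in range(block)])
        output.append(row)
    return output


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
packed = space_to_depth(grid)
print(packed[0][0])  # the top-left 2x2 patch as a 4-channel vector
```

In a deep learning framework the same rearrangement is usually applied to feature tensors and followed by a convolution; the sketch only demonstrates the rearrangement step.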
[92] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Yuming Li,Yikai Wang,Yuying Zhu,Zhongyu Zhao,Ming Lu,Qi She,Shanghang Zhang
Main category: cs.CV
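TL;DR: The paper proposes BranchGRPO, which uses a branch sampling strategy and pruning to substantially reduce the training cost of diffusion models and improve stability, while also improving human preference alignment of generative models.
Details
Motivation: Existing GRPO methods suffer from high computational cost in on-policy rollouts and SDE sampling steps, and from training instability caused by sparse rewards, so both efficiency and stability need improvement. Contribution: 1) A branch sampling strategy that lowers computational cost; 2) A tree-based advantage estimator that incorporates dense process-level rewards; 3) Pruning strategies that exploit path and depth redundancy to accelerate convergence.
Method: BranchGRPO combines branch sampling, computation shared across common prefixes, and pruning of low-reward paths and redundant depths to improve training efficiency and exploration diversity.
Result: On image and video preference alignment tasks, BranchGRPO improves alignment scores by 16% and cuts training time by 50%.
Insight: Structured branching and pruning can efficiently balance computational cost against exploration diversity in diffusion models and improve human preference alignment.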
Abstract: Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy updating the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
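The abstract does not spell out the tree-based advantage estimator, so the following is only one plausible reading, sketched under stated assumptions: leaf rewards are averaged upward to give each internal node a value, and each child's advantage is its value normalized against its siblings, in the spirit of GRPO's group-relative normalization. The tree encoding (nested lists with float leaves) and the names `node_value` and `sibling_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def node_value(node):
    """Mean leaf reward beneath a node.

    A leaf is a number (its reward); an internal node is a list of children.
    Assumed encoding, not taken from the paper.
    """
    if isinstance(node, (int, float)):
        return float(node)
    return mean(node_value(child) for child in node)


def sibling_advantages(children, eps=1e-8):
    """Normalize each child's value against its siblings (group-relative)."""
    values = [node_value(child) for child in children]
    center, spread = mean(values), pstdev(values)
    # eps keeps the division defined when all siblings have equal value.
    return [(value - center) / (spread + eps) for value in values]


# Two branches from a shared prefix, each with two leaf rollouts.
tree = [[1.0, 0.0], [0.0, 0.0]]
print(sibling_advantages(tree))     # branch-level signal at depth 1
print(sibling_advantages(tree[0]))  # leaf-level signal inside branch 0
```

Because every depth yields its own normalized signal, intermediate denoising steps receive feedback even though rewards are only observed at the leaves, which is the kind of dense, process-level credit the abstract alludes to.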
[93] Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
Jaemin Son,Sujin Choi,Inyong Yun
Main category: cs.CV
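TL;DR: This paper proposes a lightweight token pruning framework that filters non-informative background regions out of document images before they are processed by a vision-language model (VLM), reducing computational cost.
Details
Motivation: Vision-language models perform well on document understanding tasks, but their high computational cost limits practical use, so an efficient way to reduce the compute burden is needed. Contribution: The main contribution is an index-preserving lightweight token pruning framework that uses a binary patch-level classifier and a max-pooling refinement step to filter out non-text regions while keeping text intact.
Method: The method has two steps: 1) a binary patch-level classifier identifies and removes non-text regions; 2) a max-pooling refinement step recovers fragmented text regions.
Result: Experiments show that the method substantially lowers computational cost while maintaining accuracy comparable to the baseline.
Insight: Pruning non-text regions can substantially improve the efficiency of vision-language models on document understanding tasks without sacrificing performance.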
Abstract: Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
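The two steps described above can be sketched in a few lines of pure Python. This is a toy illustration under assumptions, not the authors' implementation: the binary mask stands in for the patch classifier's output, the 3×3 window size is a guess, and "index-preserving" is interpreted here as keeping each surviving patch's original row-major index so that positional information stays valid after pruning.

```python
def max_pool_refine(mask, kernel=3):
    """Stride-1 max pooling over a binary patch mask (acts like dilation).

    A patch is kept if any patch inside its kernel x kernel neighbourhood
    was classified as text, which reconnects fragmented text regions.
    """
    height, width = len(mask), len(mask[0])
    radius = kernel // 2
    refined = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            refined[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            )
    return refined


def kept_indices(mask):
    """Row-major indices of the patches that survive pruning."""
    width = len(mask[0])
    return [i * width + j
            for i, row in enumerate(mask)
            for j, keep in enumerate(row) if keep]


# Hypothetical classifier output on a 4x4 patch grid: two isolated text hits.
text_mask = [[0, 0, 0, 0],
             [0, 1, 0, 1],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
refined = max_pool_refine(text_mask)
print(kept_indices(text_mask))  # indices before refinement
print(kept_indices(refined))    # indices after refinement; bottom row stays pruned
```

Only the tokens at the returned indices would be passed to the VLM, together with those indices, so the model still sees where each patch sat in the page.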
[94] PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology
Yating Huang,Ziyan Huang,Lintao Xiang,Qijun Yang,Hujun Yin
Main category: cs.CV
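TL;DR: The paper proposes the PathoHR-Bench benchmark and a pathology-specific vision-language training scheme to improve the hierarchical reasoning of vision-language models in pathology.
Details
Motivation: Pathological image analysis is essential for automated tumor diagnosis, but existing vision-language models struggle to capture complex cross-modal relationships, which limits clinical use. Contribution: Proposes the PathoHR-Bench benchmark and an enhanced multimodal contrastive learning scheme that improves fine-grained representation of pathology images.
Method: Generates enhanced and perturbed samples for multimodal contrastive learning to support hierarchical and compositional reasoning in the pathology domain.
Result: Achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets.
Insight: Pathology requires more sophisticated hierarchical semantic understanding and compositional reasoning, and existing VL models still have room to improve in this domain.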
Abstract: Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models’ abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results on this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, which limits their applicability in clinical settings. To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.
[95] CARDIE: clustering algorithm on relevant descriptors for image enhancement
Giulia Bonino,Luca Alberto Rizzo
Main category: cs.CV
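TL;DR: The paper proposes CARDIE, an unsupervised clustering algorithm aimed at image enhancement that clusters images by their color and luminosity content, and quantifies the impact of image enhancement algorithms on luminance distribution and local variance.
Details
Motivation: Automatic image clustering is a fundamental computer vision task, but its use in image enhancement is limited, mainly because it is hard to define clusters that are meaningful for that task. The authors address this with a clustering algorithm based on color and luminosity content. Contribution: 1) Proposes CARDIE, which performs unsupervised clustering based on color and luminosity content; 2) Introduces a method to quantify the impact of image enhancement algorithms on luminance distribution and local variance; 3) Shows that CARDIE clusters are effective for resampling image enhancement datasets, improving tone mapping and denoising algorithms.
Method: CARDIE is an unsupervised clustering algorithm that groups images by their color and luminosity features. The quantification method measures how image enhancement algorithms change luminance distribution and local variance, and is used to validate the clusters.
Result: CARDIE produces clusters that are more relevant to image enhancement than clusters based on semantic attributes. Resampling datasets with these clusters improves the performance of tone mapping and denoising algorithms.
Insight: When designing clustering for a specific task such as image enhancement, low-level visual features (color and luminosity) may be more meaningful than semantic attributes. The quantification method also offers a new tool for evaluating image enhancement algorithms.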
Abstract: Automatic image clustering is a cornerstone of computer vision, yet its application to image enhancement remains limited, primarily due to the difficulty of defining clusters that are meaningful for this specific task. To address this issue, we introduce CARDIE, an unsupervised algorithm that clusters images based on their color and luminosity content. In addition, we introduce a method to quantify the impact of image enhancement algorithms on luminance distribution and local variance. Using this method, we demonstrate that CARDIE produces clusters more relevant to image enhancement than those derived from semantic image attributes. Furthermore, we demonstrate that CARDIE clusters can be leveraged to resample image enhancement datasets, leading to improved performance for tone mapping and denoising algorithms. To encourage adoption and ensure reproducibility, we publicly release CARDIE code on our GitHub.
[96] SpecSwin3D: Generating Hyperspectral Imagery from Multispectral Data via Transformer Networks
Tang Sui,Songxi Yang,Qunying Huang
Main category: cs.CV
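TL;DR: SpecSwin3D is a Transformer-based model that generates hyperspectral imagery from multispectral data while preserving quality in both the spatial and spectral dimensions. A cascade training strategy and an optimized band sequence substantially improve reconstruction.
Details
Motivation: Multispectral and hyperspectral imagery are widely used in agriculture, environmental monitoring, and other fields, but there is an inherent trade-off between spatial and spectral resolution. Existing methods struggle to preserve spatial detail and spectral fidelity at the same time. Contribution: Proposes SpecSwin3D, a model based on a 3D shifted-window Transformer that substantially improves hyperspectral image generation through cascade training and an optimized band sequence.
Method: The model takes five multispectral bands as input and reconstructs 224 hyperspectral bands. A cascade training strategy progressively expands the spectral range, and an optimized band sequence is designed to capture band relations within the 3D shifted-window Transformer.
Result: Achieves a PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming the baseline MHF-Net (+5.6 dB PSNR, ERGAS reduced by more than half), and demonstrates practical value on downstream tasks such as land use classification and burnt area segmentation.
Insight: Cascade training and band sequence optimization are crucial for spectral fidelity, and Transformer models show considerable potential for hyperspectral generation.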
Abstract: Multispectral and hyperspectral imagery are widely used in agriculture, environmental monitoring, and urban planning due to their complementary spatial and spectral characteristics. A fundamental trade-off persists: multispectral imagery offers high spatial but limited spectral resolution, while hyperspectral imagery provides rich spectra at lower spatial resolution. Prior hyperspectral generation approaches (e.g., pan-sharpening variants, matrix factorization, CNNs) often struggle to jointly preserve spatial detail and spectral fidelity. In response, we propose SpecSwin3D, a transformer-based model that generates hyperspectral imagery from multispectral inputs while preserving both spatial and spectral quality. Specifically, SpecSwin3D takes five multispectral bands as input and reconstructs 224 hyperspectral bands at the same spatial resolution. In addition, we observe that reconstruction errors grow for hyperspectral bands spectrally distant from the input bands. To address this, we introduce a cascade training strategy that progressively expands the spectral range to stabilize learning and improve fidelity. Moreover, we design an optimized band sequence that strategically repeats and orders the five selected multispectral bands to better capture pairwise relations within a 3D shifted-window transformer framework. Quantitatively, our model achieves a PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming the baseline MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half. Beyond reconstruction, we further demonstrate the practical value of SpecSwin3D on two downstream tasks, including land use classification and burnt area segmentation.
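For readers unfamiliar with the reported metrics, the sketch below computes two of them, PSNR (in dB) and the spectral angle mapper (SAM, in degrees), using their standard textbook definitions. The function names, the flat-list inputs, and the assumed peak value of 1.0 are illustrative choices; the paper's exact evaluation protocol (for example, how SAM is averaged over pixels) is not stated in the abstract.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


def sam_degrees(reference, estimate):
    """Spectral angle between two per-pixel spectra, in degrees."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(e * e for e in estimate))
    if norm == 0:
        raise ValueError("spectral angle is undefined for a zero spectrum")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


true_spectrum = [0.20, 0.35, 0.50, 0.40]
predicted = [0.22, 0.33, 0.52, 0.41]
print(round(psnr(true_spectrum, predicted), 2), "dB")
print(round(sam_degrees(true_spectrum, predicted), 2), "degrees")
```

SAM depends only on the direction of the spectrum vector, so it is insensitive to uniform brightness scaling, whereas PSNR penalizes any per-band intensity error; that is why the two are usually reported together.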
[97] Interleaving Reasoning for Better Text-to-Image Generation
Wenxuan Huang,Shuang Chen,Zheyong Xie,Shaosheng Cao,Shixiang Tang,Yufan Shen,Qingyu Yin,Wenbo Hu,Xiaoman Wang,Yuntian Tang,Junbo Qiao,Yue Guo,Yao Hu,Zhenfei Yin,Philip Torr,Yu Cheng,Wanli Ouyang,Shaohui Lin
Main category: cs.CV
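TL;DR: This paper proposes the Interleaving Reasoning Generation (IRG) framework, which alternates between text-based reasoning and image synthesis to improve detail and instruction following in text-to-image (T2I) generation, and a two-stage training method, IRGL, that yields substantial gains on several benchmarks.
Details
Motivation: Although multimodal understanding and generation models have advanced in image generation, a large gap in instruction following and detail preservation remains compared with systems that tightly couple comprehension with generation, such as GPT-4o. Inspired by work on interleaving reasoning, this paper explores whether such reasoning can improve T2I generation. Contribution: 1. Proposes the IRG framework, which alternates text-based reasoning and image synthesis to refine detail and semantic consistency; 2. Designs the two-stage IRGL training method; 3. Builds the IRGL-300K dataset.
Method: Adopts the IRG framework, which improves generation quality by alternating text-based reasoning and image synthesis. Training has two stages: it first strengthens the initial think-and-generate stage, then uses textual reflection and image refinement to reach high-quality results.
Result: Achieves absolute gains of 5-10 points on benchmarks including GenEval and WISE, along with clear improvements in visual quality and fine-grained fidelity.
Insight: Interleaving reasoning can substantially improve detail and instruction following in T2I generation, and two-stage training combines the strengths of the text and image modalities effectively.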
Abstract: Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking, and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .
[98] UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Duomin Wang,Wei Zuo,Aojie Li,Ling-Hao Chen,Xinyao Liao,Deyu Zhou,Zixin Yin,Xili Dai,Daxin Jiang,Gang Yu
Main category: cs.CV
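TL;DR: UniVerse-1 proposes unified audio and video generation through stitching of experts (SoE), which avoids the inefficiency of training from scratch, and develops an online annotation pipeline to keep audio and video content temporally aligned.
Details
Motivation: Existing audio-video generation models usually have to be trained from scratch, which is inefficient, and text-based annotations can cause alignment problems. UniVerse-1 aims to solve both by integrating pre-trained expert models and improving the annotation process. Contribution: 1. Proposes the stitching of experts (SoE) technique to fuse pre-trained audio and video generation models efficiently; 2. Develops an online annotation pipeline that addresses temporal alignment; 3. Releases the Verse-Bench benchmark dataset and open-source code.
Method: Uses SoE to deeply fuse corresponding blocks of pre-trained video and music generation models, and generates annotations dynamically during training through an online pipeline to keep audio aligned with video content.
Result: After fine-tuning on about 7,600 hours of audio-video data, the model produces well-coordinated audio-visual content, performing particularly well on ambient sound generation and speech alignment.
Insight: Reusing pre-trained expert models together with dynamic annotation can substantially improve the efficiency and performance of multimodal generation and offers a new direction for follow-up research.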
Abstract: We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.
[99] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning
Huy Le,Nhat Chung,Tung Kieu,Jingkang Yang,Ngan Le
Main category: cs.CV
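TL;DR: UNO proposes a unified single-stage framework for video scene graph generation. Using object-centric visual representation learning, it supports both coarse-grained box-level and fine-grained panoptic pixel-level tasks, reducing the need for task-specific architectures and multi-stage training.
Details
Motivation: Existing video scene graph generation methods usually design architectures for a specific task and lack unity and efficiency. UNO addresses this with a unified framework that shares parameters across tasks and models them efficiently. Contribution: 1. Proposes UNO, a unified single-stage framework that supports video scene graph generation at multiple granularities; 2. Introduces an extended slot attention mechanism and object temporal consistency learning; 3. Designs a dynamic triplet prediction module to capture temporal interactions.
Method: 1. Decomposes visual features into object and relation representations with slot attention; 2. Uses object temporal consistency learning to keep representations stable across frames; 3. Links object pairs to temporal relations with a dynamic triplet module.
Result: UNO achieves competitive performance on box-level and pixel-level VidSGG benchmarks while improving efficiency.
Insight: A unified object-centric design can simplify the architecture of complex tasks; cross-frame consistency learning and dynamic relation modeling are key to video scene graph generation.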
Abstract: Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
[100] Exploring Light-Weight Object Recognition for Real-Time Document Detection
Lucas Wojcik,Luiz Coelho,Roger Granada,David Menotti
Main category: cs.CV
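TL;DR: This paper proposes a lightweight real-time document detection method. It adapts IWPOD-Net and trains it on a synthetic ID card dataset, using data augmentation and cross-dataset validation to optimize OCR retrieval. Experiments show the model is smaller and more efficient than other methods while retaining OCR quality.
Details
Motivation: Real-time document detection and rectification is an under-studied area, yet it is vital for automatic information retrieval. This work aims to build an efficient document detection pipeline that meets OCR retrieval needs and runs faster than existing solutions. Contribution: The main contributions are: 1) adapting IWPOD-Net for document detection; 2) training and validating on synthetic ID card datasets; 3) proposing a novel OCR quality metric based on the Levenshtein distance.
Method: The method includes: 1) adapting IWPOD-Net into a document detection model; 2) training on the NBID dataset; 3) applying data augmentation and cross-dataset validation with the MIDV dataset; 4) comparing against other state-of-the-art methods.
Result: The model is competitive with existing methods in OCR quality while being smaller and more efficient. Document rectification does not have to be perfect to reach state-of-the-art scores.
Insight: Rectification does not need to be perfect to yield high-quality OCR output, which opens a path toward efficient real-time document detection.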
Abstract: Object Recognition and Document Skew Estimation have come a long way in terms of performance and efficiency. New models follow one of two directions: improving performance using larger models, and improving efficiency using smaller models. However, real-time document detection and rectification is a niche that is largely unexplored by the literature, yet it remains a vital step for automatic information retrieval from visual documents. In this work, we strive towards an efficient document detection pipeline that is satisfactory in terms of Optical Character Recognition (OCR) retrieval and faster than other available solutions. We adapt IWPOD-Net, a license plate detection network, and train it for detection on NBID, a synthetic ID card dataset. We experiment with data augmentation and cross-dataset validation with MIDV (another synthetic ID and passport document dataset) to find the optimal scenario for the model. Other methods from both the Object Recognition and Skew Estimation state-of-the-art are evaluated for comparison with our approach. We use each method to detect and rectify the document, which is then read by an OCR system. The OCR output is then evaluated using a novel OCR quality metric based on the Levenshtein distance. Since the end goal is to improve automatic information retrieval, we use the overall OCR quality as a performance metric. We observe that with a promising model, document rectification does not have to be perfect to attain state-of-the-art performance scores. We show that our model is smaller and more efficient than current state-of-the-art solutions while retaining a competitive OCR quality metric. All code is available at https://github.com/BOVIFOCR/iwpod-doc-corners.git
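The abstract says the OCR quality metric is based on the Levenshtein distance but does not give its formula. As a hedged illustration, the sketch below implements the standard edit distance and one common way to turn it into a score in [0, 1]; the normalization by the longer string's length and the name `ocr_similarity` are assumptions, not the paper's definition.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `source` into `target`."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                         # delete s_char
                current[j - 1] + 1,                      # insert t_char
                previous[j - 1] + (s_char != t_char),    # substitute or match
            ))
        previous = current
    return previous[-1]


def ocr_similarity(ocr_text, ground_truth):
    """Hypothetical normalized score: 1.0 means a perfect transcription."""
    longest = max(len(ocr_text), len(ground_truth))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    return 1.0 - levenshtein(ocr_text, ground_truth) / longest


# Typical OCR confusions: the letter O read as zero, the digit 1 read as l.
print(ocr_similarity("J0HN D0E 12345", "JOHN DOE 12345"))
print(ocr_similarity("ID: 98l7", "ID: 9817"))
```

Scoring the final OCR text rather than corner accuracy matches the paper's stated goal: a slightly imperfect rectification is acceptable as long as the text read from it is still correct.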
[101] Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
Mohsen Gholami,Ahmad Rezaei,Zhou Weimin,Yong Zhang,Mohammad Akbari
Main category: cs.CV
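TL;DR: The paper introduces Ego3D-Bench, a new benchmark for evaluating the spatial reasoning of vision-language models (VLMs) on ego-centric, multi-view outdoor data, and develops the Ego3D-VLM framework to improve the 3D spatial understanding of VLMs.
Details
Motivation: Existing vision-language models are weak at understanding 3D spatial relationships, while real-world embodied agents such as robots and self-driving cars rely on multi-view observations. A new benchmark and framework are needed to fill this gap.
Contribution: 1. Proposes the Ego3D-Bench benchmark, with over 8,600 QA pairs, for evaluating the spatial reasoning of VLMs; 2. Proposes the Ego3D-VLM framework, which generates a cognitive map based on global 3D coordinates and significantly improves the spatial understanding of VLMs.
Method: Ego3D-VLM builds a cognitive map from estimated global 3D coordinates and integrates modularly with existing VLMs to improve their 3D spatial reasoning.
Result: Experiments show that Ego3D-VLM yields an average improvement of 12% on multi-choice QA and 56% on absolute distance estimation, though performance still falls short of human level.
Insight: Current VLMs still have considerable room for improvement in 3D spatial understanding, and Ego3D-VLM offers direction and tools for future research.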
Abstract: Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable performance gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
[102] AI-driven Remote Facial Skin Hydration and TEWL Assessment from Selfie Images: A Systematic Solution
Cecelia Soh,Rizhao Cai,Monalisha Paul,Dennis Sng,Alex Kot
Main category: cs.CV
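TL;DR: The paper proposes a systematic solution for remotely assessing facial skin hydration (SH) and trans-epidermal water loss (TEWL) from selfie images, using a novel Skin-Prior Adaptive Vision Transformer model together with a contrastive regularization that counters annotation imbalance.
Details
Motivation: Skin barrier function is crucial to skin health and disease resistance, but SH and TEWL measurements usually require specialized instruments that ordinary users cannot easily access. The paper aims to enable remote assessment from selfie images and lower the barrier to use.
Contribution: Proposes the first method for skin assessment from selfie images without physical measurements, designs a new ViT model, and introduces a symmetric-based contrastive regularization to reduce the bias caused by data imbalance.
Method: A multi-stage pipeline (data collection, preprocessing) leads to the Skin-Prior Adaptive Vision Transformer model for SH/TEWL regression, with a symmetric-based contrastive regularization addressing the data imbalance.
Result: Experiments validate the method, which achieves remote skin health assessment and brings computer vision and skin care research closer together.
Insight: Symmetric-based contrastive regularization is an effective way to handle data imbalance, and the approach may extend to other image-based physiological parameter estimation tasks.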
Abstract: Skin health and disease resistance are closely linked to the skin barrier function, which protects against environmental factors and water loss. Two key physiological indicators can quantitatively represent this barrier function: skin hydration (SH) and trans-epidermal water loss (TEWL). Measurement of SH and TEWL is valuable for the public to monitor skin conditions regularly, diagnose dermatological issues, and personalize their skincare regimens. However, these measurements are not easily accessible to general users unless they visit a dermatology clinic with specialized instruments. To tackle this problem, we propose a systematic solution to estimate SH and TEWL from selfie facial images remotely with smartphones. Our solution encompasses multiple stages, including SH/TEWL data collection, data preprocessing, and formulating a novel Skin-Prior Adaptive Vision Transformer model for SH/TEWL regression. Through experiments, we identified the annotation imbalance of the SH/TEWL data and proposed a symmetric-based contrastive regularization to reduce the model bias due to the imbalance effectively. This work is the first study to explore skin assessment from selfie facial images without physical measurements. It bridges the gap between computer vision and skin care research, enabling AI-driven accessible skin analysis for broader real-world applications.
[103] Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Jiangnan Xie,Xiaolong Zheng,Liang Zheng
Main category: cs.CV
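TL;DR: The paper proposes a prototype-aware multimodal alignment method (PAML) that addresses imperfect modality alignment, insufficient feature fusion, and ineffective use of semantic prototypes in open-vocabulary visual grounding, and performs strongly in both standard and open-vocabulary scenes.
Details
Motivation: Current Transformer-based visual grounding methods work well in standard scenes but poorly in open-vocabulary scenes, mainly because of imperfect modality alignment, insufficient cross-modal feature fusion, and ineffective use of semantic prototype information.
Contribution: Proposes the PAML framework, which uses ALBEF for initial modality alignment, strengthens visual feature encoding, introduces a prototype discovering and inheriting mechanism, and performs comprehensive multimodal fusion through a multi-stage decoder, significantly improving open-vocabulary visual grounding.
Method: 1. Use ALBEF for initial feature encoding; 2. Design a Visual Discriminative Feature Encoder to enhance object representations; 3. Introduce a prototype discovering and inheriting mechanism to extract semantic prototypes; 4. Fuse features and regress bounding boxes through a multi-stage decoder.
Result: Experiments on five benchmark datasets show that PAML performs strongly in standard scenes and reaches state-of-the-art results in open-vocabulary scenes.
Insight: Prototype awareness and multi-stage feature fusion markedly improve visual grounding in open-vocabulary scenes and suggest a new design approach for cross-modal tasks.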
Abstract: Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios with both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components. First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.
[104] Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning
Zhang Jing,Pu Nan,Xie Yu Xiang,Guo Yanming,Lu Qianqi,Zou Shiwei,Yan Jie,Chen Yan
Main category: cs.CV
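TL;DR: The paper introduces the problem of generalized category discovery in video (Video-GCD) and designs a memory-guided contrastive learning framework (MCCL) that integrates spatio-temporal information, significantly improving the discovery of novel categories.
Details
Motivation: Current generalized category discovery (GCD) methods mainly target static images and leave the multi-perspective temporal information in video underused. Video can support more reliable discovery of novel categories, but existing methods do not integrate this information effectively.
Contribution: 1. Defines the Video-GCD problem, extending GCD to the video domain; 2. Designs the MCCL framework, which integrates spatio-temporal information through consistency-aware contrastive learning and memory-guided representation enhancement; 3. Builds a new Video-GCD benchmark.
Method: MCCL has two components: 1. Consistency-Aware Contrastive Learning (CACL) uses temporal features to estimate consistency scores that weight the contrastive loss; 2. Memory-Guided Representation Enhancement (MGRE) uses a dual-level memory buffer to improve intra-class compactness and inter-class separability.
Result: Experiments show that MCCL significantly outperforms image-based GCD methods adapted to video, demonstrating the importance of temporal information for discovering novel categories in video.
Insight: Temporal information is key to discovering novel categories in video, and a framework that dynamically integrates spatio-temporal features improves performance.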
Abstract: Generalized Category Discovery (GCD) is an emerging and challenging open-world problem that has garnered increasing attention in recent years. Most existing GCD methods focus on discovering categories in static images. However, relying solely on static visual content is often insufficient to reliably discover novel categories. To bridge this gap, we extend the GCD problem to the video domain and introduce a new setting, termed Video-GCD. In this setting, effectively integrating multi-perspective information across time is crucial for accurate category discovery. To tackle this challenge, we propose a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework, which explicitly captures temporal-spatial cues and incorporates them into contrastive learning through a consistency-guided voting mechanism. MCCL consists of two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). CACL exploits multi-perspective temporal features to estimate consistency scores between unlabeled instances, which are then used to weight the contrastive loss accordingly. MGRE introduces a dual-level memory buffer that maintains both feature-level and logit-level representations, providing global context to enhance intra-class compactness and inter-class separability. This in turn refines the consistency estimation in CACL, forming a mutually reinforcing feedback loop between representation learning and consistency modeling. To facilitate a comprehensive evaluation, we construct a new and challenging Video-GCD benchmark, which includes action recognition and bird classification video datasets. Extensive experiments demonstrate that our method significantly outperforms competitive GCD approaches adapted from image-based settings, highlighting the importance of temporal information for discovering novel categories in videos. The code will be publicly available.
[105] Text4Seg++: Advancing Image Segmentation via Generative Language Modeling
Mengcheng Lan,Chaofeng Chen,Jiaxing Xu,Zongrui Li,Yiping Ke,Xudong Jiang,Yingchen Yu,Yunqing Zhao,Song Bai
Main category: cs.CV
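TL;DR: The paper proposes Text4Seg++, which recasts image segmentation as a text generation problem and uses semantic descriptors with R-RLE compression to simplify the segmentation pipeline and improve efficiency, significantly outperforming existing methods.
Details
Motivation: Multimodal large language models (MLLMs) excel at vision-language tasks, but integrating image segmentation into them effectively remains challenging. The paper explores a lightweight segmentation approach that needs no additional decoder.
Contribution: 1. Proposes the text-as-mask paradigm, which turns segmentation into text generation; 2. Introduces semantic descriptors and the R-RLE compression technique; 3. Develops the Text4Seg++ framework, which uses semantic bricks to further improve segmentation precision and efficiency.
Method: 1. Use semantic descriptors to map image patches to text labels; 2. Apply R-RLE compression to reduce redundancy; 3. Introduce semantic bricks to refine region representation.
Result: Text4Seg++ outperforms SOTA models on natural and remote sensing datasets without task-specific fine-tuning, and remains compatible with existing MLLM backbones.
Insight: The text-driven segmentation paradigm is efficient and general within the MLLM framework and suggests new approaches for future multimodal tasks.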
Abstract: Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
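The abstract does not specify how R-RLE serializes its output. The sketch below only illustrates the general idea, assuming that consecutive identical patch labels within a row are collapsed into (label, count) runs and that runs never cross a row boundary. The function names and the `label*count` text format are hypothetical, not the paper's.

```python
from itertools import groupby


def encode_row(labels):
    """Collapse consecutive identical labels in one row into (label, count) runs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]


def r_rle_encode(grid):
    """Encode each row independently, so no run spans two rows."""
    return [encode_row(row) for row in grid]


def r_rle_decode(encoded):
    """Expand the runs back into the original grid of patch labels."""
    return [[label for label, count in row for _ in range(count)] for row in encoded]


def to_text(encoded):
    """Serialize the runs as text (hypothetical format: 'label*count', one row per line)."""
    return "\n".join(
        ", ".join(f"{label}*{count}" for label, count in row) for row in encoded
    )
```

For a 2x4 grid of patch labels `[["sky", "sky", "sky", "tree"], ["sky", "tree", "tree", "tree"]]`, the eight labels shrink to four runs, and decoding restores the grid exactly, so the compression is lossless.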
[106] Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap
Ruiming Du,Guangxun Zhai,Tian Qiu,Yu Jiang
Main category: cs.CV
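TL;DR: The paper presents a scalable approach to organ-level 3D plant segmentation that addresses bottlenecks in data, algorithms, and computing, providing tools and a roadmap for plant phenotyping.
Details
Motivation: Precise characterization of plant morphology is essential for studying plant-environment interactions and genetic evolution, but the use of 3D segmentation in plant phenotyping is limited by scarce data, difficulty in adapting algorithms, and the lack of standardized evaluation.
Contribution: 1) Reviews existing 3D plant datasets and deep learning segmentation methods; 2) Introduces the open-source framework Plant Segmentation Studio (PSS); 3) Evaluates sparse convolutional backbones and transformer-based segmentation, highlighting the role of synthetic data in sim-to-real learning.
Method: The paper systematically reviews data and methods, introduces the PSS framework for standardized evaluation, and experimentally evaluates sparse convolutional and transformer architectures.
Result: Sparse convolutional and transformer architectures perform well, and synthetic data generation (modeling-based and augmentation-based) effectively reduces annotation demands.
Insight: Data-efficient and generalizable 3D plant phenotyping requires both algorithmic innovation and standardized tools, and synthetic data is key to easing data scarcity.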
Abstract: The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.
[107] Multi-Modal Camera-Based Detection of Vulnerable Road Users
Penelope Brown,Julie Stephany Berrio Perez,Mao Shan,Stewart Worrall
Main category: cs.CV
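TL;DR: The paper presents a multimodal (RGB and thermal infrared) detection framework for vulnerable road users (VRUs) that fine-tunes YOLOv8 with data augmentation and class-weighted losses, markedly improving detection under adverse conditions.
Details
Motivation: Vulnerable road users such as pedestrians and cyclists account for more than half of global traffic deaths, yet detection remains poor under low light, adverse weather, and imbalanced data. Multimodal fusion and handling of class imbalance are key to better detection.
Contribution: 1. Proposes a multimodal detection framework that combines RGB and thermal infrared images; 2. Uses a fine-tuned YOLOv8 model and class-weighted losses to improve detection of rare VRUs; 3. Shows experimentally that thermal models achieve the highest precision and that RGB-to-thermal augmentation markedly improves recall.
Method: 1. Use RGB and thermal infrared images as input; 2. Fine-tune YOLOv8 with partial backbone freezing and 640-pixel resolution to balance accuracy and efficiency; 3. Use class-weighted losses and data augmentation (such as RGB-to-thermal conversion) to improve minority-class performance and robustness.
Result: Thermal models achieve the highest detection precision, while RGB-to-thermal augmentation markedly improves recall. The multimodal framework outperforms single-modality models in adverse environments.
Insight: Multimodal fusion and data augmentation are effective ways to make VRU detection more robust, especially under low light and data imbalance. Thermal infrared imaging opens new possibilities for detection under adverse conditions.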
Abstract: Vulnerable road users (VRUs) such as pedestrians, cyclists, and motorcyclists represent more than half of global traffic deaths, yet their detection remains challenging in poor lighting, adverse weather, and unbalanced data sets. This paper presents a multimodal detection framework that integrates RGB and thermal infrared imaging with a fine-tuned YOLOv8 model. Training leveraged KITTI, BDD100K, and Teledyne FLIR datasets, with class re-weighting and light augmentations to improve minority-class performance and robustness. Experiments show that 640-pixel resolution and partial backbone freezing optimise accuracy and efficiency, while class-weighted losses enhance recall for rare VRUs. Results highlight that thermal models achieve the highest precision, and RGB-to-thermal augmentation boosts recall, demonstrating the potential of multimodal detection to improve VRU safety at intersections.
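The abstract mentions class re-weighting without giving the scheme. One common heuristic, shown here purely as an assumed illustration and not as the paper's method, weights each class by the inverse of its frequency so that rare classes contribute more to the loss. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: total / (num_classes * count_of_class).

    A class that appears exactly as often as the average class gets weight 1.0;
    rarer classes get larger weights, more common ones smaller weights.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * count) for cls, count in counts.items()}
```

With eight "car" labels and two "pedestrian" labels, this gives the rare pedestrian class four times the weight of the car class. The resulting weights would then multiply the per-class loss terms during training.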
[108] Harnessing Object Grounding for Time-Sensitive Video Understanding
Tz-Ying Wu,Sharath Nittur Sridhar,Subarna Tripathi
Main category: cs.CV
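TL;DR: The paper proposes GO-Tokenizer, a lightweight module that uses off-the-shelf object detectors to encode compact object information on the fly, improving video large language models on time-sensitive video understanding tasks.
Details
Motivation: Time-sensitive video understanding (TSV) tasks usually require the model to capture object information within video frames. Adding textual descriptions of object annotations to the prompt improves performance, but it also lengthens the token sequence and makes the model sensitive to noisy information. A more efficient solution is needed.
Contribution: The main contribution is GO-Tokenizer, a lightweight add-on module that encodes object information on the fly, avoids the drawbacks of textual descriptions, and significantly improves the model's TSV capability.
Method: GO-Tokenizer uses off-the-shelf object detectors to extract and encode object information on the fly and integrates it compactly into the video large language model, reducing the impact of noise and the computational overhead.
Result: Experiments show that models pretrained with GO-Tokenizer outperform the baseline model and its variant augmented with textual descriptions across multiple datasets and tasks, such as reasoning temporal localization and dense captioning.
Insight: Encoding object information on the fly is more efficient than textual description and significantly improves the time-sensitive understanding of video large language models while reducing noise and computational burden.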
Abstract: We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual description of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to the noise in object level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs leveraging off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart utilizing textual description of objects in the prompt. The gain generalizes across different models, datasets and video understanding tasks such as reasoning temporal localization and dense captioning.
[109] Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing
Jeongmin Yu,Susang Kim,Kisu Lee,Taekyoung Kwon,Won-Yong Shin,Ha Young Kim
Main category: cs.CV
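TL;DR: MVP-FAS is a new face anti-spoofing framework that uses Multi-View Slot attention and Multi-Text Patch Alignment modules to generate generalized features from multiple paraphrased texts, improving cross-domain performance.
Details
Motivation: Existing CLIP-based face anti-spoofing methods do not fully exploit CLIP's patch embedding tokens and rely on a single text prompt, which limits generalization.
Contribution: Proposes the MVP-FAS framework, comprising Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA) modules, which use multiple paraphrased texts to strengthen features and semantic robustness.
Method: MVS extracts local and global features using texts with multiple perspectives; MTPA aligns image patches with multiple text representations to improve semantic consistency.
Result: The framework performs strongly on cross-domain datasets and outperforms previous state-of-the-art methods.
Insight: Multiple text inputs markedly improve generalization and semantic robustness, especially in cross-domain tasks.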
Abstract: Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP’s patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., ‘live’ or ‘fake’), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
[110] A Multi-Modal Deep Learning Framework for Colorectal Pathology Diagnosis: Integrating Histological and Colonoscopy Data in a Pilot Study
Krithik Ramesh,Ritvik Koneru
Main category: cs.CV
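TL;DR: The paper proposes a multimodal deep learning framework that integrates histopathology and colonoscopy data for diagnosing colorectal diseases, using a CNN for efficient classification.
Details
Motivation: Traditional colorectal disease diagnosis relies on separate evaluations of pathology slides and colonoscopy footage, which causes inefficiency and inconsistent results. The paper aims to improve diagnostic efficiency and accuracy through a unified multimodal deep learning framework.
Contribution: Proposes a unified deep learning network that integrates histopathology and colonoscopy data for multimodal colorectal disease classification.
Method: A ResNet-50 architecture, combined with class-balancing learning, robust data augmentation, and calibration methods, classifies the PathMNIST and HyperKvasir datasets.
Result: Demonstrates an interpretable and reproducible diagnostic pipeline that improves the efficiency and accuracy of colorectal disease detection.
Insight: Integrating multimodal data can markedly improve the precision and efficiency of medical image diagnosis, especially for complex diseases such as colorectal cancer.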
Abstract: Colorectal diseases, including inflammatory conditions and neoplasms, require quick, accurate care to be effectively treated. Traditional diagnostic pipelines require extensive preparation and rely on separate, individual evaluations of histological images and colonoscopy footage, introducing possible variability and inefficiencies. This pilot study proposes a unified deep learning network that uses convolutional neural networks (CNNs) to classify both histopathological slides and colonoscopy video frames in one pipeline. The pipeline integrates class-balancing learning, robust augmentation, and calibration methods to ensure accurate results. Static colon histology images were taken from the PathMNIST dataset, and the lower gastrointestinal (colonoscopy) videos were drawn from the HyperKvasir dataset. The CNN architecture used was ResNet-50. This study demonstrates an interpretable and reproducible diagnostic pipeline that unifies multiple diagnostic modalities to advance and ease the detection of colorectal diseases.
[111] MRD-LiNet: A Novel Lightweight Hybrid CNN with Gradient-Guided Unlearning for Improved Drought Stress Identification
Aswini Kumar Patra,Lingaraj Sahoo
Main category: cs.CV
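TL;DR: The paper proposes MRD-LiNet, a lightweight hybrid CNN framework combined with a gradient-guided unlearning mechanism, which substantially reduces parameters and computational cost while keeping accuracy high for drought stress identification.
Details
Motivation: Drought stress is a major threat to global crop productivity. Traditional methods are time-consuming and labor-intensive, and existing deep learning models have many parameters and high computational complexity, which limits their use in resource-constrained agricultural settings.
Contribution: 1) Proposes the lightweight hybrid CNN framework MRD-LiNet, with a 15-fold reduction in parameters; 2) Introduces a mechanism based on a gradient norm-based influence function for targeted removal of the influence of specific training data.
Method: The framework combines ideas from the ResNet, DenseNet, and MobileNet architectures and adds a gradient-norm-guided unlearning mechanism to improve model adaptability.
Result: Evaluated on an aerial image dataset of potato fields, the framework maintains high accuracy while substantially lowering computational cost.
Insight: Combining a lightweight design with an unlearning mechanism offers a practical and scalable solution for precision agriculture under resource constraints.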
Abstract: Drought stress is a major threat to global crop productivity, making its early and precise detection essential for sustainable agricultural management. Traditional approaches, though useful, are often time-consuming and labor-intensive, which has motivated the adoption of deep learning methods. In recent years, Convolutional Neural Network (CNN) and Vision Transformer architectures have been widely explored for drought stress identification; however, these models generally rely on a large number of trainable parameters, restricting their use in resource-limited and real-time agricultural settings. To address this challenge, we propose a novel lightweight hybrid CNN framework inspired by ResNet, DenseNet, and MobileNet architectures. The framework achieves a remarkable 15-fold reduction in trainable parameters compared to conventional CNN and Vision Transformer models, while maintaining competitive accuracy. In addition, we introduce a machine unlearning mechanism based on a gradient norm-based influence function, which enables targeted removal of specific training data influence, thereby improving model adaptability. The method was evaluated on an aerial image dataset of potato fields with expert-annotated healthy and drought-stressed regions. Experimental results show that our framework achieves high accuracy while substantially lowering computational costs. These findings highlight its potential as a practical, scalable, and adaptive solution for drought stress monitoring in precision agriculture, particularly under resource-constrained conditions.
[112] AI-based response assessment and prediction in longitudinal imaging for brain metastases treated with stereotactic radiosurgery
Lorenz Achim Kuhn,Daniel Abler,Jonas Richiardi,Andreas F. Hottinger,Luis Schiappacasse,Vincent Dunet,Adrien Depeursinge,Vincent Andrearczyk
Main category: cs.CV
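TL;DR: The paper presents an AI-based automated pipeline for assessing and predicting the response of brain metastases to SRS treatment in longitudinal MRI, combining data-driven clustering with machine learning and reaching high predictive performance.
Details
Motivation: Brain metastases (BM) are a large contributor to mortality in cancer patients, and analyzing longitudinal MRI is currently labor-intensive and relies on visual inspection. The study aims to automate response assessment and early prediction.
Contribution: 1) Curates a large longitudinal dataset; 2) Identifies 5 dominant growth trajectories through data-driven clustering; 3) Predicts BM response with classical and graph machine learning, reaching an AUC of up to 0.90.
Method: 1) An automated pipeline curates the dataset; 2) Data-driven clustering analyzes growth trajectories; 3) Gradient boosting and graph machine learning (GML) predict 12-month BM response.
Result: Clustering identified 5 dominant trajectories with distinct final response categories. Gradient boosting reaches an AUC of up to 0.90 and GML up to 0.88, with GML offering more flexibility.
Insight: The approach automates the assessment and prediction of BM response and lays the foundation for clinical decision support systems that could help optimize personalized care.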
Abstract: Brain Metastases (BM) are a large contributor to mortality of patients with cancer. They are treated with Stereotactic Radiosurgery (SRS) and monitored with Magnetic Resonance Imaging (MRI) at regular follow-up intervals according to treatment guidelines. Analyzing and quantifying this longitudinal imaging represents an intractable workload for clinicians. As a result, follow-up images are not annotated and merely assessed by observation. Response to treatment in longitudinal imaging is being studied, to better understand growth trajectories and ultimately predict treatment success or toxicity as early as possible. In this study, we implement an automated pipeline to curate a large longitudinal dataset of SRS treatment data, resulting in a cohort of 896 BMs in 177 patients who were monitored for >360 days at approximately two-month intervals at Lausanne University Hospital (CHUV). We use a data-driven clustering to identify characteristic trajectories. In addition, we predict 12-month lesion-level response using classical machine learning as well as Graph Machine Learning (GML). Clustering revealed 5 dominant growth trajectories with distinct final response categories. Response prediction reaches up to 0.90 AUC (CI95%=0.88-0.92) using only pre-treatment and first follow-up MRI with gradient boosting. Similarly, robust predictive performance of up to 0.88 AUC (CI95%=0.86-0.90) was obtained using GML, offering more flexibility with a single model for multiple input time-point configurations. Our results suggest potential automation and increased precision for the comprehensive assessment and prediction of BM response to SRS in longitudinal MRI. The proposed pipeline facilitates scalable data curation for the investigation of BM growth patterns, and lays the foundation for clinical decision support systems aiming at optimizing personalized care.
[113] VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results
Yixiao Li,Xin Li,Chris Wei Zhou,Shuo Xing,Hadi Amirpour,Xiaoshuai Hao,Guanghui Yue,Baoquan Zhao,Weide Liu,Xiaoyuan Yang,Zhengzhong Tu,Xinyu Li,Chuanbiao Song,Chenqi Zhang,Jun Lan,Huijia Zhu,Weiqiang Wang,Xiaoyan Sun,Shishun Tian,Dongyang Yan,Weixia Zhang,Junlin Chen,Wei Sun,Zhihua Wang,Zhuohang Shi,Zhizun Luo,Hang Ouyang,Tianxin Xiao,Fan Yang,Zhaowang Wu,Kaixin Deng
Main category: cs.CV
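TL;DR: The paper presents the ISRGC-Q Challenge, which evaluates the quality of generative super-resolution images with a focus on images produced by GANs and diffusion models. The challenge attracted 108 registered participants, and 4 teams submitted valid solutions that reached state-of-the-art performance on the ISRGen-QA dataset.
Details
Motivation: Existing super-resolution image quality assessment datasets do not adequately cover the distinctive artifacts produced by generative methods such as GANs and diffusion models, so new evaluation tools are needed.
Contribution: Introduces the ISRGen-QA dataset and the ISRGC-Q Challenge, which focus on assessing the image quality of generative super-resolution techniques.
Method: The dataset is built from SR images generated by GANs and diffusion models, and SOTA solutions are solicited through a challenge.
Result: The solutions from the 4 teams perform strongly on the dataset and advance research in this area.
Insight: Assessing artifacts from generative super-resolution methods requires dedicated datasets and evaluation frameworks.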
Abstract: This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.
[114] Phantom-Insight: Adaptive Multi-cue Fusion for Video Camouflaged Object Detection with Multimodal LLM
Hua Zhang,Changjiang Luo,Ruoyu Chen
Main category: cs.CV
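TL;DR: Phantom-Insight is a video camouflaged object detection method built on SAM and an MLLM. It uses feature fusion through a multimodal LLM and a decoupled learning strategy to address poor separability of object edges and confusion between foreground and background.
Details
Motivation: Existing SAM-based methods cannot separate the edges of camouflaged objects because the model is frozen, and MLLM-based methods confuse objects because the language model merges foreground and background. A new method is needed to solve both problems.
Contribution: 1. Proposes a video camouflaged object detection framework that combines SAM and an MLLM; 2. Introduces a dynamic foreground visual token scoring module and a prompt network; 3. Designs a decoupled foreground-background learning strategy.
Method: An LLM fuses temporal and spatial clues to increase information density, and SAM is adaptively guided and fine-tuned to handle subtle textures. Foreground and background cues are generated separately and trained in a decoupled manner to improve object separability.
Result: The method achieves SOTA performance on the MoCA-Mask dataset and shows strong generalization on the CAD2016 dataset.
Insight: Feature fusion through a multimodal LLM and a decoupled training strategy are key to improving video camouflaged object detection.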
Abstract: Video camouflaged object detection (VCOD) is challenging due to dynamic environments. Existing methods face two main issues: (1) SAM-based methods struggle to separate camouflaged object edges due to model freezing, and (2) MLLM-based methods suffer from poor object separability as large language models merge foreground and background. To address these issues, we propose a novel VCOD method based on SAM and MLLM, called Phantom-Insight. To enhance the separability of object edge details, we represent video sequences with temporal and spatial clues and perform feature fusion via LLM to increase information density. Next, multiple cues are generated through the dynamic foreground visual token scoring module and the prompt network to adaptively guide and fine-tune the SAM model, enabling it to adapt to subtle textures. To enhance the separability of objects and background, we propose a decoupled foreground-background learning strategy. By generating foreground and background cues separately and performing decoupled training, the visual token can effectively integrate foreground and background information independently, enabling SAM to more accurately segment camouflaged objects in the video. Experiments on the MoCA-Mask dataset show that Phantom-Insight achieves state-of-the-art performance across various metrics. Additionally, its ability to detect unseen camouflaged objects on the CAD2016 dataset highlights its strong generalization ability.
[115] When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection
Rabin Dulal,Lihong Zheng,Muhammad Ashad Kabir
Main category: cs.CV
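TL;DR: The study proposes a zero-shot cattle muzzle detection framework based on Grounding DINO, which uses natural language prompts to detect muzzles without annotated data and shows good adaptability and performance for livestock monitoring.
Details
Motivation: Traditional cattle muzzle detection relies on manual annotation or supervised models, which are costly and generalize poorly. An automated solution that needs no annotated data is required.
Contribution: Presents what the authors describe as the first zero-shot cattle muzzle detection method based on a vision-language model, requiring no task-specific training or annotated data.
Method: The Grounding DINO model is guided by natural language prompts (e.g., "cattle muzzle"), removing the dependence on conventional supervised learning.
Result: Under zero-shot conditions, the model reaches an mAP@0.5 of 76.8%, showing practical potential without annotated data.
Insight: Vision-language models perform well on biometric tasks and enable flexible, scalable detection through natural language prompts.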
Abstract: Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
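For readers unfamiliar with the metric, the "@0.5" in mAP@0.5 means a predicted box counts as a match only if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows that matching criterion only, not the full average-precision computation. The `(x1, y1, x2, y2)` corner format and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_threshold(predicted_box, ground_truth_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count as a hit."""
    return iou(predicted_box, ground_truth_box) >= threshold
```

A predicted muzzle box of `(0, 0, 10, 10)` against a ground-truth box of `(0, 0, 10, 6)` overlaps on 60 of 100 units of area, giving an IoU of 0.6, which clears the 0.5 threshold.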
[116] Cross3DReg: Towards a Large-scale Real-world Cross-source Point Cloud Registration Benchmark
Zongyi Xu,Zhongpeng Lang,Yilong Chen,Shanshan Zhao,Xiaoshui Huang,Yifan Zuo,Yan Zhang,Qianni Zhang,Xinbo Gao
Main category: cs.CV
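TL;DR: The paper builds Cross3DReg, currently the largest real-world multi-modal cross-source point cloud registration dataset, and proposes an overlap-based cross-source registration framework with a visual-geometric attention guided matching module that markedly improves registration accuracy.
Details
Motivation: Cross-source point cloud registration is hindered by the lack of large-scale real-world datasets and by inherent differences between sensors. The paper aims to address both problems and advance research in the field.
Contribution: 1. Builds Cross3DReg, the largest real-world cross-source point cloud registration dataset; 2. Proposes an overlap-based registration framework and a visual-geometric attention guided matching module.
Method: The approach predicts the overlapping region between source and target point clouds and uses a visual-geometric attention guided matching module that fuses image and geometric information to improve feature consistency.
Result: Experiments show that RRE and RTE are reduced by 63.2% and 40.2%, respectively, and RR improves by 5.4%, validating the method.
Insight: Fusing multi-modal information and focusing on overlapping regions is an effective strategy for cross-source point cloud registration.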
Abstract: Cross-source point cloud registration, which aims to align point cloud data from different sensors, is a fundamental task in 3D vision. However, compared to same-source point cloud registration, cross-source registration faces two core challenges: the lack of publicly available large-scale real-world datasets for training deep registration models, and the inherent differences in point clouds captured by multiple sensors. The diverse patterns induced by the sensors pose great challenges for robust and accurate point cloud feature extraction and matching, which negatively influences registration accuracy. To advance research in this field, we construct Cross3DReg, currently the largest real-world multi-modal cross-source point cloud registration dataset, which is collected by a rotating mechanical lidar and a hybrid semi-solid-state lidar. Moreover, we design an overlap-based cross-source registration framework, which utilizes unaligned images to predict the overlapping region between source and target point clouds, effectively filtering out redundant points in the irrelevant regions and significantly mitigating the interference caused by noise in non-overlapping areas. Then, a visual-geometric attention guided matching module is proposed to enhance the consistency of cross-source point cloud features by fusing image and geometric information to establish reliable correspondences and ultimately achieve accurate and robust registration. Extensive experiments show that our method achieves state-of-the-art registration performance. Our framework reduces the relative rotation error (RRE) and relative translation error (RTE) by 63.2% and 40.2%, respectively, and improves the registration recall (RR) by 5.4%, which validates its effectiveness in achieving accurate cross-source registration.
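The abstract reports RRE and RTE without defining them. The sketch below uses the definitions common in the registration literature (geodesic angle between rotations, Euclidean distance between translations); whether the paper uses exactly these forms is an assumption, and the helper names are hypothetical.

```python
import numpy as np


def relative_rotation_error(R_est, R_gt):
    """Angle in degrees of the residual rotation R_gt^T @ R_est."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))


def rot_z(deg):
    """Rotation about the z axis, used only to build a test case."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])
```

Registration recall is typically the fraction of pairs whose RRE and RTE both fall below chosen thresholds; the thresholds used in the paper are not stated in the abstract.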
[117] IGAff: Benchmarking Adversarial Iterative and Genetic Affine Algorithms on Deep Neural Networks
Sebastian-Vasile Echim,Andrei-Alexandru Preda,Dumitru-Clementin Cercel,Florin Pop
Main category: cs.CV
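TL;DR: The paper proposes and evaluates two black-box adversarial attack algorithms: an iterative algorithm based on affine transformations (ATA) and an affine attack combined with a genetic algorithm (AGA). Both perform well across multiple network architectures and datasets, with an accuracy improvement of up to 8.82% over existing methods.
Details
Motivation: Deep neural networks perform well on many tasks, but adversarial attacks in the black-box setting remain challenging. The paper explores new algorithms to expose model weaknesses and improve attack effectiveness. Contribution: 1) Proposes two new black-box adversarial attack algorithms, ATA and AGA; 2) Evaluates them comprehensively across multiple architectures and datasets; 3) Outperforms existing methods on image classification.
Method: 1) ATA optimizes an attack score function through iterative random affine transformations; 2) AGA combines a genetic algorithm with random noise and affine transformations.
Result: Experiments show that the new algorithms perform well as adversarial attacks, with an accuracy improvement of up to 8.82%, and reveal effective strategies for adversarial defense.
Insight: With parameter tuning and algorithm variants, the attacks perform well in both global and targeted settings, while also improving the understanding of model robustness.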
Abstract: Deep neural networks currently dominate many fields of the artificial intelligence landscape, achieving state-of-the-art results on numerous tasks while remaining hard to understand and exhibiting surprising weaknesses. An active area of research focuses on adversarial attacks, which aim to generate inputs that uncover these weaknesses. However, this proves challenging, especially in the black-box scenario where model details are inaccessible. This paper explores in detail the impact of such adversarial algorithms on ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer network architectures. Leveraging the Tiny ImageNet, Caltech-256, and Food-101 datasets, we benchmark two novel black-box iterative adversarial algorithms based on affine transformations and genetic algorithms: 1) Affine Transformation Attack (ATA), an iterative algorithm maximizing our attack score function using random affine transformations, and 2) Affine Genetic Attack (AGA), a genetic algorithm that involves random noise and affine transformations. We evaluate the performance of the models in the algorithm parameter variation, data augmentation, and global and targeted attack configurations. We also compare our algorithms with two black-box adversarial algorithms, Pixle and Square Attack. Our experiments yield better results on the image classification task than similar methods in the literature, achieving an accuracy improvement of up to 8.82%. We provide noteworthy insights into successful adversarial defenses and attacks at both global and targeted levels, and demonstrate adversarial robustness through algorithm parameter variation.
[118] Focusing by Contrastive Attention: Enhancing VLMs’ Visual Reasoning
Yuyao Ge,Shenghua Liu,Yiwei Wang,Lingrui Mei,Baolong Bi,Xuanshan Zhou,Jiayu Yao,Jiafeng Guo,Xueqi Cheng
Main category: cs.CV
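TL;DR: The paper proposes CARVE, a training-free attention-contrasting method that improves visual reasoning by analyzing VLMs' attention patterns.
Details
Motivation: Vision-language models (VLMs) degrade in complex visual environments; existing methods rely on additional training or external tools and overlook the innate attention capability of VLMs. Contribution: 1) Finds a strong correlation between visual complexity and attention entropy; 2) Proposes CARVE, which extracts task-relevant signals by contrasting attention maps; 3) Proves theoretically that attention contrasting is effective.
Method: CARVE is a training-free method that contrasts the attention maps of general and task-specific queries to separate semantic signal from visual noise.
Result: CARVE substantially improves performance, by up to 75% on open-source models.
Insight: Visual complexity shapes attention patterns, and attention contrasting is an effective route to better visual reasoning.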
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs’ attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; and (3) the contrast of attention maps between general queries and task-specific queries provably enables the decomposition of the visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
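To make the two quantities in this abstract concrete, the sketch below computes the Shannon entropy of an attention map and a simple contrast between a task-specific and a general attention map. The entropy is standard; the contrast rule (element-wise division followed by renormalization) is only an assumed stand-in, since the abstract does not give CARVE's actual formula, and both function names are hypothetical.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of an attention map treated as a distribution."""
    p = np.asarray(attn, dtype=float).ravel()
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())


def contrast_attention(task_attn, general_attn, eps=1e-6):
    """Assumed contrast rule: divide task-specific attention by general
    attention so that regions attended regardless of the question are
    down-weighted, then renormalize to sum to one."""
    ratio = np.asarray(task_attn, dtype=float) / (np.asarray(general_attn, dtype=float) + eps)
    return ratio / ratio.sum()
```

A diffuse map has high entropy and a sharply focused one has entropy near zero, which is the sense in which the abstract links visual complexity to attention entropy.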
[119] Does DINOv3 Set a New Medical Vision Standard?
Che Liu,Yinda Chen,Haoyuan Shi,Jinpeng Lu,Bailiang Jian,Jiazhen Pan,Linghan Cai,Jiayi Wang,Yundi Zhang,Jun Li,Cosmin I. Bercea,Cheng Ouyang,Chen Chen,Zhiwei Xiong,Benedikt Wiestler,Christian Wachinger,Daniel Rueckert,Wenjia Bai,Rossella Arcucci
Main category: cs.CV
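TL;DR: Can DINOv3, a state-of-the-art self-supervised vision Transformer, serve as a powerful unified encoder for medical imaging tasks without domain-specific pre-training? Extensive experiments show that DINOv3 performs strongly across many medical tasks and even outperforms some medical-specific foundation models, but it has limitations in scenarios requiring deep domain adaptation (such as pathology slides, electron microscopy, and PET imaging). Its scaling behavior in the medical domain is also inconsistent.
Details
Motivation: Large-scale vision foundation models excel on natural images, but how well they transfer to specialized domains such as medical imaging remains underexplored. The paper evaluates whether DINOv3 can be applied directly to medical imaging tasks and provides a new baseline for the field. Contribution: 1. Systematically evaluates DINOv3 across a range of medical vision tasks; 2. Finds that DINOv3 can serve as a powerful unified encoder for medical tasks but is limited in scenarios requiring deep domain adaptation; 3. Reveals inconsistent scaling behavior of DINOv3 in the medical domain.
Method: Benchmarks DINOv3 at different model sizes and input resolutions across a wide range of medical imaging tasks (such as 2D/3D classification and segmentation), and analyzes its performance and scaling behavior.
Result: DINOv3 performs strongly on multiple medical tasks and outperforms some medical-specific models, but performs poorly on tasks requiring deep domain adaptation. Its performance also does not scale consistently with model size or resolution.
Insight: DINOv3's strong visual features can serve as an effective prior for medical tasks, but its limitations in deep domain adaptation and scaling suggest that future work should focus on targeted optimization and on enforcing multiview consistency.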
Abstract: The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: The model’s features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
[120] WS$^2$: Weakly Supervised Segmentation using Before-After Supervision in Waste Sorting
Andrea Marelli,Alberto Foresti,Leonardo Pesce,Giacomo Boracchi,Mario Grosso
Main category: cs.CV
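TL;DR: The paper proposes Before-After Supervision, a weakly supervised segmentation approach that uses the differences between images taken before and after an operator removes items in industrial waste sorting, avoiding the need for dense annotation, and introduces the WS² dataset.
Details
Motivation: Visual recognition in industrial waste sorting still usually depends on human operators; fully supervised methods have high annotation costs, and weakly supervised methods are underexplored. Contribution: 1. Defines the concept of Before-After Supervision; 2. Releases the WS² dataset (more than 11,000 high-resolution frames); 3. Designs an end-to-end pipeline for benchmarking methods.
Method: Trains a segmentation network using the differences between images taken before and after the operator removes items, avoiding pixel-level annotation.
Result: Several state-of-the-art weakly supervised segmentation methods are benchmarked on the WS² dataset, demonstrating the approach's effectiveness.
Insight: Differences caused by actions in industrial settings can be turned into weak supervision signals, reducing reliance on manual annotation.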
Abstract: In industrial quality control, to visually recognize unwanted items within a moving heterogeneous stream, human operators are often still indispensable. Waste-sorting stands as a significant example, where operators on multiple conveyor belts manually remove unwanted objects to select specific materials. To automate this recognition problem, computer vision systems offer great potential in accurately identifying and segmenting unwanted items in such settings. Unfortunately, considering the multitude and the variety of sorting tasks, fully supervised approaches are not a viable option to address this challenge, as they require extensive labeling efforts. Surprisingly, weakly supervised alternatives that leverage the implicit supervision naturally provided by the operator in their removal action are relatively unexplored. In this paper, we define the concept of Before-After Supervision, illustrating how to train a segmentation network by leveraging only the visual differences between images acquired before and after the operator. To promote research in this direction, we introduce WS$^2$ (Weakly Supervised segmentation for Waste-Sorting), the first multiview dataset consisting of more than 11,000 high-resolution video frames captured on top of a conveyor belt, including “before” and “after” images. We also present a robust end-to-end pipeline, used to benchmark several state-of-the-art weakly supervised segmentation methods on WS$^2$.
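The idea of using before/after differences as a supervision signal can be illustrated with a toy pseudo-mask. This is only a sketch under strong assumptions that the abstract does not state: the two images are already pixel-aligned, intensities lie in [0, 1], and a fixed threshold is adequate. The function name and threshold are hypothetical, and the paper's actual pipeline is likely more involved.

```python
import numpy as np


def removal_pseudo_mask(before, after, threshold=0.2):
    """Mark pixels whose appearance changed between the two images.

    Assumes aligned images of equal shape with values in [0, 1].
    Returns a boolean mask that is True where an item was likely removed.
    """
    diff = np.abs(before.astype(float) - after.astype(float))
    if diff.ndim == 3:
        # Average the per-channel differences for colour images.
        diff = diff.mean(axis=-1)
    return diff > threshold
```

On a real conveyor belt the two views are taken at different positions, so some registration step would be needed before a pixel-wise difference is meaningful.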
[121] TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement
Jibai Lin,Bo Ma,Yating Yang,Rong Ma,Turghun Osman,Ahtamjan Ahmat,Rui Dong,Lei Wang,Xi Zhou
Main category: cs.CV
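TL;DR: The paper proposes TIDE, a framework that uses target supervision and preference learning to resolve the conflict between subject preservation and instruction following in subject-driven image generation (SDIG), without test-time fine-tuning.
Details
Motivation: Subject-driven image generation must balance preserving subject identity against following dynamic edit instructions, which existing methods do not handle well. Contribution: Proposes the TIDE framework, which introduces target-supervised triplet alignment and the Direct Subject Diffusion (DSD) objective, using preference learning to balance subject preservation and instruction compliance.
Method: Models subject adaptation dynamics with (reference image, instruction, target image) triplets and trains the model on systematically generated "winning" (balanced) and "losing" (distorted) targets.
Result: On multiple standard benchmarks, TIDE outperforms baselines in both subject fidelity and instruction compliance, and it applies to diverse tasks.
Insight: Through implicit reward modeling and evaluation with quantitative metrics, TIDE achieves an effective balance in subject-driven generation and suggests new directions for applying diffusion models.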
Abstract: Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired “winning” (balanced preservation-compliance) and “losing” (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE’s superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE’s versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.
[122] On the Reproducibility of “FairCLIP: Harnessing Fairness in Vision-Language Learning”
Hua Chang Bakker,Stan Fris,Angela Madelon Bernardy,Stan Deutekom
Main category: cs.CV
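TL;DR: The paper studies the reproducibility of FairCLIP and finds that the published model description differs from the original implementation, so it introduces A-FairCLIP and FairCLIP+. The results show that FairCLIP does not significantly improve CLIP's performance or fairness.
Details
Motivation: To verify whether FairCLIP truly improves the fairness and performance of CLIP; the model description of the original method was found to differ from its implementation. Contribution: 1. Introduces A-FairCLIP and FairCLIP+; 2. Tests FairCLIP's claimed fairness and performance gains and finds that they fall short of expectations.
Method: Reproduces the FairCLIP experiments, compares the original implementation with the new implementation A-FairCLIP, and extends the objective to multiple attributes with FairCLIP+.
Result: Experiments show that FairCLIP fails to improve CLIP's fairness or performance in zero-shot glaucoma classification.
Insight: Validating fairness methods requires rigorous experimental design; simple distance minimization may not be enough to improve performance or fairness.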
Abstract: We investigated the reproducibility of FairCLIP, proposed by Luo et al. (2024), for improving the group fairness of CLIP (Radford et al., 2021) by minimizing image-text similarity score disparities across sensitive groups using the Sinkhorn distance. The experimental setup of Luo et al. (2024) was reproduced to primarily investigate the research findings for FairCLIP. The model description by Luo et al. (2024) was found to differ from the original implementation. Therefore, a new implementation, A-FairCLIP, is introduced to examine specific design choices. Furthermore, FairCLIP+ is proposed to extend the FairCLIP objective to include multiple attributes. Additionally, the impact of the distance minimization on FairCLIP’s fairness and performance was explored. In alignment with the original authors, CLIP was found to be biased towards certain demographics when applied to zero-shot glaucoma classification using medical scans and clinical notes from the Harvard-FairVLMed dataset. However, the experimental results on two datasets do not support their claim that FairCLIP improves the performance and fairness of CLIP. Although the regularization objective reduces Sinkhorn distances, both the official implementation and the aligned implementation, A-FairCLIP, were found to improve neither performance nor fairness in zero-shot glaucoma classification.
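For readers unfamiliar with the Sinkhorn distance mentioned above, the following is a minimal sketch of entropy-regularized optimal transport between two discrete histograms. It is a generic textbook formulation, not the FairCLIP or A-FairCLIP code; the function name, the regularization value, and the iteration count are assumptions.

```python
import numpy as np


def sinkhorn_distance(a, b, cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport cost between histograms a and b.

    a, b  : non-negative weights with strictly positive entries, each summing to 1
    cost  : pairwise cost matrix of shape (len(a), len(b))
    reg   : entropic regularization strength (smaller approaches exact OT)
    Returns (transport cost, transport plan).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.asarray(cost, dtype=float)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan
```

In a fairness regularizer of the kind described, a and b would be the empirical distributions of similarity scores for the whole population and for one sensitive group; for very small reg a log-domain implementation is numerically safer.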
[123] Benchmarking EfficientTAM on FMO datasets
Senem Aktas,Charles Markham,John McDonald,Rozenn Dahyot
Main category: cs.CV
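TL;DR: The paper introduces a JSON metadata file (FMOX) for FMO datasets, extended with object size information. It evaluates EfficientTAM on the FMO datasets using TIoU scores and finds performance comparable to purpose-built pipelines. The tools are shared as open source.
Details
Motivation: Tracking fast, tiny objects is a challenge in computer vision; the authors aim to support other machine learning pipelines by standardizing metadata and extending the available information. Contribution: 1) Introduces the FMOX format to extend dataset information; 2) Evaluates EfficientTAM on FMO datasets; 3) Releases the code and JSON files as open source.
Method: Supplements the datasets with FMOX-format JSON files and evaluates EfficientTAM using TIoU scores.
Result: EfficientTAM performs well on the FMO datasets, comparable to purpose-built pipelines.
Insight: A standardized metadata format such as FMOX makes further research easier, and a general-purpose model such as EfficientTAM can also perform well on a specialized task.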
Abstract: Fast and tiny object tracking remains a challenge in computer vision. In this paper, we first introduce a JSON metadata file associated with four open source datasets of Fast Moving Object (FMO) image sequences. In addition, we extend the description of the FMO datasets with additional ground truth information in JSON format (called FMOX) with object size information. Finally, we use our FMOX file to test a recently proposed foundational model for tracking (called EfficientTAM), showing that its performance compares well with the pipelines originally tailored for these FMO datasets. Our comparison of these state-of-the-art techniques on FMOX is provided with Trajectory Intersection over Union (TIoU) scores. The code and JSON are shared open source, allowing FMOX to be accessible and usable for other machine learning pipelines aiming to process FMO datasets.
[124] Evolving from Unknown to Known: Retentive Angular Representation Learning for Incremental Open Set Recognition
Runqing Yang,Yimin Fu,Changyuan Wu,Zhunga Liu
Main category: cs.CV
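TL;DR: The paper proposes retentive angular representation learning (RARL) for incremental open set recognition (IOSR). It aligns unknown representations in an angular space to reduce representation drift during knowledge updates, and combines virtual-intrinsic interactive training with a stratified rectification strategy to refine decision boundaries.
Details
Motivation: Existing open set recognition (OSR) methods are typically designed for static scenarios and cannot recognize, and acquire knowledge about, unknown classes that emerge in continuous data streams. As a result, decision boundaries become hard to keep discriminative, causing severe inter-class confusion. Contribution: Proposes RARL for the IOSR task; designs a virtual-intrinsic interactive training strategy and a stratified rectification strategy that improve representation learning and decision boundaries; establishes a new IOSR benchmark on CIFAR100 and TinyImageNet.
Method: Aligns unknown representations around inactive prototypes in an angular space; uses virtual classes to enforce inter-class margins for known classes; applies stratified rectification to mitigate imbalances between old and new classes and between positive and negative samples.
Result: RARL achieves the best performance across various task setups, clearly outperforming existing methods.
Insight: Angular-space representations and interaction with virtual classes are key to keeping decision boundaries discriminative in incremental learning; stratified rectification effectively mitigates the representation bias caused by data imbalance.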
Abstract: Existing open set recognition (OSR) methods are typically designed for static scenarios, where models aim to classify known classes and identify unknown ones within fixed scopes. This deviates from the expectation that the model should incrementally identify newly emerging unknown classes from continuous data streams and acquire corresponding knowledge. In such evolving scenarios, the discriminability of OSR decision boundaries is hard to maintain due to restricted access to former training data, causing severe inter-class confusion. To solve this problem, we propose retentive angular representation learning (RARL) for incremental open set recognition (IOSR). In RARL, unknown representations are encouraged to align around inactive prototypes within an angular space constructed under the equiangular tight frame, thereby mitigating excessive representation drift during knowledge updates. Specifically, we adopt a virtual-intrinsic interactive (VII) training strategy, which compacts known representations by enforcing clear inter-class margins through boundary-proximal virtual classes. Furthermore, a stratified rectification strategy is designed to refine decision boundaries, mitigating representation bias and feature space distortion caused by imbalances between old/new and positive/negative class samples. We conduct thorough evaluations on CIFAR100 and TinyImageNet datasets and establish a new benchmark for IOSR. Experimental results across various task setups demonstrate that the proposed method achieves state-of-the-art performance.
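The abstract places prototypes in an angular space built under an equiangular tight frame (ETF). A minimal sketch of one common construction, the simplex ETF known from neural-collapse work, is shown below; that the paper uses exactly this construction is an assumption, and the function name and arguments are hypothetical.

```python
import numpy as np


def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim, num_classes) matrix whose columns are unit-norm
    prototypes with identical pairwise cosine -1 / (num_classes - 1)."""
    if dim < num_classes:
        raise ValueError("this sketch requires dim >= num_classes")
    K = num_classes
    rng = np.random.default_rng(seed)
    # Orthonormal basis U (dim x K) from a reduced QR decomposition.
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    # Centering matrix removes the mean direction shared by all prototypes.
    center = np.eye(K) - np.ones((K, K)) / K
    return np.sqrt(K / (K - 1)) * (U @ center)
```

Because all prototypes are fixed and maximally separated in advance, some of them can be held back as "inactive" slots, which matches the abstract's description of aligning unknown representations around inactive prototypes until new classes are learned.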
[125] CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
Xin Kong,Daniel Watson,Yannick Strümpler,Michael Niemeyer,Federico Tombari
Main category: cs.CV
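TL;DR: CausNVS is an autoregressive multi-view diffusion model that addresses the limited flexibility (a fixed number of views) and slow inference of existing non-autoregressive methods, supports arbitrary input-output view configurations, and achieves high-quality novel view synthesis.
Details
Motivation: Current multi-view diffusion models are mostly non-autoregressive, which limits their use in world modeling (they support only a fixed number of views and are slow at inference). CausNVS addresses these issues with an autoregressive formulation. Contribution: 1. Proposes CausNVS, an autoregressive multi-view diffusion model that supports arbitrary view configurations; 2. Trains with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise control; 3. Uses a spatially-aware sliding window and noise conditioning augmentation to mitigate drift.
Method: 1. An autoregressive design that generates views sequentially; 2. CaPE for encoding camera poses; 3. Causal masking and per-frame noise during training; 4. A sliding window combined with noise conditioning augmentation at inference.
Result: Experiments show that CausNVS supports a broad range of camera trajectories, is flexible, and maintains consistent visual quality across diverse settings.
Insight: An autoregressive formulation is a viable direction for multi-view diffusion models; CaPE and noise conditioning augmentation are key to stability and control.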
Abstract: Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
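Causal masking over frames, as mentioned in the abstract, can be pictured as a block lower-triangular attention mask: tokens of a view attend to tokens of the same view and of earlier views only. The sketch below is a generic illustration, assuming a fixed number of tokens per frame; the function name is hypothetical and the mask layout in CausNVS itself may differ.

```python
import numpy as np


def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask of shape (N, N) with N = num_frames * tokens_per_frame.

    mask[q, k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]
```

With such a mask, keys and values of already generated frames never change, which is what makes the key-value caching mentioned in the abstract possible at inference time.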
[126] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising
Yichao Liu,YueYang Teng
Main category: cs.CV
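TL;DR: The paper proposes the Hybrid Swin Attention Network (HSANet) for denoising low-dose PET and CT images. Efficient Global Attention modules and a hybrid upsampling module improve performance while keeping the model lightweight.
Details
Motivation: Low-dose CT (LDCT) and PET imaging reduce radiation exposure, but the accompanying noise and artifacts hurt diagnostic accuracy. The study aims to develop an efficient denoising method that improves image quality and suits clinical settings. Contribution: The main contribution is HSANet, which combines Efficient Global Attention (EGA) modules with a hybrid upsampling module to strengthen spatial and channel-wise interaction while reducing the risk of overfitting to noise.
Method: HSANet uses Efficient Global Attention (EGA) modules to capture global features and a hybrid upsampling module to improve image reconstruction. The method is validated on a public LDCT/PET dataset.
Result: Experiments show that HSANet outperforms existing denoising methods while remaining lightweight enough for deployment on GPUs with standard memory configurations.
Insight: Combining global attention with hybrid upsampling effectively improves denoising of low-dose medical images while balancing practicality and computational efficiency.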
Abstract: Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network’s capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.
[127] Investigating Location-Regularised Self-Supervised Feature Learning for Seafloor Visual Imagery
Cailei Liang,Adrian Bodenmann,Emma J Curtis,Samuel Simmons,Kazunori Nagano,Stan Brown,Adam Riese,Blair Thornton
Main category: cs.CV
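TL;DR: The study examines how location regularisation affects self-supervised feature learning (SSL) for seafloor image analysis and finds that it improves classification performance, with particularly strong effects for low-dimensional latent representations and ViT models.
Details
Motivation: Efficient interpretation of seafloor imagery is essential for marine monitoring and exploration. Location metadata may improve SSL, but its effect across different SSL strategies and models is unclear. Contribution: 1. Evaluates the effect of location regularisation on six SSL frameworks; 2. Finds that location regularisation improves classification performance, especially for low-dimensional latent representations; 3. Shows that ViT models generalise strongly.
Method: 1. Selects six SSL frameworks (CNN and ViT); 2. Applies location regularisation; 3. Evaluates on three seafloor image datasets; 4. Analyses the effect of latent dimensionality.
Result: Location regularisation improves F1 scores by an average of 4.9% for CNNs and 6.3% for ViTs. Pre-trained ViTs generalise strongly, matching the best location-regularised SSL.
Insight: 1. Location metadata is valuable for SSL regularisation; 2. Location regularisation helps CNNs most with low-dimensional latent representations; 3. High-dimensional ViTs perform strongly in seafloor image analysis.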
Abstract: High-throughput interpretation of robotically gathered seafloor visual imagery can increase the efficiency of marine monitoring and exploration. Although recent research has suggested that location metadata can enhance self-supervised feature learning (SSL), its benefits across different SSL strategies, models and seafloor image datasets are underexplored. This study evaluates the impact of location-based regularisation on six state-of-the-art SSL frameworks, which include Convolutional Neural Network (CNN) and Vision Transformer (ViT) models with varying latent-space dimensionality. Evaluation across three diverse seafloor image datasets finds that location-regularisation consistently improves downstream classification performance over standard SSL, with average F1-score gains of $4.9 \pm 4.0\%$ for CNNs and $6.3 \pm 8.9\%$ for ViTs, respectively. While CNNs pretrained on generic datasets benefit from high-dimensional latent representations, dataset-optimised SSL achieves similar performance across the high (512) and low (128) dimensional latent representations. Location-regularised SSL improves CNN performance over pre-trained models by $2.7 \pm 2.7\%$ and $10.1 \pm 9.4\%$ for high and low-dimensional latent representations, respectively. For ViTs, high-dimensionality benefits both pre-trained and dataset-optimised SSL. Although location-regularisation improves SSL performance compared to standard SSL methods, pre-trained ViTs show strong generalisation, matching the best-performing location-regularised SSL with F1-scores of $0.795 \pm 0.075$ and $0.795 \pm 0.077$, respectively. The findings highlight the value of location metadata for SSL regularisation, particularly when using low-dimensional latent representations, and demonstrate strong generalisation of high-dimensional ViTs for seafloor image analysis.
[128] Online Clustering of Seafloor Imagery for Interpretation during Long-Term AUV Operations
Cailei Liang,Adrian Bodenmann,Sam Fenton,Blair Thornton
Main category: cs.CV
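TL;DR: The paper proposes an online clustering framework (OCF) for interpreting seafloor imagery in real time during long-term AUV operations. It is unsupervised, supports dynamic cluster merging and splitting, and combines high accuracy with low computational cost.
Details
Motivation: As long-term AUV operations grow, real-time interpretation of seafloor imagery becomes essential, but conventional offline methods rely on complete datasets and human labels and cannot meet real-time requirements, so an unsupervised online clustering method is needed. Contribution: Proposes an online clustering framework (OCF) that processes continuous data streams in real time, dynamically adjusts the cluster structure (merging and splitting), and summarises the data history efficiently through representative samples, avoiding reprocessing of the full image history.
Method: OCF maintains representative samples that capture the evolving feature distribution, supports cluster adjustments without reprocessing historical data, and compares representative sampling strategies for their effect on clustering accuracy and computational cost.
Result: On three diverse seafloor image datasets, OCF achieves an average F1 score of 0.68, the highest among the compared online clustering methods, and its computation time stays bounded as data volume grows, showing robustness and efficiency.
Insight: OCF's efficiency and real-time operation make it suitable for generating data summaries and supporting path planning in long-term autonomous marine exploration, showing the potential of unsupervised learning in dynamic environments.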
Abstract: As long-endurance and seafloor-resident AUVs become more capable, there is an increasing need for extended, real-time interpretation of seafloor imagery to enable adaptive missions and optimise communication efficiency. Although offline image analysis methods are well established, they rely on access to complete datasets and human-labelled examples to manage the strong influence of environmental and operational conditions on seafloor image appearance, requirements that cannot be met in real-time settings. To address this, we introduce an online clustering framework (OCF) capable of interpreting seafloor imagery without supervision, which is designed to operate in real-time on continuous data streams in a scalable, adaptive, and self-consistent manner. The method enables the efficient review and consolidation of common patterns across the entire data history in constant time by identifying and maintaining a set of representative samples that capture the evolving feature distribution, supporting dynamic cluster merging and splitting without reprocessing the full image history. We evaluate the framework on three diverse seafloor image datasets, analysing the impact of different representative sampling strategies on both clustering accuracy and computational cost. The OCF achieves the highest average F1 score of 0.68 across the three datasets among all comparative online clustering approaches, with a standard deviation of 3% across three distinct survey trajectories, demonstrating its superior clustering capability and robustness to trajectory variation. In addition, it maintains consistently lower and bounded computational time as the data volume increases. These properties are beneficial for generating survey data summaries and supporting informative path planning in long-term, persistent autonomous marine exploration.
[129] BioLite U-Net: Edge-Deployable Semantic Segmentation for In Situ Bioprinting Monitoring
Usman Haider,Lukasz Szemet,Daniel Kelly,Vasileios Sergis,Andrew C. Daly,Karl Mason
Main category: cs.CV
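TL;DR: The paper proposes BioLite U-Net, a lightweight semantic segmentation framework for real-time bioprinting monitoring. Depthwise separable convolutions greatly reduce the computational load while preserving accuracy, making it suitable for resource-constrained embedded devices.
Details
Motivation: Real-time monitoring of print quality and biological viability during bioprinting is needed, but existing methods struggle to segment efficiently with limited image data and resource-constrained hardware. Contribution: 1. Proposes the BioLite U-Net architecture; 2. Presents an annotated dataset of 787 RGB images; 3. Demonstrates efficient performance on an embedded device.
Method: A U-Net architecture built on depthwise separable convolutions, which greatly reduces computation.
Result: On a Raspberry Pi 4B, the model achieves 92.85% mIoU and a 96.17% Dice score, with inference taking 335 ms per frame.
Insight: Combining a lightweight design with efficient inference offers a feasible approach to real-time semantic segmentation on edge devices.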
Abstract: Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell-laden bioinks. Ensuring the fidelity and consistency of printed structures in real-time remains a core challenge, particularly under constraints imposed by limited imaging data and resource-constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real-time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U-Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3-based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real-world feasibility. The proposed BioLite U-Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2-DeepLabV3+. On-device inference takes 335 ms per frame, demonstrating near real-time capability. Compared to MobileNet baselines, BioLite U-Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.
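The saving from depthwise separable convolutions comes from replacing one dense k×k convolution with a per-channel k×k convolution followed by a 1×1 pointwise convolution. The parameter-count sketch below makes the arithmetic explicit; the channel sizes in the usage note are illustrative assumptions, not layer sizes from BioLite U-Net, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise
```

For example, a 3×3 layer mapping 64 to 128 channels needs 73,728 weights in the standard form and 8,768 in the separable form, roughly an 8.4× reduction. The 1300× size difference reported in the abstract therefore also reflects overall architecture choices, not this per-layer saving alone.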
[130] Cortex-Synth: Differentiable Topology-Aware 3D Skeleton Synthesis with Hierarchical Graph Attention
Mohamed Zayaan S
Main category: cs.CV
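TL;DR: Cortex-Synth is an end-to-end differentiable framework that generates 3D skeleton geometry and topology from a single 2D image. It reports substantial gains from a hierarchical graph attention mechanism, differentiable spectral topology optimization, and adversarial geometric consistency training.
Details
Motivation: Existing methods for generating 3D skeletons from 2D images often produce errors in geometry and topology and lack joint optimization. Cortex-Synth aims to address this. Contribution: 1. A hierarchical graph attention mechanism; 2. Differentiable spectral topology optimization; 3. Adversarial geometric consistency training; 4. An end-to-end differentiable framework.
Method: Combines a pseudo-3D point cloud generator, an enhanced PointNet encoder, a skeleton coordinate decoder, and a Differentiable Graph Construction Network (DGCN), which are optimized together.
Result: On ShapeNet, MPJPE improves by 18.7% and Graph Edit Distance by 27.3%, with topological errors reduced by 42%.
Insight: End-to-end differentiability opens up applications in robotic manipulation, medical imaging, and automated character rigging.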
Abstract: We present Cortex-Synth, a novel end-to-end differentiable framework for joint 3D skeleton geometry and topology synthesis from single 2D images. Our architecture introduces three key innovations: (1) a hierarchical graph attention mechanism with multi-scale skeletal refinement, (2) differentiable spectral topology optimization via Laplacian eigendecomposition, and (3) adversarial geometric consistency training for pose structure alignment. The framework integrates four synergistic modules: a pseudo 3D point cloud generator, an enhanced PointNet encoder, a skeleton coordinate decoder, and a novel Differentiable Graph Construction Network (DGCN). Our experiments demonstrate state-of-the-art results with 18.7 percent improvement in MPJPE and 27.3 percent in Graph Edit Distance on ShapeNet, while reducing topological errors by 42 percent compared to previous approaches. The model’s end-to-end differentiability enables applications in robotic manipulation, medical imaging, and automated character rigging.
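The spectrum of a graph Laplacian encodes topology, which is why an eigendecomposition can drive topology optimization: for instance, the number of zero eigenvalues equals the number of connected components. The sketch below illustrates this with NumPy on a fixed adjacency matrix. It is not the paper's method; a differentiable version would use an autodiff framework and soft edge weights, and the function names are hypothetical.

```python
import numpy as np


def laplacian_spectrum(adjacency):
    """Eigenvalues (ascending) and eigenvectors of L = D - A
    for a symmetric adjacency matrix A."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigh(L)


def num_connected_components(adjacency, tol=1e-8):
    """Count near-zero Laplacian eigenvalues, one per connected component."""
    eigvals, _ = laplacian_spectrum(adjacency)
    return int((eigvals < tol).sum())
```

A skeleton that should be a single connected tree has exactly one zero eigenvalue, so a loss on the second-smallest eigenvalue is one plausible way to penalize disconnected predictions.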
[131] Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training
Ruicheng Zhang,Jun Zhou,Zunnan Xu,Zihao Liu,Jiehui Huang,Mingyang Zhang,Yu Sun,Xiu Li
Main category: cs.CV
TL;DR: Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training (Zo3T) introduces a novel framework to generate realistic videos from images using motion instructions without expensive fine-tuning.
Details
Motivation: Existing methods for trajectory-guided video generation are computationally expensive or produce unrealistic motion due to ignored 3D perspectives and latent misalignment. Contribution: Zo3T introduces three innovations: 3D-Aware Kinematic Projection, Trajectory-Guided Test-Time LoRA, and Guidance Field Rectification, significantly improving motion accuracy and 3D realism.
Method: The framework leverages 3D scene depth, dynamically optimizes LoRA adapters, and refines the guidance field through a one-step lookahead strategy.
Result: Zo3T outperforms both training-based and zero-shot approaches in generating realistic and accurate trajectory-controlled videos.
Insight: Test-time training and dynamic adaptation during inference can effectively address motion alignment and 3D consistency in video generation tasks.
Abstract: Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network’s noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
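To illustrate what injecting a LoRA adapter means in general, here is a minimal pure-Python sketch of the usual low-rank update y = Wx + (alpha / r) * B(Ax), where W stays frozen and only A and B are trained. It is not Zo3T's implementation: the function names, the `alpha` scaling, and the tiny matrices are assumptions for illustration, and the abstract does not say which layers receive adapters or how they are optimized.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank (rows of A)."""
    rank = len(A)
    base = matmul(W, x)               # frozen pre-trained path
    delta = matmul(B, matmul(A, x))   # low-rank trainable path
    scale = alpha / rank
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight (2x2), illustrative values
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
x = [[1.0], [1.0]]             # input as a column vector

# With B initialised to zero the adapter leaves the output unchanged.
unchanged = lora_forward(W, A, [[0.0], [0.0]], x)
# After B is updated, only the low-rank term shifts the output.
adapted = lora_forward(W, A, [[1.0], [0.0]], x)
```

Because the adapter is a separate additive term, it can be discarded after each generation, which matches the "ephemeral" adapters the abstract describes.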
[132] Event Spectroscopy: Event-based Multispectral and Depth Sensing using Structured Light
Christian Geckeler,Niklas Neugebauer,Manasi Muglikar,Davide Scaramuzza,Stefano Mintchev
Main category: cs.CV
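TL;DR: The paper presents a novel event spectroscopy system that combines structured light with multispectral imaging for UAV navigation and data collection in forest environments. The system performs well in both depth reconstruction and spectral imaging and markedly improves material differentiation accuracy.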
Details
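Motivation: Traditional sensing approaches for UAVs in forest environments, such as passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light. A lightweight, high-performance perception system is needed to overcome these limitations.
Contribution: 1. A single-sensor system that integrates structured light with multispectral imaging. 2. High-resolution, low-latency depth reconstruction and multispectral imaging. 3. Validation in a real rainforest environment, showing advantages in material differentiation and RGB reconstruction.
Method: Depth is reconstructed with structured light, and spectral capture is controlled by modulating the wavelength of the projected light (bands between 650 nm and 850 nm). A portable version supports RGB capture (three wavelengths).
Result: 1. Depth RMSE improves by up to 60% over commercial depth sensors. 2. Spectral accuracy is comparable to commercial multispectral cameras. 3. Adding depth data improves material differentiation accuracy by over 30% compared with a color-only method.
Insight: Integrating structured light with multispectral imaging offers a new technical route for UAV perception in complex natural environments, particularly in low-light, densely cluttered scenes such as forests.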
Abstract: Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light - especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction and multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to 60% improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB (3 wavelengths) is used to collect real-world depth and spectral data from a Masoala Rainforest. We demonstrate the use of this prototype for color image reconstruction and material differentiation between leaves and branches using spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over 30% compared to a color-only method. Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation - paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.
[133] Pothole Detection and Recognition based on Transfer Learning
Mang Hu,Qianqian Xia
Main category: cs.CV
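TL;DR: The paper proposes ResNet50-EfficientNet-RegNet, a transfer-learning-based deep feature extraction network for pothole detection and recognition. With data augmentation and model optimization, it achieves high accuracy and computational efficiency on the test sets.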
Details
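Motivation: With advances in computer vision and machine learning, automated pothole detection from image and video data has clear societal value; this study aims to improve detection performance through transfer learning.
Contribution: Builds a transfer learning model that combines ResNet50, EfficientNet, and RegNet, and validates its performance on pothole detection experimentally.
Method: Applies preprocessing (standardization, normalization, and data augmentation), refines the network with transfer learning, and compares it with other models using metrics such as Accuracy, Recall, Precision, F1-score, and FPS.
Result: Classification accuracy reaches 97.78% (88/90) on the initial 90-sample test set and 98.89% (890/900) on the expanded 900-sample test set, outperforming Random Forest, MLP, SVM, and LightGBM.
Insight: Transfer learning combined with multi-model fusion can substantially improve pothole detection, and data augmentation and model optimization have a significant effect on the results.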
Abstract: With the rapid development of computer vision and machine learning, automated methods for pothole detection and recognition based on image and video data have received significant attention. It is of great significance for social development to conduct an in-depth analysis of road images through feature extraction, thereby achieving automatic identification of the pothole condition in new images. Consequently, this is the main issue addressed in this study. Based on preprocessing techniques such as standardization, normalization, and data augmentation applied to the collected raw dataset, we continuously improved the network model based on experimental results. Ultimately, we constructed a deep learning feature extraction network ResNet50-EfficientNet-RegNet model based on transfer learning. This model exhibits high classification accuracy and computational efficiency. In terms of model evaluation, this study employed a comparative evaluation approach by comparing the performance of the proposed transfer learning model with other models, including Random Forest, MLP, SVM, and LightGBM. The comparison analysis was conducted based on metrics such as Accuracy, Recall, Precision, F1-score, and FPS, to assess the classification performance of the transfer learning model proposed in this paper. The results demonstrate that our model exhibits high performance in terms of recognition speed and accuracy, surpassing the performance of other models. Through careful parameter selection and model optimization, our transfer learning model achieved a classification accuracy of 97.78% (88/90) on the initial set of 90 test samples and 98.89% (890/900) on the expanded test set.
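The reported accuracies follow directly from the counts given in the abstract, and the other listed metrics have standard definitions. The sketch below reproduces the arithmetic; the confusion counts passed to `precision_recall_f1` are invented for illustration, since the abstract does not report them.

```python
def accuracy(correct, total):
    """Fraction of correctly classified samples."""
    return correct / total


def precision_recall_f1(tp, fp, fn):
    """Standard binary precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the abstract: 88/90 and 890/900.
initial_acc = round(100 * accuracy(88, 90), 2)
expanded_acc = round(100 * accuracy(890, 900), 2)

# Hypothetical confusion counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```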
[134] Raw2Event: Converting Raw Frame Camera into Event Camera
Zijie Ning,Enmin Lin,Sudarshan R. Iyengar,Patrick Vandewalle
Main category: cs.CV
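TL;DR: Raw2Event is a hardware-software system that turns low-cost raw frame cameras into event cameras in real time, overcoming the high cost and feature limitations of conventional event cameras.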
Details
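Motivation: Event cameras excel under challenging lighting conditions, but high cost, limited resolution, and missing features such as autofocus hinder their broad adoption.
Contribution: Proposes the Raw2Event system, which uses raw Bayer data and the DVS-Voltmeter model to generate event streams with higher dynamic range and higher resolution.
Method: Accesses raw Bayer data directly, bypassing the ISP, and combines a configurable simulation framework with a synchronized data acquisition pipeline, optimized for deployment on embedded platforms.
Result: Experiments show that the generated event streams closely resemble those from real event cameras while supporting higher resolution and autofocus.
Insight: The system gives users flexible configuration options and suits low-cost event-based vision research and early-stage development.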
Abstract: Event cameras offer unique advantages such as high temporal resolution, low latency, and high dynamic range, making them more and more popular for vision tasks under challenging light conditions. However, their high cost, limited resolution, and lack of features such as autofocus hinder their broad adoption, particularly for early-stage development and prototyping. In this work, we present Raw2Event, a complete hardware-software system that enables real-time event generation from low-cost raw frame-based cameras. By leveraging direct access to raw Bayer data and bypassing traditional image signal processors (ISP), our system is able to utilize the full potential of camera hardware, delivering higher dynamic range, higher resolution, and more faithful output than RGB-based frame-to-event converters. Built upon the DVS-Voltmeter model, Raw2Event features a configurable simulation framework optimized for deployment on embedded platforms. We further design a data acquisition pipeline that supports synchronized recording of raw, RGB, and event streams, facilitating downstream evaluation and dataset creation. Experimental results show that Raw2Event can generate event streams closely resembling those from real event cameras, while benefiting from higher resolution and autofocus capabilities. The system also supports user-intuitive parameter tuning, enabling flexible adaptation to various application requirements. Finally, we deploy the system on a Raspberry Pi for real-time operation, providing a scalable and cost-effective solution for event-based vision research and early-stage system development. The codes are available online: https://anonymous.4open.science/r/raw2event-BFF2/README.md.
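For readers unfamiliar with frame-to-event conversion, the sketch below shows the simplest idealized event model: a pixel emits an event whenever its log intensity moves by more than a contrast threshold from a stored reference. This is deliberately not the DVS-Voltmeter model that Raw2Event builds on, which models sensor noise and voltage dynamics. The function name, threshold value, and flat-list frame layout are assumptions.

```python
import math


def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Return (frame_index, pixel_index, polarity) events from flat intensity frames.

    Idealized model: one event per `threshold` step of log-intensity change.
    """
    ref = [math.log(v + eps) for v in frames[0]]  # per-pixel reference level
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for i, v in enumerate(frame):
            diff = math.log(v + eps) - ref[i]
            while abs(diff) >= threshold:
                polarity = 1 if diff > 0 else -1
                events.append((t, i, polarity))
                ref[i] += polarity * threshold   # move reference toward new level
                diff = math.log(v + eps) - ref[i]
    return events


# Pixel 1 doubles in brightness; pixel 0 is static and produces no events.
brightening = frames_to_events([[1.0, 1.0], [1.0, 2.0]], threshold=0.5)
# A strong darkening step produces several negative events at once.
darkening = frames_to_events([[1.0], [0.25]], threshold=0.5)
```

Working in log intensity is what gives event sensors their wide dynamic range, which is one reason access to unprocessed raw data is attractive for this kind of conversion.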
[135] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
Sai Kartheek Reddy Kasu,Mohammad Zia Ur Rehman,Shahid Shafi Dar,Rishi Bharat Junghare,Dhanvin Sanjay Namboodiri,Nagendra Kumar
Main category: cs.CV
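TL;DR: The paper proposes D-HUMOR, a reasoning-augmented multimodal framework for understanding dark humor. With a new dataset and a Tri-stream Cross-Reasoning Network (TCRNet), it outperforms baselines on dark humor detection, target identification, and intensity prediction.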
Details
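Motivation: Dark humor in online memes is uniquely challenging because it relies on implicit, sensitive, and culturally contextual cues. Existing resources and methods are insufficient, so new datasets and models are needed.
Contribution: 1. Introduces a new dataset of 4,379 Reddit memes annotated for dark humor, target category, and intensity level. 2. Proposes the D-HUMOR framework, which combines multimodal reasoning with a Role-Reversal Self-Loop to refine generated explanations. 3. Designs TCRNet, which fuses text, image, and reasoning features through cross-modal attention.
Method: 1. A large vision-language model (VLM) generates structured explanations and refines them iteratively through the Role-Reversal Self-Loop. 2. A text encoder and a vision transformer extract textual and visual features. 3. TCRNet fuses the three streams via pairwise attention for classification.
Result: D-HUMOR outperforms baseline methods on dark humor detection, target identification, and intensity prediction.
Insight: 1. Understanding dark humor requires combining multimodal information with contextual reasoning. 2. The Role-Reversal Self-Loop improves the completeness and alignment of explanations. 3. Cross-modal attention effectively fuses information from multiple sources.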
Abstract: Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author’s perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
[136] P3-SAM: Native 3D Part Segmentation
Changfeng Ma,Yang Li,Xinhao Yan,Jiachen Xu,Yunhan Yang,Chunshi Wang,Zibo Zhao,Yanwen Guo,Zhuo Chen,Chunchao Guo
Main category: cs.CV
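TL;DR: P3-SAM is a native 3D point-promptable part segmentation model designed to segment 3D objects fully automatically, with strong robustness on complex objects.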
Details
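Motivation: Existing 3D part segmentation methods are not robust on complex objects and cannot be fully automated, which limits model reuse and downstream applications.
Contribution: 1. Proposes P3-SAM, which supports fully automatic 3D part segmentation. 2. Designs an algorithm that automatically selects and merges masks. 3. Builds a new dataset of nearly 3.7 million models.
Method: P3-SAM combines a feature extractor, multiple segmentation heads, and an IoU predictor for interactive segmentation, plus an automatic mask selection and merging algorithm.
Result: Achieves precise segmentation and strong robustness on complex objects, reaching state-of-the-art performance.
Insight: With point prompts and automatic mask processing, P3-SAM is both efficient and general for 3D segmentation tasks.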
Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D objects into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on any complex objects, attaining state-of-the-art performance. Our code will be released soon.
[137] SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis
Zhengqing Chen,Ruohong Mei,Xiaoyang Guo,Qingjie Wang,Yubin Hu,Wei Yin,Weiqiang Ren,Qian Zhang
Main category: cs.CV
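TL;DR: SynthDrive proposes a scalable real2sim2real sensor simulation pipeline that uses 3D generation to automate asset mining and rare-case data synthesis, addressing the limited diversity and generality of existing CG-based and learning-based methods.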
Details
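Motivation: Existing sensor simulation methods are limited in diversity and generality: CG-based methods such as CARLA lack diversity and scale poorly, while learning-based methods such as NeuSim are restricted to specific object categories and require extensive multi-sensor data.
Contribution: Proposes a scalable real2sim2real system that uses 3D generation to automate asset mining and rare-case data synthesis, improving the diversity and applicability of sensor simulation.
Method: Combines the real2sim2real approach with 3D generation to automate asset mining, asset generation, and rare-case data synthesis.
Result: The system generates diverse, high-fidelity sensor data and handles generic objects.
Insight: Combining 3D generation with sensor simulation can supply richer and rarer data for autonomous driving perception training.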
Abstract: In the field of autonomous driving, sensor simulation is essential for generating rare and diverse scenarios that are difficult to capture in real-world environments. Current solutions fall into two categories: 1) CG-based methods, such as CARLA, which lack diversity and struggle to scale to the vast array of rare cases required for robust perception training; and 2) learning-based approaches, such as NeuSim, which are limited to specific object categories (vehicles) and require extensive multi-sensor data, hindering their applicability to generic objects. To address these limitations, we propose a scalable real2sim2real system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis.
[138] MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
George Ciubotariu,Zhuyun Zhou,Zongwei Wu,Radu Timofte
Main category: cs.CV
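TL;DR: MIORe and VAR-MIORe are two new multi-task datasets that address key limitations of current motion restoration benchmarks. They capture complex motion scenarios with high-frame-rate (1000 FPS) acquisition and professional-grade optics, generate consistent motion blur, and offer control over variable motion magnitude.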
Details
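Motivation: Current motion restoration benchmarks are limited: they lack support for complex motion scenarios and high-frame-rate data, which constrains the evaluation and improvement of algorithms.
Contribution: 1. Introduces the MIORe and VAR-MIORe datasets for multi-task restoration research. 2. Generates consistent motion blur from high-frame-rate captures. 3. VAR-MIORe adds explicit control over motion magnitude. 4. Provides high-resolution, scalable ground truths.
Method: 1. Captures complex motion scenarios at 1000 FPS with professional equipment. 2. Generates motion blur by adaptively averaging frames based on computed optical flow metrics. 3. VAR-MIORe extends this to a variable range of motion magnitudes.
Result: The datasets challenge existing algorithms under both controlled and adverse conditions and lay the groundwork for next-generation image and video restoration research.
Insight: With high-frame-rate capture and a multi-task design, future restoration algorithms can better handle complex motion and blur.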
Abstract: We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur, and preserves sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research of various image and video restoration tasks.
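One plausible reading of "adaptively averaging frames based on computed optical flow metrics" is sketched below: consecutive high-frame-rate frames are accumulated until their summed flow magnitude reaches a target, and the window is then averaged into one blurred frame. The abstract does not specify the actual rule, so the stopping criterion, the function names, and the flat-list frame layout are all assumptions.

```python
def select_window(flow_magnitudes, start, target_motion):
    """Return the inclusive (first, last) frame indices of the averaging window.

    flow_magnitudes[i] is the mean optical-flow magnitude between frame i and i + 1.
    Frames are added until the accumulated motion reaches target_motion.
    """
    end = start
    total = 0.0
    while end < len(flow_magnitudes) and total < target_motion:
        total += flow_magnitudes[end]
        end += 1
    return start, end


def average_frames(frames, first, last):
    """Average frames[first..last] pixel-wise to synthesize motion blur."""
    window = frames[first:last + 1]
    return [sum(pixels) / len(window) for pixels in zip(*window)]


frames = [[0.0], [3.0], [6.0], [9.0], [12.0]]                  # five one-pixel frames
slow = select_window([1.0, 1.0, 1.0, 1.0], 0, target_motion=2.0)  # slow motion: longer window
fast = select_window([4.0, 1.0, 1.0, 1.0], 0, target_motion=2.0)  # fast motion: shorter window
blurred = average_frames(frames, *slow)
```

With a rule of this kind, slow scenes average more frames and fast scenes fewer, so the resulting blur extent stays roughly constant; raising `target_motion` would be one way to obtain the larger motion amplitudes that VAR-MIORe spans.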
[139] UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Yufeng Cheng,Wenxu Wu,Shaojin Wu,Mengqi Huang,Fei Ding,Qian He
Main category: cs.CV
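TL;DR: UMO (Unified Multi-identity Optimization) uses reinforcement learning on diffusion models to address identity consistency and identity confusion with multi-reference images. It introduces a new dataset and evaluation metric and significantly improves identity preservation in image customization methods.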
Details
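Motivation: Humans are sensitive to faces, yet existing image customization methods struggle to keep identities consistent and tend to confuse them when given multiple reference images; a scalable solution is needed.
Contribution: 1. Proposes the UMO framework, which achieves multi-identity consistency through global assignment optimization and reinforcement learning. 2. Builds a multi-reference image dataset with both synthesized and real parts. 3. Designs a new metric for identity confusion.
Method: UMO adopts a "multi-to-multi matching" paradigm, formulating multi-identity generation as a global assignment optimization problem and applying reinforcement learning on diffusion models.
Result: Experiments show that UMO significantly improves identity consistency and reduces confusion, reaching state of the art among open-source methods.
Insight: Applying reinforcement learning to diffusion models offers a new approach to multi-identity consistency, and the new dataset and metric lay a foundation for future work.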
Abstract: Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a “multi-to-multi matching” paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO
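The idea of treating identity matching as a global assignment problem can be sketched on a toy similarity matrix. The example assumes `similarity[i][j]` scores reference identity i against generated face j and searches all one-to-one assignments by brute force. How UMO turns the matching into a reward is not described in the abstract, so this only illustrates the assignment step, and the scores are invented.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return the one-to-one assignment (reference i -> face perm[i]) with maximal total similarity."""
    n = len(similarity)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(similarity[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score


# Both references look most like face 0 when considered independently,
# which is the kind of identity confusion a global assignment avoids.
similarity = [
    [0.90, 0.80],
    [0.85, 0.10],
]
assignment, total = best_assignment(similarity)
```

Brute force is factorial in the number of identities and only suits tiny examples; a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the practical choice.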
[140] Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning
Dipta Neogi,Nourash Azmine Chowdhury,Muhammad Rafsan Kabir,Mohammad Ashrafuzzaman Khan
Main category: cs.CV
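TL;DR: The paper proposes a video-based MPAA rating prediction method. An attention-driven hybrid architecture trained with contrastive learning substantially improves classification performance, particularly in separating PG-13 from R-rated content.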
Details
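Motivation: As visual content consumption grows rapidly, traditional methods fall short because of large labeled-data requirements, poor generalization, and inefficient feature learning. An automated, efficient video classification method is needed for the MPAA rating standard.
Contribution: 1) Proposes a hybrid architecture that combines contrastive learning with an attention mechanism. 2) Explores three contrastive learning frameworks and finds that Contextual Contrastive Learning performs best. 3) Achieves 88% accuracy and an F1 score of 0.8815.
Method: Combines CNNs for spatial features, LSTMs for temporal modeling, and a Bahdanau attention mechanism for dynamic frame weighting, with contrastive learning (Instance Discrimination, Contextual Contrastive Learning, Multi-View Contrastive Learning) to improve discrimination.
Result: Under the Contextual Contrastive Learning framework, the model reaches 88% accuracy and an F1 score of 0.8815, significantly better than traditional methods.
Insight: Combining contrastive learning with attention effectively improves video classification, especially where fine-grained distinctions are required, such as PG-13 versus R-rated content.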
Abstract: The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model’s performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
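Of the contrastive losses named above, NT-Xent is easy to write down. The sketch below assumes the common formulation in which a batch holds 2N embeddings, the positive for index i is index (i + N) mod 2N, and similarities are temperature-scaled cosines. The temperature value and the toy embeddings are assumptions; the abstract gives no hyperparameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings; the positive of index i is (i + N) mod 2N."""
    count = len(z)
    half = count // 2
    total = 0.0
    for i in range(count):
        pos = (i + half) % count
        # Temperature-scaled similarities to every other sample in the batch.
        logits = {k: cosine(z[i], z[k]) / temperature for k in range(count) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[pos]
    return total / count


aligned = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]     # positives coincide
misaligned = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 0.0)]  # positives are orthogonal
loss_aligned = nt_xent(aligned)
loss_misaligned = nt_xent(misaligned)
```

The loss is small when each clip's two views embed close together and far from other clips, which is the property the paper relies on for separating borderline ratings.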
[141] Curia: A Multi-Modal Foundation Model for Radiology
Corentin Dancette,Julien Khlaut,Antoine Saporta,Helene Philippe,Elodie Ferreres,Baptiste Callard,Théo Danielou,Léo Alberge,Léo Machado,Daniel Tordjman,Julie Dupuis,Korentin Le Floch,Jean Du Terrail,Mariam Moshiri,Laurent Dercle,Tom Boeken,Jules Gregory,Maxime Ronot,François Legou,Pascal Roux,Marc Sapoval,Pierre Manceron,Paul Hérent
Main category: cs.CV
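TL;DR: Curia is a multimodal foundation model for radiology trained on several years of real hospital data (150,000 exams, 130 TB). On a 19-task validation benchmark it meets or surpasses radiologists and other foundation models.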
Details
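Motivation: AI-assisted radiological interpretation currently relies on narrow, single-task models that cannot cover the broad range of modalities and diseases. Foundation models (FMs) promise generalization across modalities and in low-data settings, but this potential has remained largely unrealized in radiology.
Contribution: Proposes Curia, a radiology foundation model trained on what the authors describe as the largest corpus of real-world hospital imaging data of its kind. It supports cross-modality and low-data settings and meets or surpasses existing methods and radiologists.
Method: The training data covers the hospital's entire cross-sectional imaging output over several years. A newly curated 19-task external validation benchmark evaluates organ identification, condition detection, and outcome prediction.
Result: Across the 19 tasks, Curia meets or surpasses radiologists and other recent foundation models, and exhibits clinically significant emergent properties in cross-modality and low-data regimes.
Insight: Foundation models trained on large-scale real-world data show considerable potential in radiology and could advance practical AI-assisted diagnosis.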
Abstract: AI-assisted radiological interpretation is based on predominantly narrow, single-task models. This approach is impractical for covering the vast spectrum of imaging modalities, diseases, and radiological findings. Foundation models (FMs) hold the promise of broad generalization across modalities and in low-data settings. However, this potential has remained largely unrealized in radiology. We introduce Curia, a foundation model trained on the entire cross-sectional imaging output of a major hospital over several years, which to our knowledge is the largest such corpus of real-world data, encompassing 150,000 exams (130 TB). On a newly curated 19-task external validation benchmark, Curia accurately identifies organs, detects conditions like brain hemorrhages and myocardial infarctions, and predicts outcomes in tumor staging. Curia meets or surpasses the performance of radiologists and recent foundation models, and exhibits clinically significant emergent properties in cross-modality and low-data regimes. To accelerate progress, we release our base model’s weights at https://huggingface.co/raidium/curia.
[142] Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis
Simon Pezold,Jérôme A. Kurylec,Jan S. Liechti,Beat P. Müller,Joël L. Lavanchy
Main category: cs.CV
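TL;DR: The study examines how adapting a generic foundation model via transfer learning and integrating multimodal data from the operating room can support surgical data science. Experiments show that both domain adaptation and multimodal integration improve model performance.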
Details
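Motivation: The study explores how a generic foundation model and multimodal data (video and time-resolved data streams) can improve surgical data science and thereby provide more accurate tools for surgical decision support.
Contribution: 1. Demonstrates the potential of the generic foundation model V-JEPA for surgical data science. 2. Shows that domain adaptation (finetuning) and multimodal data integration improve model performance.
Method: 1. Uses V-JEPA as the base model and finetunes it on unlabeled surgical video. 2. Integrates additional time-resolved data streams from the operating room by training a separate encoder to form a shared representation space with V-JEPA's embeddings.
Result: 1. Finetuning on domain-specific data improves performance. 2. Integrating additional time-resolved data benefits the model on the in-house dataset. 3. On the public HeiCo dataset, the pretrained video-only baseline is on par with the top-performing submissions of the EndoVis2017 challenge.
Insight: Generic foundation models can be applied effectively to surgical data science through domain adaptation and multimodal data integration, pointing to directions for future research and applications.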
Abstract: We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model’s downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA’s embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.
[143] ToonOut: Fine-tuned Background-Removal for Anime Characters
Matteo Muratori,Joël Seytre
Main category: cs.CV
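TL;DR: ToonOut addresses the poor performance of existing background removal models on anime-style content by fine-tuning BiRefNet on a custom dataset, which markedly improves background removal accuracy.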
Details
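Motivation: Existing background removal models excel at realistic imagery but underperform on anime-style content, particularly complex hair and transparency, so this domain needs targeted optimization.
Contribution: 1. Collects and annotates a custom dataset of 1,228 high-quality anime images. 2. Fine-tunes BiRefNet, raising background removal accuracy on anime images from 95.3% to 99.5%.
Method: Fine-tunes the open-source BiRefNet model on the custom anime image dataset.
Result: On the newly introduced Pixel Accuracy metric, background removal accuracy on anime images rises from 95.3% to 99.5%.
Insight: Domain-specific datasets and fine-tuning can substantially improve background removal on content in a particular style, such as anime.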
Abstract: While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: https://github.com/MatteoKartoon/BiRefNet.
[144] Automated Radiographic Total Sharp Score (ARTSS) in Rheumatoid Arthritis: A Solution to Reduce Inter-Intra Reader Variation and Enhancing Clinical Practice
Hajar Moradmand,Lei Ren
Main category: cs.CV
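TL;DR: The study proposes ARTSS, a deep-learning framework that automates the radiographic Total Sharp Score for assessing rheumatoid arthritis severity. It aims to reduce the subjectivity and time cost of scoring and to make clinical practice more efficient.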
Details
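Motivation: Assessing rheumatoid arthritis (RA) severity usually relies on manual scoring such as the TSS, which is time-consuming and subjective. The authors therefore seek an automated solution that reduces scoring variability and handles joint disappearance and variable-length image sequences.
Contribution: 1. Proposes the ARTSS framework, which combines several deep learning models (ResNet50, UNet.3, YOLOv7, ViT, and others) for fully automated RA scoring. 2. Handles joint disappearance and variable joint numbers. 3. Reduces scoring time and variability.
Method: The pipeline has four stages: 1) image pre-processing and re-orientation with ResNet50; 2) hand segmentation with UNet.3; 3) joint identification with YOLOv7; 4) TSS prediction with models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and ViT. Evaluation metrics include IoU, MAP, MAE, RMSE, and Huber loss.
Result: The joint identification model reaches 99% accuracy, and the best TSS prediction model, ViT, achieves a Huber loss of 0.87. The study indicates that ARTSS can reduce scoring variability and support more accurate clinical decisions.
Insight: Deep learning can effectively automate RA scoring and address the limitations of manual methods. The framework's modular, multi-stage design also offers a reference for other medical image analysis tasks.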
Abstract: Assessing the severity of rheumatoid arthritis (RA) using the Total Sharp/Van Der Heijde Score (TSS) is crucial, but manual scoring is often time-consuming and subjective. This study introduces an Automated Radiographic Sharp Scoring (ARTSS) framework that leverages deep learning to analyze full-hand X-ray images, aiming to reduce inter- and intra-observer variability. The research uniquely accommodates patients with joint disappearance and variable-length image sequences. We developed ARTSS using data from 970 patients, structured into four stages: I) Image pre-processing and re-orientation using ResNet50, II) Hand segmentation using UNet.3, III) Joint identification using YOLOv7, and IV) TSS prediction using models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). We evaluated model performance with Intersection over Union (IoU), Mean Average Precision (MAP), mean absolute error (MAE), Root Mean Squared Error (RMSE), and Huber loss. The average TSS from two radiologists was used as the ground truth. Model training employed 3-fold cross-validation, with each fold consisting of 452 training and 227 validation samples, and external testing included 291 unseen subjects. Our joint identification model achieved 99% accuracy. The best-performing model, ViT, achieved a notably low Huber loss of 0.87 for TSS prediction. Our results demonstrate the potential of deep learning to automate RA scoring, which can significantly enhance clinical practice. Our approach addresses the challenge of joint disappearance and variable joint numbers, offers timesaving benefits, reduces inter- and intra-reader variability, improves radiologist accuracy, and aids rheumatologists in making more informed decisions.
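Two details of the abstract are easy to check numerically: the Huber loss used to score TSS predictions, and the patient split. The sketch assumes the standard Huber definition with delta = 1.0, which the abstract does not state, and uses made-up scores. Only the split sizes (452, 227, 291) and the total (970) come from the text.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def mean_huber(pred, target, delta=1.0):
    """Average Huber loss over paired predictions and targets."""
    return sum(huber(p - t, delta) for p, t in zip(pred, target)) / len(pred)


# Illustrative TSS predictions versus radiologist-average targets (made-up numbers).
loss = mean_huber([10.0, 20.0], [10.5, 23.0])

# Split sizes from the abstract: each fold has 452 training and 227 validation
# samples, plus 291 externally tested subjects.
cross_validation_pool = 452 + 227
total_patients = cross_validation_pool + 291
```

The linear tail makes the loss less sensitive than squared error to a few badly mispredicted scores, which is presumably why it is reported alongside MAE and RMSE.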
[145] Matching Shapes Under Different Topologies: A Topology-Adaptive Deformation Guided Approach
Aymen Merrouche,Stefanie Wuhrer,Edmond Boyer
Main category: cs.CV
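TL;DR: The paper proposes a topology-adaptive deformation model for non-rigid 3D mesh matching that handles changes in topology, making it suitable for real-world data with topological noise.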
Details
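Motivation: Existing methods assume the deformation is near-isometric (Functional Maps) or ARAP (ARAP-like deformation-guided approaches), but these assumptions break down under topological noise in real-world scenarios such as multi-view reconstructions.
Contribution: Proposes a topology-adaptive deformation model that jointly optimizes a template mesh and its alignment, applicable to highly non-isometric shapes and shapes with topological noise.
Method: A topology-adaptive deformation model under ARAP and bijective association constraints jointly optimizes the template mesh and its alignment with the input shapes to extract correspondences.
Result: Without relying on any data-driven prior, the method performs well on 3D alignment with topological noise, even outperforming methods trained on large datasets.
Insight: A topology-adaptive model relaxes the assumption of consistent topology in shape matching, making it better suited to the complex topological changes seen in practice.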
Abstract: Non-rigid 3D mesh matching is a critical step in computer vision and computer graphics pipelines. We tackle matching meshes that contain topological artefacts which can break the assumption made by current approaches. While Functional Maps assume the deformation induced by the ground truth correspondences to be near-isometric, ARAP-like deformation-guided approaches assume the latter to be ARAP. Neither assumption holds in certain topological configurations of the input shapes. We are motivated by real-world scenarios such as per-frame multi-view reconstructions, often suffering from topological artefacts. To this end, we propose a topology-adaptive deformation model allowing changes in shape topology to align shape pairs under ARAP and bijective association constraints. Using this model, we jointly optimise for a template mesh with adequate topology and for its alignment with the shapes to be matched to extract correspondences. We show that, while not relying on any data-driven prior, our approach applies to highly non-isometric shapes and shapes with topological artefacts, including noisy per-frame multi-view reconstructions, even outperforming methods trained on large datasets in 3D alignment quality.
[146] Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers
Morteza Kiani Haftlang,Mohammadhossein Malmir,Foroutan Parand,Umberto Michelucci,Safouane El Ghazouali
Main category: cs.CV
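TL;DR: The paper proposes Barlow-Swin, a lightweight real-time medical image segmentation architecture that combines a Swin Transformer-like encoder with a U-Net-like decoder and uses Barlow Twins self-supervised pretraining to improve performance.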
Details
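Motivation: Convolutional architectures such as U-Net have a limited receptive field in medical image segmentation and cannot model global context, while existing transformer-based models are computationally expensive and unsuitable for real-time use.
Contribution: 1) Designs a lightweight hybrid Swin Transformer/U-Net architecture. 2) Uses Barlow Twins self-supervised pretraining to improve feature learning. 3) Achieves competitive accuracy with a reduced parameter count.
Method: 1) A Swin-like encoder and a U-Net-like decoder are connected via skip pathways. 2) The encoder is first pretrained with Barlow Twins, then the whole model is fine-tuned.
Result: On benchmark tasks, the model shows competitive accuracy, fast inference, and a low parameter count, making it suitable for real-time clinical deployment.
Insight: Self-supervised pretraining can markedly improve feature learning in lightweight models, and hybrid architectures are practical in resource-limited settings.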
Abstract: Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder’s ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at the GitHub repository: https://github.com/mkianih/Barlow-Swin.
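The phrase "reducing unnecessary repetition in the learned features" refers to the Barlow Twins objective, which pushes the cross-correlation matrix between two augmented views toward the identity. The sketch below implements that loss in plain Python under the usual formulation. The weight `lam = 5e-3` and the toy embeddings are assumptions, and nothing here is taken from the Barlow-Swin code.

```python
import math


def standardize(columns):
    """Normalize each feature column to zero mean and unit variance over the batch."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out.append([(v - mean) / std for v in col])  # assumes a non-constant column
    return out


def barlow_twins_loss(za, zb, lam=5e-3):
    """Sum of (1 - C_ii)^2 plus lam times the sum of C_ij^2 for i != j, with C the cross-correlation."""
    n = len(za)
    a_cols = standardize(list(zip(*za)))
    b_cols = standardize(list(zip(*zb)))
    on_diag = off_diag = 0.0
    for i, a in enumerate(a_cols):
        for j, b in enumerate(b_cols):
            c = sum(x * y for x, y in zip(a, b)) / n
            if i == j:
                on_diag += (1.0 - c) ** 2   # invariance term
            else:
                off_diag += c ** 2          # redundancy-reduction term
    return on_diag + lam * off_diag


view_a = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
identical = barlow_twins_loss(view_a, view_a)
# Second view with its two feature dimensions swapped: diagonal correlations vanish.
swapped = barlow_twins_loss(view_a, [[b, a] for a, b in view_a])
```

The diagonal term rewards features that agree across augmentations, while the off-diagonal term penalizes different feature dimensions for encoding the same information.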
[147] BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration
Cem Eteke,Alexander Griessel,Wolfgang Kellerer,Eckehard Steinbach
Main category: cs.CV
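TL;DR: BIR-Adapter is a low-complexity adapter that applies pretrained diffusion models to blind image restoration without training any auxiliary feature extractor.
Details
Motivation: Exploit the prior of large-scale pretrained diffusion models for blind image restoration while avoiding the cost of training a separate feature extractor. Contribution: (1) BIR-Adapter, a low-complexity adapter; (2) a self-attention mechanism extended to use features of the degraded image; (3) a sampling guidance mechanism that reduces hallucinations; (4) an adapter design that makes the model more flexible.
Method: Extract features from the degraded image, extend the self-attention mechanism with them, and introduce a sampling guidance mechanism to reduce hallucinations.
Result: On synthetic and real-world degradations, the method matches or outperforms state-of-the-art methods at lower complexity.
Insight: The adapter design makes pretrained diffusion models more flexible and applicable to multiple tasks, showing the potential of extending single-task models.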
Abstract: This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.
[148] FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data
Bing Han,Chen Zhu,Dong Han,Rui Yu,Songliang Cao,Jianhui Wu,Scott Chapman,Zijian Wang,Bangyou Zheng,Wei Guo,Marie Weiss,Benoit de Solan,Andreas Hund,Lukas Roth,Kirchgessner Norbert,Andrea Visioni,Yufeng Ge,Wenjuan Li,Alexis Comar,Dong Jiang,Dejun Han,Fred Baret,Yanfeng Ding,Hao Lu,Shouyang Liu
Main category: cs.CV
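TL;DR: FoMo4Wheat is a vision foundation model pretrained with self-supervision on wheat imagery. Trained on the global ImAg4Wheat dataset (2.5 million high-resolution wheat images), it performs strongly on vision tasks for wheat as well as other crops and weeds.
Details
Motivation: Agricultural vision models built on general-domain pretrained backbones generalize poorly and struggle to adapt to variable field conditions and crop structures. Contribution: FoMo4Wheat, one of the first crop-domain vision foundation models, and the release of ImAg4Wheat, the largest and most diverse wheat image dataset to date.
Method: The model is pretrained with self-supervised learning on ImAg4Wheat to produce robust, transferable visual representations.
Result: Across ten in-field vision tasks, FoMo4Wheat outperforms models pretrained on general-domain data, demonstrating the value of crop-specific foundation models.
Insight: Crop-specific pretraining can markedly improve agricultural vision performance and points toward a universal crop foundation model with cross-species and cross-task capabilities.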
Abstract: Vision-driven field monitoring is central to digital agriculture, yet models built on general-domain pretrained backbones often fail to generalize across tasks, owing to the interaction of fine, variable canopy structures with fluctuating field conditions. We present FoMo4Wheat, one of the first crop-domain vision foundation models, pretrained with self-supervision on ImAg4Wheat, the largest and most diverse wheat image dataset to date (2.5 million high-resolution images collected over a decade at 30 global sites, spanning >2,000 genotypes and >500 environmental conditions). This wheat-specific pretraining yields representations that are robust for wheat and transferable to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat models consistently outperform state-of-the-art models pretrained on general-domain datasets. These results demonstrate the value of crop-specific foundation models for reliable in-field perception and chart a path toward a universal crop foundation model with cross-species and cross-task capabilities. FoMo4Wheat models and the ImAg4Wheat dataset are publicly available online: https://github.com/PheniX-Lab/FoMo4Wheat and https://huggingface.co/PheniX-Lab/FoMo4Wheat. The demonstration website is: https://fomo4wheat.phenix-lab.com/.
[149] H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
Wenhao Li,Mengyuan Liu,Hong Liu,Pichao Wang,Shijian Lu,Nicu Sebe
Main category: cs.CV
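TL;DR: H$_{2}$OT is a hierarchical hourglass tokenizer framework for efficient video-based 3D human pose estimation that cuts computational cost by dynamically pruning and then recovering pose tokens.
Details
Motivation: Existing video pose transformers (VPTs) are computationally expensive and impractical on resource-constrained devices. The authors aim to improve efficiency by reducing the number of tokens from redundant frames. Contribution: The H$_{2}$OT framework, comprising a Token Pruning Module (TPM) and a Token Recovering Module (TRM), which substantially lowers computational cost while maintaining accuracy.
Method: A hierarchical hourglass structure progressively prunes the pose tokens of redundant frames, and the TRM then recovers the full-length sequence.
Result: Experiments on multiple benchmarks confirm both efficiency and accuracy.
Insight: Keeping only the pose tokens of representative frames is enough for efficient and accurate pose estimation.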
Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
eess.IV [Back]
[150] Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance
Mohamed Mohamed,Brennan Nichyporuk,Douglas L. Arnold,Tal Arbel
Main category: eess.IV
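TL;DR: The paper proposes a framework for generating high-resolution 3D counterfactual medical images under language guidance, addressing a gap in 3D medical image generation, and demonstrates it on neuroimaging data.
Details
Motivation: Vision-language models excel at 2D image generation, but the 3D domain lacks pretrained foundation models, which limits progress in high-resolution 3D medical image generation. The paper explores the potential of language-guided 3D counterfactual medical image generation. Contribution: What the authors describe as the first language-guided 3D diffusion framework capable of generating high-quality 3D counterfactual medical images, tailored to the characteristics of neuroimaging data.
Method: State-of-the-art 3D diffusion models are adapted with enhancements from Simple Diffusion, and augmented conditioning improves text alignment and image quality, with a focus on neuroimaging data.
Result: On two neurological MRI datasets, the framework simulates varying lesion loads in Multiple Sclerosis (MS) and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity.
Insight: The method lays the groundwork for prompt-driven disease progression analysis in 3D medical imaging and shows the potential of language-guided generation for medical research and clinical applications.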
Abstract: Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the impressive performance of these models in 2D is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained foundation models do not exist for 3D, significantly limiting progress in this domain. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language descriptions remains completely unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression scenarios, and enhanced medical training by visualizing hypothetical medical conditions in realistic detail. Our work takes a meaningful step toward addressing this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this represents the first demonstration of a language-guided native-3D diffusion model applied specifically to neurological imaging data, where faithful three-dimensional modeling is essential to represent the brain’s three-dimensional structure. Through results on two distinct neurological MRI datasets, our framework successfully simulates varying counterfactual lesion loads in Multiple Sclerosis (MS), and cognitive states in Alzheimer’s disease, generating high-quality images while preserving subject fidelity in synthetically generated medical images. Our results lay the groundwork for prompt-driven disease progression analysis within 3D medical imaging.
[151] FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes
Muraam Abdel-Ghani,Mahmoud Ali,Mohamed Ali,Fatmaelzahraa Ahmed,Mohamed Arsalan,Abdulaziz Al-Ali,Shidin Balakrishnan
Main category: eess.IV
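TL;DR: FASL-Seg is a multi-level feature model that uses low-level and high-level feature projection streams to segment anatomy and tools in surgical scenes, improving semantic segmentation performance.
Details
Motivation: Most deep learning research on surgical training focuses on surgical tools and overlooks anatomy, and existing models struggle to balance high-level and low-level feature capture. Contribution: FASL-Seg, whose two-stream LLFP and HLFP design enables precise segmentation of anatomy and tools across feature resolutions.
Method: A two-stream structure with Low-Level Feature Projection (LLFP) and High-Level Feature Projection (HLFP) captures fine detail and context, respectively, to improve segmentation accuracy.
Result: On EndoVis18, parts and anatomy segmentation reaches 72.71% mIoU, a 5% improvement over SOTA; tool type segmentation reaches 85.61% mIoU on EndoVis18 and 72.78% on EndoVis17.
Insight: The two-stream design effectively balances high-level and low-level feature capture and offers a new approach to surgical scene segmentation.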
Abstract: The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions - enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per-class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
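For readers unfamiliar with the metric reported above, here is a minimal sketch of mean Intersection over Union on flattened label maps. It is a generic definition, not the EndoVis evaluation protocol; benchmarks differ in whether they average per image or over the whole dataset and in how they treat absent classes, so the choices below (skip classes absent from both prediction and target) are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length sequences of integer class labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```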
[152] Contrastive Anatomy-Contrast Disentanglement: A Domain-General MRI Harmonization Method
Daniel Scholz,Ayhan Can Erdur,Robbie Holland,Viktoria Ehm,Jan C. Peeken,Benedikt Wiestler,Daniel Rueckert
Main category: eess.IV
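TL;DR: The paper proposes a novel MRI scanner harmonization method based on contrastive anatomy-contrast disentanglement, using a conditioned diffusion autoencoder with a contrastive loss to make images consistent across scanners.
Details
Motivation: Differences in scanners and acquisition parameters cause inconsistent contrast in MRI images, hurting data comparability and reproducibility. Existing methods require traveling subjects or struggle to generalize to unseen domains. Contribution: A domain-general MRI harmonization method that combines a contrastive loss with domain-agnostic contrast augmentation and harmonizes images across scanners without fine-tuning.
Method: A conditioned diffusion autoencoder is trained with a contrastive loss and domain-agnostic contrast augmentation while preserving subject-specific anatomy.
Result: PSNR improves by 7% on a traveling subjects dataset, and age regression on unseen domains improves by 18%.
Insight: The method improves the generalizability and comparability of MRI images and supports multi-site and longitudinal clinical studies.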
Abstract: Magnetic resonance imaging (MRI) is an invaluable tool for clinical and research applications. Yet, variations in scanners and acquisition parameters cause inconsistencies in image contrast, hindering data comparability and reproducibility across datasets and clinical studies. Existing scanner harmonization methods, designed to address this challenge, face limitations, such as requiring traveling subjects or struggling to generalize to unseen domains. We propose a novel approach using a conditioned diffusion autoencoder with a contrastive loss and domain-agnostic contrast augmentation to harmonize MR images across scanners while preserving subject-specific anatomy. Our method enables brain MRI synthesis from a single reference image. It outperforms baseline techniques, achieving a +7% PSNR improvement on a traveling subjects dataset and +18% improvement on age regression in unseen domains. Our model provides robust, effective harmonization of brain MRIs to target scanners without requiring fine-tuning. This advancement promises to enhance comparability, reproducibility, and generalizability in multi-site and longitudinal clinical studies, ultimately contributing to improved healthcare outcomes.
[153] MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis
Daniel Scholz,Ayhan Can Erdur,Viktoria Ehm,Anke Meyer-Baese,Jan C. Peeken,Daniel Rueckert,Benedikt Wiestler
Main category: eess.IV
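TL;DR: MM-DINOv2 is a novel framework that adapts the DINOv2 vision foundation model to multi-modal medical image analysis. Multi-modal patch embeddings and full-modality masking handle multi-modal data and missing modalities, semi-supervised learning exploits unlabeled data, and the method outperforms existing supervised approaches.
Details
Motivation: Vision foundation models such as DINOv2 perform well on natural images but are limited in multi-modal medical image analysis, where they cannot effectively handle multi-modal data or missing modalities. Supervised models, in turn, struggle to use unlabeled data. Contribution: (1) The MM-DINOv2 framework, which extends DINOv2 to multi-modal medical image analysis; (2) multi-modal patch embeddings and full-modality masking; (3) semi-supervised learning to improve performance.
Method: (1) Multi-modal patch embeddings process multi-modal data; (2) full-modality masking addresses missing modalities; (3) the model is trained with semi-supervised learning.
Result: On glioma subtype classification, the method achieves an MCC of 0.6, 11.1% higher than existing supervised methods.
Insight: MM-DINOv2 shows the potential of vision foundation models for multi-modal medical imaging; by using unlabeled data and handling missing modalities, it offers a robust solution for clinical applications.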
Abstract: Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains. However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology. While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities, a frequent challenge in clinical settings. To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging. Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data. To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships. Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions. Applied to glioma subtype classification from multi-sequence brain MRI, our method achieves a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set, surpassing state-of-the-art supervised approaches by +11.1%. Our work establishes a scalable and robust solution for multi-modal medical imaging tasks, leveraging powerful vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations.
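As a reminder of what the reported metric measures, the following sketch computes the Matthews Correlation Coefficient for the binary case. The abstract does not say whether the glioma subtype task is binary or multi-class, so the binary formulation and the convention of returning 0.0 for a degenerate confusion matrix are assumptions made here for illustration.

```python
import math


def binary_mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for 0/1 labels; 0.0 if the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when classes are imbalanced, which is common in clinical datasets.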
cs.LG [Back]
[154] Outcome-based Exploration for LLM Reasoning
Yuda Song,Julia Kempe,Remi Munos
Main category: cs.LG
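TL;DR: The paper studies the loss of diversity that arises when outcome-based reinforcement learning (RL) is used to improve the reasoning of large language models (LLMs), and proposes two exploration algorithms that balance accuracy and diversity.
Details
Motivation: Outcome-based RL substantially improves LLM accuracy but reduces generation diversity, which hurts test-time scaling in real-world deployment. Contribution: (i) An analysis of the systematic loss of diversity during RL post-training; (ii) two outcome-based exploration algorithms (historical exploration and batch exploration) that address the diversity problem; (iii) theoretical support through a new model of outcome-based bandits.
Method: Two complementary algorithms are proposed: (i) historical exploration, which rewards rarely observed answers with UCB-style bonuses; (ii) batch exploration, which penalizes within-batch repetition to promote diversity. Experiments use Llama and Qwen models on mathematical reasoning tasks.
Result: Both methods improve accuracy while mitigating diversity collapse.
Insight: Diversity in the outcome space matters, and the paper offers a practical RL approach that improves reasoning without sacrificing diversity.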
Abstract: Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
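The two exploration signals described in the abstract can be sketched roughly as follows. The abstract does not give the exact formulas, so the inverse-square-root count bonus, the repetition-fraction penalty, and the names `HistoricalBonus`, `batch_penalties`, and `scale` are all assumptions for illustration rather than the authors' definitions.

```python
import math
from collections import Counter


class HistoricalBonus:
    """UCB-style bonus that shrinks as a (question, final answer) pair is seen more often."""

    def __init__(self, scale=1.0):
        self.scale = scale          # hypothetical bonus coefficient
        self.counts = Counter()     # visit counts over the whole training history

    def __call__(self, question_id, answer):
        key = (question_id, answer)
        bonus = self.scale / math.sqrt(self.counts[key] + 1)
        self.counts[key] += 1
        return bonus


def batch_penalties(answers, scale=1.0):
    """Penalty per sample, proportional to how many other samples in the batch share its answer."""
    counts = Counter(answers)
    others = max(len(answers) - 1, 1)
    return [-scale * (counts[a] - 1) / others for a in answers]
```

In a training loop, either term would be added to the correctness reward of each sampled response before the policy update. Because the bonuses depend only on final answers, they rely on the tractability of the outcome space that the abstract highlights.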
[155] Feed Two Birds with One Scone: Exploiting Function-Space Regularization for Both OOD Robustness and ID Fine-Tuning Performance
Xiang Yuan,Jun Shu,Deyu Meng,Zongben Xu
Main category: cs.LG
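TL;DR: The paper proposes a new regularization method that constrains the function-space distance between the fine-tuned and pre-trained models using simulated OOD samples, preserving the pre-trained model's OOD robustness, and adds a consistency regularization for further gains. Experiments show it outperforms existing methods across a variety of CLIP backbones.
Details
Motivation: Existing robust fine-tuning methods mainly preserve pre-trained weights, features, or logits, but they do not improve OOD robustness on every model architecture. The paper addresses this by optimizing directly in function space to better preserve OOD robustness. Contribution: (1) A novel function-space regularization that directly targets prediction stability on OOD samples; (2) an additional consistency regularization that further strengthens the OOD robustness of the fine-tuned model.
Method: (1) Constrain the distance between the fine-tuned and pre-trained models in function space; (2) use simulated OOD samples when computing this constraint; (3) apply consistency regularization to stabilize predictions on perturbed samples.
Result: Across a variety of CLIP backbones, the method outperforms existing regularization-based methods in both ID fine-tuning performance and OOD robustness.
Insight: Optimizing directly in function space, rather than in weight or feature space, preserves the pre-trained model's OOD robustness more effectively while also improving downstream task performance.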
Abstract: Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining the out-of-distribution (OOD) robustness of a pre-trained model when transferring it to a downstream task. To this end, most robust fine-tuning methods aim to preserve the pretrained weights, features, or logits. However, we find that these methods cannot always improve OOD robustness for different model architectures. This is because OOD robustness requires the model function to produce stable predictions for the input information of downstream tasks, while existing methods might serve as a poor proxy for optimization in the function space. Based on this finding, we propose a novel regularization that constrains the distance between the fine-tuned and pre-trained models in the function space using simulated OOD samples, aiming to preserve the OOD robustness of the pre-trained model. Besides, to further enhance the OOD robustness of the fine-tuned model, we introduce an additional consistency regularization to promote stable predictions on perturbed samples. Extensive experiments demonstrate that our approach consistently improves both downstream task ID fine-tuning performance and OOD robustness across a variety of CLIP backbones, outperforming existing regularization-based robust fine-tuning methods.
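One way to picture a function-space constraint is as a divergence between the two models' predictions on the same simulated OOD inputs. The sketch below uses the KL divergence between softmax outputs; the abstract does not specify the distance or how OOD samples are simulated, so the choice of KL and the names `function_space_penalty`, `pretrained_logits`, and `finetuned_logits` are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def function_space_penalty(pretrained_logits, finetuned_logits):
    """Mean KL between frozen and fine-tuned predictions over a batch of simulated OOD inputs."""
    total = 0.0
    for lp, lf in zip(pretrained_logits, finetuned_logits):
        total += kl_divergence(softmax(lp), softmax(lf))
    return total / len(pretrained_logits)
```

During fine-tuning such a penalty would be added, with some weight, to the usual task loss on in-distribution data. It is zero when the two models agree on the OOD batch, regardless of how far the weights have moved.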
[156] ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization
Sadegh Jafari,Aishwarya Sarkar,Mohiuddin Bilwal,Ali Jannesari
Main category: cs.LG
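TL;DR: ProfilingAgent is a profiling-guided approach in which an LLM automatically designs pruning and quantization strategies, using the static and dynamic characteristics of each layer to reduce compute and memory overhead while preserving accuracy.
Details
Motivation: Foundation models face compute and memory bottlenecks on resource-constrained platforms. Conventional compression methods ignore architectural and runtime heterogeneity and do not consider dynamic metrics, so a more adaptive optimization approach is needed. Contribution: (1) ProfilingAgent, a modular multi-agent system that uses LLMs to design pruning and quantization strategies automatically; (2) a combined analysis of static and dynamic signals to tailor decisions to each layer; (3) validation on multiple datasets and models.
Method: (1) Use an LLM to reason over per-layer latency, memory, and compute cost; (2) combine static metrics (such as MACs and parameter counts) with dynamic signals (such as latency and memory) to design pruning and quantization strategies; (3) run the optimization through a modular multi-agent system.
Result: (1) Pruning costs about 1% accuracy on ImageNet-1K and gains 2% for ViT-B/16 on smaller datasets; (2) quantization saves up to 74% of memory with less than 0.5% accuracy loss; (3) inference is up to 1.74 times faster.
Insight: (1) The LLM's reasoning quality is critical to pruning results; (2) layer-specific optimization beats uniform heuristics; (3) agentic systems scale well and are suited to model optimization.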
Abstract: Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show pruning maintains competitive or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74 times faster. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.
[157] Performance of Conformal Prediction in Capturing Aleatoric Uncertainty
Misgina Tsighe Hagos,Claes Lundström
Main category: cs.LG
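TL;DR: The study evaluates how well conformal prediction captures aleatoric uncertainty (the uncertainty inherent in the data) and finds that prediction set size correlates only weakly with the number of labels assigned by human annotators, suggesting limited capability in this respect.
Details
Motivation: Conformal prediction is a model-agnostic method whose prediction set size is expected to reflect data uncertainty, but this has not been validated. The study fills that gap by examining whether it actually captures inherent ambiguity in data, such as class overlap. Contribution: (1) An empirical evaluation, missing from the literature so far, of how effectively conformal prediction quantifies data uncertainty; (2) validation of the correlation between prediction sets and actual ambiguity using multi-annotator human labels; (3) experiments with three conformal prediction approaches and eight deep learning models across four datasets.
Method: (1) Generate prediction sets with conformal prediction; (2) compute the correlation between prediction set size and the number of distinct labels assigned by human annotators; (3) compare prediction sets with the human annotations. The experiments use four datasets in which each instance is annotated by five to fifty people.
Result: Most conformal prediction outputs correlate very weakly to weakly with human annotations, and only a few show moderate correlation. Conformal prediction covers the true class but is of limited use for reflecting data uncertainty.
Insight: (1) Prediction set size cannot be equated directly with data uncertainty; (2) the use cases and limitations of conformal prediction need to be reassessed; (3) future work may need to combine it with other methods to quantify uncertainty more accurately.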
Abstract: Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature presents that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We perform this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty remains limited.
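To make "prediction set size" concrete, here is a minimal sketch of split conformal prediction for classification. It uses one common nonconformity score, one minus the probability assigned to the true class. The abstract does not name the three approaches the authors evaluate, so this particular score and the function names are assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set for target coverage 1 - alpha."""
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile rank
    if rank > n:
        return 1.0  # too few calibration points: fall back to including every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
cal_labels = [0, 0, 0, 0]
tau = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
confident = prediction_set([0.9, 0.1], tau)   # a single class
ambiguous = prediction_set([0.5, 0.5], tau)   # both classes
```

The study's question is whether the size of such sets tracks the number of distinct labels that human annotators assign to the same instance.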
[158] Tackling Device Data Distribution Real-time Shift via Prototype-based Parameter Editing
Zheqi Lv,Wenqiao Zhang,Kairui Fu,Qi Tian,Shengyu Zhang,Jiajie Su,Jingyuan Chen,Kun Kuang,Fei Wu
Main category: cs.LG
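TL;DR: The paper proposes Persona, a prototype-based parameter editing framework that handles real-time data distribution shift on devices without post-deployment retraining.
Details
Motivation: Current research relies mostly on data-intensive, computationally expensive fine-tuning, while the challenge that real-time on-device distribution shift poses to the generalization of lightweight models is often overlooked. Contribution: (1) Persona, a personalized method that needs no retraining, built on a prototype-based parameter editing framework; (2) a neural adapter in the cloud that generates a parameter editing matrix to adjust prototype models dynamically; (3) cross-layer knowledge transfer that keeps multi-layer parameter changes consistent and context-aware.
Method: (1) A neural adapter in the cloud generates a parameter editing matrix; (2) the matrix is used to configure prototype models dynamically; (3) cross-layer knowledge transfer refines prototype assignment.
Result: Experiments on vision and recommendation tasks across multiple datasets show that Persona is effective and general.
Insight: (1) A prototype-based parameter editing framework can cope with real-time data distribution shift; (2) cross-layer knowledge transfer is important for keeping parameter changes consistent.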
Abstract: Real-time data distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona’s effectiveness and generality.
cs.CY [Back]
[159] Authorship Without Writing: Large Language Models and the Senior Author Analogy
Clint Hurshman,Sebastian Porsdam Mann,Julian Savulescu,Brian D. Earp
Main category: cs.CY
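TL;DR: The paper examines the authorship controversy around the use of large language models (LLMs) in scientific and medical writing. It argues that, under specific conditions, LLM use is analogous to a form of senior authorship, and that either its legitimacy should be recognized or current authorship criteria should be revised.
Details
Motivation: The use of LLMs in bioethical, scientific, and medical writing has raised disputes about authorship. Although it is widely held that LLMs themselves cannot be authors, there is no consensus on whether humans who use LLMs can count as authors. The paper approaches the question through an analogy with the role of senior authors. Contribution: The argument that LLM use, under specific conditions, is analogous to a form of senior authorship and is legitimate under current authorship criteria.
Method: By analogy with the senior author's role (such as determining the scope of a project and vouching for its integrity), the paper argues that LLM use can be regarded as a legitimate form of authorship.
Result: The paper concludes that either such LLM use should be recognized as a legitimate form of authorship or current authorship criteria need fundamental revision.
Insight: Authorship criteria in research need to adapt to new technology, especially the widespread use of LLMs, and traditional standards may need rethinking.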
Abstract: The use of large language models (LLMs) in bioethical, scientific, and medical writing remains controversial. While there is broad agreement in some circles that LLMs cannot count as authors, there is no consensus about whether and how humans using LLMs can count as authors. In many fields, authorship is distributed among large teams of researchers, some of whom, including paradigmatic senior authors who guide and determine the scope of a project and ultimately vouch for its integrity, may not write a single word. In this paper, we argue that LLM use (under specific conditions) is analogous to a form of senior authorship. On this view, the use of LLMs, even to generate complete drafts of research papers, can be considered a legitimate form of authorship according to the accepted criteria in many fields. We conclude that either such use should be recognized as legitimate, or current criteria for authorship require fundamental revision. AI use declaration: GPT-5 was used to help format Box 1. AI was not used for any other part of the preparation or writing of this manuscript.
cs.GR [Back]
[160] From Skin to Skeleton: Towards Biomechanically Accurate 3D Digital Humans
Marilyn Keller,Keenon Werling,Soyong Shin,Scott Delp,Sergi Pujades,C. Karen Liu,Michael J. Black
Main category: cs.GR
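TL;DR: The paper proposes SKEL, which adds a biomechanically accurate skeleton to the SMPL body model, addressing the inaccurate joint locations and articulations that limit existing models in biomechanics applications.
Details
Motivation: Existing body models such as SMPL are convenient for pose and shape estimation, but their simplified kinematic structure does not match the joint locations and articulations of the real skeletal system, limiting their use in biomechanics. Contribution: SKEL, which introduces a biomechanically accurate skeleton into SMPL, built through optimized data and a learned regressor.
Method: (1) Build a dataset of biomechanically accurate skeletons inside SMPL meshes; (2) learn a regressor from SMPL mesh vertices to the optimized joint locations and bone rotations; (3) re-parametrize the SMPL mesh with the new kinematic parameters.
Result: SKEL has more biomechanically accurate joint locations than SMPL, and its bones fit inside the body surface better. Fitting SKEL to SMPL meshes also upgrades existing datasets with biomechanical parameters.
Insight: SKEL provides a tool for biomechanics "in the wild" and gives vision and graphics researchers a more realistic model of human articulation.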
Abstract: Great progress has been made in estimating 3D human pose and shape from images and video by training neural networks to directly regress the parameters of parametric human models like SMPL. However, existing body models have simplified kinematic structures that do not correspond to the true joint locations and articulations in the human skeletal system, limiting their potential use in biomechanics. On the other hand, methods for estimating biomechanically accurate skeletal motion typically rely on complex motion capture systems and expensive optimization methods. What is needed is a parametric 3D human model with a biomechanically accurate skeletal structure that can be easily posed. To that end, we develop SKEL, which re-rigs the SMPL body model with a biomechanics skeleton. To enable this, we need training data of skeletons inside SMPL meshes in diverse poses. We build such a dataset by optimizing biomechanically accurate skeletons inside SMPL meshes from AMASS sequences. We then learn a regressor from SMPL mesh vertices to the optimized joint locations and bone rotations. Finally, we re-parametrize the SMPL mesh with the new kinematic parameters. The resulting SKEL model is animatable like SMPL but with fewer, and biomechanically-realistic, degrees of freedom. We show that SKEL has more biomechanically accurate joint locations than SMPL, and the bones fit inside the body surface better than previous methods. By fitting SKEL to SMPL meshes we are able to “upgrade” existing human pose and shape datasets to include biomechanical parameters. SKEL provides a new tool to enable biomechanics in the wild, while also providing vision and graphics researchers with a better constrained and more realistic model of human articulation. The model, code, and data are available for research at https://skel.is.tue.mpg.de.
cs.SD [Back]
[161] TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
Minh N. H. Nguyen,Anh Nguyen Tran,Dung Truong Dinh,Nam Van Vo
Main category: cs.SD
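TL;DR: The paper proposes TSPC, a two-stage phoneme-centric model for Vietnamese-English code-switching speech recognition. Using an extended Vietnamese phoneme set as the intermediate representation, it achieves a word error rate of 20.8% with reduced training resources.
Details
Motivation: Code-switching (CS) is a challenge for general speech recognition systems, especially for language pairs such as Vietnamese and English, and existing methods fail to capture subtle phoneme-level shifts. Contribution: TSPC, whose phoneme-centric approach and two-stage architecture improve Vietnamese-English CS speech recognition while reducing training cost.
Method: A two-stage architecture: the first stage builds on an extended Vietnamese phoneme set, and the second strengthens mixed-lingual modeling through phoneme adaptation and language conversion.
Result: TSPC outperforms baseline models on Vietnamese-English CS ASR, reaching a word error rate of 20.8%.
Insight: A phoneme-centric intermediate representation combined with a two-stage architecture is an efficient approach to mixed-language speech recognition.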
Abstract: Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 20.8% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.
cs.AI [Back]
[162] Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
James Mooney,Josef Woldense,Zheng Robert Jia,Shirley Anugrah Hayati,My Ha Nguyen,Vipul Raheja,Dongyeop Kang
Main category: cs.AI
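TL;DR: The paper asks whether large language model (LLM) agents behave consistently across different experimental settings. It finds that, although the models can produce human-like responses, they lack internal consistency and cannot fully substitute for human participants in research.
Details
Motivation: As LLM capabilities have improved, researchers have hoped to use synthetic agents as substitutes for participants in studies of human behavior. Existing work, however, focuses on whether model-generated data matches human data and neglects whether agents behave consistently across settings. The paper aims to fill this gap. Contribution: A study design that reveals an agent's internal state and tests its behavioral consistency. The study finds significant internal inconsistencies in LLMs across experimental settings, exposing their limits for human-subject research.
Method: The study reveals the agent's internal state and examines its behavior in a basic dialogue setting, then tests a set of behavioral hypotheses to assess consistency across settings. LLMs from different model families and of different sizes are analyzed.
Result: LLM responses may match those of humans, but the agents are not internally consistent and cannot fully substitute for human participants.
Insight: The ability of LLMs to generate matching responses is decoupled from their internal consistency, which points to directions for future work and cautions against relying on synthetic agents in human-subject research.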
Abstract: The impressive capabilities of Large Language Models (LLMs) have fueled the notion that synthetic agents can serve as substitutes for real participants in human-subject research. In an effort to evaluate the merits of this claim, social science researchers have largely focused on whether LLM-generated survey data corresponds to that of a human counterpart whom the LLM is prompted to represent. In contrast, we address a more fundamental question: Do agents maintain internal consistency, retaining similar behaviors when examined under different experimental settings? To this end, we develop a study designed to (a) reveal the agent’s internal state and (b) examine agent behavior in a basic dialogue setting. This design enables us to explore a set of behavioral hypotheses to assess whether an agent’s conversation behavior is consistent with what we would expect from their revealed internal state. Our findings on these hypotheses show significant internal inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be internally consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research. Our simulation code and data are publicly accessible.
[163] Reverse-Engineered Reasoning for Open-Ended Generation
Haozhe Wang,Haoran Que,Qixin Xu,Minghao Liu,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Wei Ye,Tong Yang,Wenhao Huang,Ge Zhang,Fangzhen Lin
Main category: cs.AI
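TL;DR: Proposes REER, a new paradigm that reverse-engineers the latent step-by-step reasoning process from known-good solutions to enable deep reasoning in open-ended generation, and open-sources the DeepWriting-20K dataset.
Details
Motivation: Current reinforcement learning and instruction distillation methods are limited in open-ended generation: RL lacks clear reward signals, and distillation depends on expensive, high-quality teacher models.
Contribution: Proposes REER, which derives the reasoning process by working backwards; open-sources the DeepWriting-20K dataset; trains DeepWriter-8B, which surpasses open-source baselines and is competitive with GPT-4o and Claude 3.5.
Method: REER works backwards from known solutions to discover the latent reasoning steps, avoiding the trial-and-error or imitation learning of conventional methods, using a scalable, gradient-free approach.
Result: DeepWriter-8B outperforms open-source baselines on open-ended tasks and is competitive with, and at times superior to, GPT-4o and Claude 3.5.
Insight: Reverse engineering offers a new way to train reasoning for open-ended generation that reduces dependence on reward signals or teacher models.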
Abstract: While the "deep reasoning" paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning -- reinforcement learning (RL) and instruction distillation -- falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process "forwards" through trial-and-error or imitation, REER works "backwards" from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
[164] From Long to Short: LLMs Excel at Trimming Own Reasoning Chains
Wei Han,Geng Zhan,Sicheng Yu,Chenyu Wang,Bryan Hooi
Main category: cs.AI
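TL;DR: Proposes EDIT, a dynamic reasoning-trimming method that addresses the tendency of large reasoning models (LRMs) to overcomplicate simple problems. By balancing correctness and brevity during generation, EDIT substantially improves reasoning efficiency.
Details
Motivation: LRMs excel on complex tasks but tend to "overthink" simple problems, producing long reasoning chains and excessive strategy switching that hurt interpretability and user experience.
Contribution: Proposes EDIT, which trims reasoning paths dynamically and helps LRMs find the shortest correct reasoning path at test time, balancing conciseness and correctness.
Method: EDIT uses constraint-guided generation while jointly tracking length and answer distributions, then selects the response with the best balance. It is validated in extensive experiments across models and datasets.
Result: EDIT substantially improves reasoning efficiency, producing outputs that are both concise and accurate and that improve readability and user experience.
Insight: LRMs struggle to balance generation objectives such as correctness and brevity; EDIT addresses this by adjusting the constraints dynamically.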
Abstract: O1/R1 style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. By applying test-time scaling to generate extended reasoning paths, they establish many SOTAs across a wide range of complex reasoning tasks. However, recent studies show that LRMs are prone to suffer from overthinking – the tendency to overcomplicate simple problems, leading to excessive strategy switching and long, convoluted reasoning traces that hinder their interpretability. To mitigate this issue, we conduct a systematic investigation into the reasoning efficiency of a broad set of LRMs and uncover a common dilemma: the difficulty in balancing multiple generation objectives such as correctness and brevity. Based on this discovery, we propose a test-time scaling method, EDIT (Efficient Dynamic Inference Trimming), which efficiently guides LRMs to identify the shortest correct reasoning paths at test time. EDIT employs constraint-guided generation while jointly tracking length and answer distributions under varying constraints, allowing it to select responses that strike an optimal balance between conciseness and correctness. Extensive experiments across diverse models and datasets show that EDIT substantially enhances the reasoning efficiency, producing compact yet informative outputs that improve readability and user experience.
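The abstract does not spell out EDIT's selection rule, so the following is only a rough sketch of the general idea of tracking answer distributions under varying length constraints. It assumes responses have already been sampled under several token budgets, takes the majority answer under the loosest budget as a proxy for the correct answer, and returns the tightest budget whose samples agree with it. The function name, the agreement threshold, and the proxy itself are all hypothetical, not the authors' algorithm.

```python
from collections import Counter

def pick_shortest_consistent(samples_by_budget: dict[int, list[tuple[str, int]]],
                             min_agreement: float = 0.6):
    """samples_by_budget maps a token budget to sampled (answer, length) pairs.

    Returns (budget, answer, shortest_length) for the tightest budget whose
    majority answer matches the loosest budget's majority answer.
    """
    budgets = sorted(samples_by_budget)
    loosest = samples_by_budget[budgets[-1]]
    # Assumption: the majority answer under the loosest constraint is the reference.
    ref_answer = Counter(a for a, _ in loosest).most_common(1)[0][0]
    for budget in budgets:
        samples = samples_by_budget[budget]
        answer, votes = Counter(a for a, _ in samples).most_common(1)[0]
        if answer == ref_answer and votes / len(samples) >= min_agreement:
            shortest = min(length for a, length in samples if a == answer)
            return budget, answer, shortest
    # No budget reached the agreement threshold: fall back to the loosest one.
    return budgets[-1], ref_answer, min(l for a, l in loosest if a == ref_answer)
```

The threshold trades brevity against confidence: raising `min_agreement` pushes the choice toward looser budgets and longer responses.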
[165] SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Xuan-Phi Nguyen,Shrey Pandit,Revanth Gangi Reddy,Austin Xu,Silvio Savarese,Caiming Xiong,Shafiq Joty
Main category: cs.AI
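TL;DR: Proposes SFR-DeepResearch, an autonomous single-agent model trained with reinforcement learning to decide its actions dynamically in complex reasoning tasks. The model is trained on synthetic data and performs strongly on benchmarks.
Details
Motivation: Multi-agent systems for complex tasks such as deep research rely on static workflows and manual direction, whereas an autonomous single agent decides its next action dynamically from context, which suits complex reasoning tasks better.
Contribution: 1) An autonomous single-agent model for deep research; 2) an RL training recipe based on synthetic data; 3) validation on public benchmarks (up to 28.7% on Humanity's Last Exam).
Method: 1) Reinforcement learning (RL) to strengthen reasoning and agentic skills; 2) training entirely on synthetic data; 3) dynamic decision-making in place of a static workflow.
Result: The best variant, SFR-DR-20B, scores up to 28.7% on the Humanity's Last Exam benchmark.
Insight: Dynamic decision-making lets an autonomous single agent handle complex tasks without a static multi-agent workflow, and the synthetic-data training recipe may transfer to other tasks.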
Abstract: Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.
[166] Reinforcement Learning Foundations for Deep Research Systems: A Survey
Wenjun Li,Zhi Chen,Jingru Lin,Hannan Cao,Wei Han,Sheng Liang,Zhi Zhang,Kuicai Dong,Dexun Li,Chen Zhang,Yong Liu
Main category: cs.AI
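TL;DR: A survey of the reinforcement learning foundations of deep research systems. It is the first to systematize work after DeepSeek-R1, covering data synthesis, RL methods, and training systems.
Details
Motivation: Current training methods for deep research systems, such as SFT and DPO, suffer from imitation bias and other limitations, whereas reinforcement learning better supports complex closed-loop tasks and tool interaction.
Contribution: The first survey dedicated to the RL foundations of deep research systems, systematizing the methods and practices of the related work.
Method: The survey is organized along three axes: data synthesis and curation, RL methods (including stability and sample efficiency), and training systems.
Result: Summarizes the advantages and recurring patterns of RL in deep research systems and offers practical guidance for training robust agents.
Insight: RL reduces dependence on human priors and rater biases while better supporting multi-objective optimization and long-horizon tasks.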
Abstract: Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.
[167] VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction
Jie Yang,Jiajun Chen,Zhangyue Yin,Shuo Chen,Yuxin Wang,Yiran Guo,Yuan Li,Yining Zheng,Xuanjing Huang,Xipeng Qiu
Main category: cs.AI
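TL;DR: Introduces VehicleWorld, a highly integrated multi-device environment for intelligent vehicle interaction, and proposes State-based Function Call (SFC), which significantly outperforms traditional function calling.
Details
Motivation: Intelligent vehicle cockpits require coordination across tightly coupled subsystems; traditional function calling is inefficient and has limited error recovery.
Contribution: Introduces the VehicleWorld environment, with 30 modules and 250 APIs, which provides real-time state information; proposes SFC, which improves performance through direct state prediction.
Method: State-based Function Call (SFC) maintains explicit awareness of the system state and implements direct state transitions.
Result: Experiments show that SFC significantly outperforms traditional FC in both accuracy and latency.
Insight: Direct state prediction is more effective than function calling for environmental control.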
Abstract: Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly-coupled subsystems that exceed typical task environments’ complexity. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we discovered that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on Github https://github.com/OpenMOSS/VehicleWorld.
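One way to picture the contrast between exploratory function calls and direct state transitions is the toy sketch below: the agent emits a target state, and the environment applies only the properties that differ from the current state. The property names and both helpers are hypothetical illustrations and are not part of the VehicleWorld API.

```python
def state_diff(current: dict, target: dict) -> dict:
    """Properties whose target value differs from the current state."""
    return {key: value for key, value in target.items() if current.get(key) != value}

def apply_transitions(current: dict, target: dict) -> tuple[dict, dict]:
    """Return the new state and the changes that were applied.

    The input state is left unmodified, so a caller can compare before and after.
    """
    changes = state_diff(current, target)
    return {**current, **changes}, changes

# Hypothetical cockpit properties, for illustration only.
cockpit = {"ac.power": False, "ac.temperature_c": 24, "window.driver.open_pct": 0}
target = {"ac.power": True, "ac.temperature_c": 21, "window.driver.open_pct": 0}
new_state, changes = apply_transitions(cockpit, target)
```

Here `changes` contains only the two air-conditioning properties, since the window setting already matches the target.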
[168] RAFFLES: Reasoning-based Attribution of Faults for LLM Systems
Chenyang Zhu,Spencer Hong,Jingyu Wu,Kushal Chawla,Charlotte Tang,Youbing Yin,Nathan Wolfe,Erin Babinsky,Daben Liu
Main category: cs.AI
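TL;DR: RAFFLES is an evaluation architecture that uses reasoning and iterative refinement to locate the point of failure in multi-component LLM systems, substantially improving fault-detection accuracy.
Details
Motivation: Existing evaluation methods for LLM systems focus on single metrics or end-to-end outcomes and struggle to diagnose why complex systems fail.
Contribution: Proposes the RAFFLES framework, which combines reasoning and iterative refinement to detect and diagnose faults in LLM systems systematically.
Method: A multi-component pipeline with a central Judge and specialized Evaluators that builds a history of hypotheses through iterative reasoning.
Result: On the Who&When dataset, RAFFLES clearly outperforms the baselines, with agent-step fault pair accuracy of over 43% on the Algorithmically-Generated subset and over 20% on the Hand-Crafted subset.
Insight: An automated fault-detection framework is a step towards replacing labor-intensive manual review and supports the development and improvement of LLM systems.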
Abstract: We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is incredibly tricky to identify where these systems break down and why. Evaluation capabilities that currently exist today (e.g., single pass LLM-as-a-judge) are limited in that they often focus on individual metrics or capabilities, end-to-end outcomes, and are narrowly grounded on the preferences of humans. We argue that to match the agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES - an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system’s components but also the quality of the reasoning by the Judge itself, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the Who&When dataset, a benchmark designed to diagnose the “who” (agent) and “when” (step) of a system’s failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase from the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor-intensive manual human review.
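The reported metric can be read as exact-match accuracy over (agent, step) pairs. The sketch below assumes that a prediction counts as correct only when both the blamed agent and the failure step match the label; the function name and the sample values are illustrative, and the benchmark's official scorer may differ in details.

```python
def agent_step_accuracy(predictions: list[tuple[str, int]],
                        labels: list[tuple[str, int]]) -> float:
    """Fraction of cases where both the agent and the step are identified."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    hits = sum(1 for pred, gold in zip(predictions, labels) if pred == gold)
    return hits / len(labels)

# Made-up (agent, step) pairs, for illustration only.
preds = [("planner", 3), ("coder", 5), ("coder", 2)]
golds = [("planner", 3), ("coder", 4), ("tester", 2)]
score = agent_step_accuracy(preds, golds)
```

Only the first prediction matches on both fields, so `score` is 1/3. Getting the agent right with the wrong step, or the step right with the wrong agent, earns no credit, which helps explain why pair accuracies are low in absolute terms.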
[169] Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
James Xu Zhao,Bryan Hooi,See-Kiong Ng
Main category: cs.AI
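TL;DR: Test-time scaling of reasoning chains performs well in other domains but does not reliably improve factual accuracy on knowledge-intensive tasks, and it can increase hallucinations.
Details
Motivation: Test-time scaling increases inference compute by generating long reasoning chains and has performed well across many domains. The authors question its effectiveness on knowledge-intensive tasks, which demand high factual accuracy and low hallucination rates.
Contribution: A comprehensive evaluation of 12 reasoning models on two knowledge-intensive benchmarks, showing that test-time scaling does not consistently improve accuracy and can even increase hallucinations, together with an analysis of how extended reasoning affects hallucination behavior.
Method: Runs test-time scaling experiments with 12 reasoning models on two knowledge-intensive benchmarks, measures factual accuracy and hallucination rate, and uses case studies to examine how extended reasoning changes model behavior.
Result: Test-time scaling does not consistently improve accuracy on knowledge-intensive tasks and in many cases increases hallucinations. Where hallucinations decrease, this is often because the model chooses to abstain rather than because factual recall improves.
Insight: Extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Even so, enabling thinking remains beneficial compared with not thinking at all.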
Abstract: Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge
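The distinction between fewer hallucinations and better recall is easier to see when each response is placed in one of three buckets. The sketch below assumes every response has already been labelled `correct`, `hallucinated` (an attempted but wrong answer), or `abstained`; the label names and the toy runs are assumptions, not the paper's data.

```python
from collections import Counter

LABELS = ("correct", "hallucinated", "abstained")

def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """Share of responses in each outcome bucket."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    return {label: counts[label] / len(outcomes) for label in LABELS}

# Toy runs: the longer-reasoning run hallucinates less only because it abstains more.
short_run = ["correct", "hallucinated", "hallucinated", "abstained"]
long_run = ["correct", "hallucinated", "abstained", "abstained"]
short_rates = outcome_rates(short_run)
long_rates = outcome_rates(long_run)
```

In this toy case the hallucination rate falls from 0.5 to 0.25 while accuracy stays at 0.25, which is the pattern the abstract attributes to abstention rather than improved recall.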
[170] From Image Generation to Infrastructure Design: a Multi-agent Pipeline for Street Design Generation
Chenguang Wang,Xiang Yan,Yilong Dai,Ziyi Wang,Susu Xu
Main category: cs.AI
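TL;DR: Proposes a multi-agent system that edits and redesigns bicycle facilities on real street-view images. By integrating lane localization, prompt optimization, design generation, and automated evaluation, the system produces visually coherent designs that follow instructions.
Details
Motivation: Traditional ways of producing street-design scenarios are labor-intensive, which hinders public participation in transportation planning and collaborative decision-making. AI-assisted generative design is promising, but existing methods typically need large amounts of domain-specific training data and struggle to make precise spatial changes in complex street scenes.
Contribution: A multi-agent framework that generates and edits bicycle facility designs directly on real street-view images, adapts to different road geometries and environmental conditions, and produces visually coherent, instruction-compliant designs.
Method: Four main steps: lane localization, prompt optimization, design generation, and automated evaluation. The steps work together in a multi-agent system to produce contextually appropriate designs.
Result: Experiments show that the system adapts to diverse urban scenarios and produces visually coherent, instruction-compliant designs, opening new possibilities for transportation infrastructure planning and facility design.
Insight: Multi-agent systems show promise for transportation infrastructure planning: AI-assisted design generation can reduce manual effort and improve design efficiency and collaboration.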
Abstract: Realistic visual renderings of street-design scenarios are essential for public engagement in active transportation planning. Traditional approaches are labor-intensive, hindering collective deliberation and collaborative decision-making. While AI-assisted generative design shows transformative potential by enabling rapid creation of design scenarios, existing generative approaches typically require large amounts of domain-specific training data and struggle to enable precise spatial variations of design/configuration in complex street-view scenes. We introduce a multi-agent system that edits and redesigns bicycle facilities directly on real-world street-view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs. Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction-compliant results. This work establishes a foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design.
[171] Towards Meta-Cognitive Knowledge Editing for Multimodal LLMs
Zhaoyu Fan,Kaihang Pan,Mingze Zhou,Bosheng Qin,Juncheng Li,Shengyu Zhang,Wenqiao Zhang,Siliang Tang,Fei Wu,Yueting Zhuang
Main category: cs.AI
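TL;DR: Introduces CogEdit, a benchmark for evaluating the meta-cognitive knowledge editing abilities of multimodal large language models (MLLMs), and proposes the MIND framework to improve performance.
Details
Motivation: Existing knowledge editing benchmarks focus on cognitive-level modifications and neglect meta-cognitive processes such as self-awareness of knowledge and boundary constraints. The paper proposes a more comprehensive evaluation framework to address this.
Contribution: 1) The CogEdit benchmark, covering three types of meta-cognitive knowledge editing tasks; 2) the MIND framework, which integrates a meta-knowledge memory with game-theoretic methods to improve editing performance.
Method: MIND builds a meta-knowledge memory to strengthen self-awareness, uses game-theoretic interactions to monitor knowledge activation, and adds label refinement for robustness to noise.
Result: MIND significantly outperforms existing methods on both traditional and meta-cognitive knowledge editing benchmarks.
Insight: Meta-cognitive knowledge editing is an important research direction for MLLMs, and future work may involve more complex dynamic knowledge updating and interaction mechanisms.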
Abstract: Knowledge editing enables multimodal large language models (MLLMs) to efficiently update outdated or incorrect information. However, existing benchmarks primarily emphasize cognitive-level modifications while lacking a focus on deeper meta-cognitive processes. To bridge this gap, we introduce CogEdit, a novel benchmark designed to evaluate MLLMs’ meta-cognitive knowledge editing abilities across three levels: (1) Counterfactual-Driven Editing, assessing self-awareness of knowledge correctness changes; (2) Boundary Constraint Editing, ensuring appropriate generalization without unintended interference; and (3) Noise-Robust Editing, promoting reflective evaluation of uncertain information. To advance meta-cognitive editing, we propose MIND (Meta-cognitive INtegrated Dynamic Knowledge Editing), a framework that constructs a meta-knowledge memory for self-awareness, employs game-theoretic interactions to monitor knowledge activation, and incorporates label refinement for noise-robust updates. Extensive experiments show that MIND significantly outperforms existing cognitive editing approaches, achieving strong performance on both traditional and meta-cognitive knowledge editing benchmarks.
cs.DB [Back]
[172] Language Native Lightly Structured Databases for Large Language Model Driven Composite Materials Research
Yuze Liu,Zhaoyuan Zhang,Xiangsheng Zeng,Yihe Zhang,Leping Yu,Lejia Wang,Xi Yu
Main category: cs.DB
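TL;DR: Proposes a language-native, lightly structured database to support LLM-driven composite materials research, focused on boron nitride nanosheet (BNNS) polymer thermally conductive composites.
Details
Motivation: Chemical and materials research has traditionally relied on language-based descriptions rather than structured data, which limits what databases and machine learning can exploit. A new approach is needed to turn this linguistic information into usable structured data.
Contribution: 1. A language-native, lightly structured database;
2. A composite retrieval method that combines semantics, keywords, and value filters;
3. Support for Retrieval Augmented Generation (RAG) and tool-augmented agents that produce actionable standard operating procedures (SOPs).
Method: Extracts lightly structured information from papers (preparation, characterization, theory-computation, and mechanistic reasoning) and organizes it in a heterogeneous database. Queries use composite retrieval with semantics, keywords, and value filters.
Result: The system synthesizes literature into accurate, verifiable, expert-style guidance, providing a language-rich foundation for LLM-driven materials discovery.
Insight: A lightly structured database with composite retrieval can bridge the gap between traditional language-based descriptions and the needs of machine learning, offering a new paradigm for materials research.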
Abstract: Chemical and materials research has traditionally relied heavily on knowledge narrative, with progress often driven by language-based descriptions of principles, mechanisms, and experimental experiences, rather than tables, limiting what conventional databases and ML can exploit. We present a language-native database for boron nitride nanosheet (BNNS) polymer thermally conductive composites that captures lightly structured information from papers across preparation, characterization, theory-computation, and mechanistic reasoning, with evidence-linked snippets. Records are organized in a heterogeneous database and queried via composite retrieval with semantics, keywords, and value filters. The system can synthesize literature into accurate, verifiable, and expert-style guidance. This substrate enables high-fidelity, efficient Retrieval-Augmented Generation (RAG) and tool-augmented agents to interleave retrieval with reasoning and deliver actionable SOPs. The framework supplies the language-rich foundation required for LLM-driven materials discovery.
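Composite retrieval of the kind described can be sketched as hard filters (keyword and numeric value) followed by semantic ranking. The record schema, field names, toy two-dimensional embeddings, and conductivity numbers below are all invented for illustration; a real system would use a learned embedding model and the paper's own schema, which the abstract does not specify.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def composite_search(records, query_vec, keyword=None, min_conductivity=None, top_k=3):
    """Filter by keyword and value, then rank the survivors by embedding similarity."""
    hits = []
    for rec in records:
        if keyword and keyword.lower() not in rec["snippet"].lower():
            continue  # keyword filter
        if min_conductivity is not None and rec["thermal_conductivity"] < min_conductivity:
            continue  # value filter
        hits.append((cosine(query_vec, rec["embedding"]), rec["id"]))
    hits.sort(reverse=True)  # highest similarity first
    return [rec_id for _, rec_id in hits[:top_k]]

# Made-up records; the numbers are placeholders, not measured values.
records = [
    {"id": "r1", "snippet": "BNNS/epoxy composite prepared by hot pressing",
     "thermal_conductivity": 2.0, "embedding": [1.0, 0.0]},
    {"id": "r2", "snippet": "BNNS/PVA film cast from aqueous dispersion",
     "thermal_conductivity": 5.0, "embedding": [0.6, 0.8]},
    {"id": "r3", "snippet": "Graphene/epoxy composite",
     "thermal_conductivity": 6.0, "embedding": [0.0, 1.0]},
]
```

Applying the filters before ranking keeps records that fail a hard constraint out of the results regardless of how close their embedding is to the query.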
cs.RO [Back]
[173] LocoMamba: Vision-Driven Locomotion via End-to-End Deep Reinforcement Learning with Mamba
Yinuo Wang,Gavin Tao
Main category: cs.RO
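TL;DR: LocoMamba is the first vision-driven, cross-modal DRL framework built on the selective state-space model Mamba. Efficient sequence modeling and long-range dependency capture enable efficient training of locomotion policies in complex environments.
Details
Motivation: Existing DRL frameworks face computational complexity and training-efficiency problems when handling visual input and long sequences, and they cope poorly with locomotion tasks in complex environments.
Contribution: 1) The first Mamba-based, vision-driven cross-modal DRL framework; 2) compact state representations produced by a lightweight CNN and a multilayer perceptron; 3) selective scanning that reduces latency and memory footprint; 4) terrain and appearance randomization plus an obstacle-density curriculum to improve generalization.
Method: 1) Encode visual and proprioceptive states with a CNN and a multilayer perceptron; 2) fuse the resulting tokens efficiently with stacked Mamba layers; 3) train end-to-end with PPO under randomization and curriculum learning; 4) use a compact, state-centric reward function.
Result: In challenging simulated environments, LocoMamba achieves higher returns and success rates than state-of-the-art baselines, with fewer collisions, stronger generalization, and better training efficiency.
Insight: Mamba's selective state-space model is efficient and robust in DRL tasks and offers a new approach to vision-driven locomotion policies.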
Abstract: We introduce LocoMamba, a vision-driven cross-modal DRL framework built on selective state-space models, specifically leveraging Mamba, that achieves near-linear-time sequence modeling, effectively captures long-range dependencies, and enables efficient training with longer sequences. First, we embed proprioceptive states with a multilayer perceptron and patchify depth images with a lightweight convolutional neural network, producing compact tokens that improve state representation. Second, stacked Mamba layers fuse these tokens via near-linear-time selective scanning, reducing latency and memory footprint, remaining robust to token length and image resolution, and providing an inductive bias that mitigates overfitting. Third, we train the policy end-to-end with Proximal Policy Optimization under terrain and appearance randomization and an obstacle-density curriculum, using a compact state-centric reward that balances progress, smoothness, and safety. We evaluate our method in challenging simulated environments with static and moving obstacles as well as uneven terrain. Compared with state-of-the-art baselines, our method achieves higher returns and success rates with fewer collisions, exhibits stronger generalization to unseen terrains and obstacle densities, and improves training efficiency by converging in fewer updates under the same compute budget.
[174] ManipDreamer3D : Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory
Ying Li,Xiaobao Wei,Xiaowei Chi,Yuming Li,Zhongyu Zhao,Hao Wang,Ningning Ma,Ming Lu,Shanghang Zhang
Main category: cs.RO
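TL;DR: ManipDreamer3D is a framework that generates realistic robotic manipulation videos by combining 3D trajectory planning, a reconstructed 3D occupancy map, and a trajectory-to-video diffusion model.
Details
Motivation: Data scarcity is a major challenge in robotic manipulation, and existing methods depend on 2D trajectories, which leaves 3D spatial ambiguity unresolved.
Contribution: A video generation framework that combines 3D trajectory planning with occupancy awareness, reducing the need for human intervention.
Method: Reconstructs a 3D occupancy representation from the input image, optimizes a 3D end-effector trajectory, and generates the manipulation video with a trajectory-to-video diffusion model.
Result: Experiments show better visual quality than existing methods.
Insight: A 3D occupancy representation and optimized trajectory planning can make robotic manipulation videos markedly more realistic and plausible.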
Abstract: Data scarcity continues to be a major challenge in the field of robotic manipulation. Although diffusion models provide a promising solution for generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently face issues with 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from the input image and the text instruction. Our method combines 3D trajectory planning with a reconstructed 3D occupancy map created from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory, minimizing path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned plausible 3D trajectories, significantly reducing human intervention requirements. Experimental results demonstrate superior visual quality compared to existing methods.
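The planning step, minimizing path length while avoiding collisions in an occupancy map, can be illustrated on a voxel grid. The abstract does not say which optimizer the authors use, so the sketch below substitutes a plain breadth-first search over 6-connected free voxels, which returns a shortest collision-free path in grid steps. The grid shape, cell coordinates, and function name are hypothetical.

```python
from collections import deque

def shortest_free_path(occupied: set, shape: tuple, start: tuple, goal: tuple):
    """BFS over free voxels; returns a list of cells from start to goal, or None."""
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start via parent links
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for dx, dy, dz in moves:
            nxt = (cell[0] + dx, cell[1] + dy, cell[2] + dz)
            inside = all(0 <= c < s for c, s in zip(nxt, shape))
            if inside and nxt not in occupied and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable without a collision
```

With an obstacle between start and goal, the returned path detours over it through the free voxels above instead of passing through the occupied cell.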
[175] Evaluation of Large Language Models for Anomaly Detection in Autonomous Vehicles
Petros Loukas,David Bassir,Savvas Chatzichristofis,Angelos Amanatiadis
Main category: cs.RO
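TL;DR: Evaluates the potential of large language models (LLMs) as anomaly detectors in autonomous vehicles, with experiments and analysis on real-world edge cases.
Details
Motivation: Existing studies are mostly limited to synthetic or manually driven datasets and lack a systematic evaluation of LLMs on real edge cases. The paper aims to fill this gap.
Contribution: Proposes an architecture that combines an open-vocabulary object detector, prompt engineering, and LLM contextual reasoning, and evaluates several state-of-the-art models on real edge cases.
Method: Couples an open-vocabulary object detector with prompt engineering and uses an LLM for contextual reasoning, then evaluates performance on anomaly detection for autonomous driving.
Result: Provides qualitative comparison results and discusses the potential application of LLMs as anomaly detectors in autonomous vehicles.
Insight: The study shows where the capabilities of LLMs end in complex real-world scenes and suggests directions for follow-up research.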
Abstract: The rapid evolution of large language models (LLMs) has pushed their boundaries to many applications in various domains. Recently, the research community has started to evaluate their potential adoption in autonomous vehicles and especially as complementary modules in the perception and planning software stacks. However, their evaluation has been limited to synthetic datasets or manually driven datasets without ground-truth knowledge of, more precisely, how the current perception and planning algorithms would perform in the cases under evaluation. For this reason, this work evaluates LLMs on real-world edge cases where current autonomous vehicles have been proven to fail. The proposed architecture consists of an open vocabulary object detector coupled with prompt engineering and large language model contextual reasoning. We evaluate several state-of-the-art models against real edge cases and provide qualitative comparison results along with a discussion on the findings for the potential application of LLMs as anomaly detectors in autonomous vehicles.
[176] O$^3$Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation
Tongxuan Tian,Xuhui Kang,Yen-Ling Kuo
Main category: cs.RO
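TL;DR: Proposes O$^3$Afford, a one-shot 3D object-to-object affordance grounding method that combines vision foundation models with large language models and substantially improves generalization to novel objects and categories.
Details
Motivation: Existing affordance grounding work mostly addresses single objects, whereas real-world interactions usually involve pairs of objects. The paper tackles object-to-object affordance grounding under limited data.
Contribution: 1) A one-shot 3D object-to-object affordance grounding method; 2) vision foundation model features combined with point cloud representation for geometric understanding; 3) integration of the 3D affordance representation with large language models to strengthen their understanding of and reasoning about object interactions.
Method: Uses semantic features from vision foundation models together with point cloud representation for geometric understanding, and generalizes through one-shot learning; the representation is then combined with LLMs to generate task-specific constraint functions.
Result: Experiments show that O$^3$Afford significantly outperforms existing baselines in both accuracy and generalization.
Insight: Combining vision foundation models with language models can improve a robot's understanding of object-to-object interactions and offers a new route to generalization on complex tasks.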
Abstract: Grounding object affordance is fundamental to robotic manipulation as it establishes the critical link between perception and action among interacting objects. However, prior works predominantly focus on predicting single-object affordance, overlooking the fact that most real-world interactions involve relationships between pairs of objects. In this work, we address the challenge of object-to-object affordance grounding under limited data constraints. Inspired by recent advances in few-shot learning with 2D vision foundation models, we propose a novel one-shot 3D object-to-object affordance learning approach for robotic manipulation. Semantic features from vision foundation models combined with point cloud representation for geometric understanding enable our one-shot learning pipeline to generalize effectively to novel objects and categories. We further integrate our 3D affordance representation with large language models (LLMs) for robotic manipulation, significantly enhancing LLMs’ capability to comprehend and reason about object interactions when generating task-specific constraint functions. Our experiments on 3D object-to-object affordance grounding and robotic manipulation demonstrate that our O$^3$Afford significantly outperforms existing baselines in terms of both accuracy and generalization capability.
[177] LLaDA-VLA: Vision Language Diffusion Action Models
Yuqing Wen,Hebei Li,Kefan Gu,Yucheng Zhao,Tiancai Wang,Xiaoyan Sun
Main category: cs.RO
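TL;DR: LLaDA-VLA is the first Vision-Language-Diffusion-Action model built on pretrained diffusion-based vision-language models (d-VLMs) for robotic manipulation. A localized special-token classification strategy and a hierarchical action-structured decoding strategy substantially improve performance.
Details
Motivation: Diffusion models perform well in text generation and multimodal tasks, but their use in robot policy learning remains largely unexplored.
Contribution: 1. LLaDA-VLA, the first VLA model built on d-VLMs; 2. localized special-token classification and hierarchical action decoding strategies that address the difficulty of adapting d-VLMs to the robotic domain.
Method: 1. Localized special-token classification simplifies adaptation; 2. hierarchical action-structured decoding handles the dependencies within and across actions.
Result: LLaDA-VLA significantly outperforms existing VLA models in both simulation and real-world robot tasks.
Insight: Diffusion models show promise for robot policy learning, and localized classification and hierarchical decoding are key design directions.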
Abstract: The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulation and real-world robots.
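The first design, classifying over special action tokens rather than the full vocabulary, amounts to normalizing the logits over a subset of token ids. The sketch below shows that restriction in plain Python; the token ids and logit values are made up, and how LLaDA-VLA builds its action vocabulary or training loss is not stated in the abstract.

```python
import math

def localized_softmax(logits: list[float], action_token_ids: list[int]) -> dict[int, float]:
    """Softmax restricted to the action-token subset of the vocabulary."""
    subset = [logits[token_id] for token_id in action_token_ids]
    peak = max(subset)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in subset]
    total = sum(exps)
    return {token_id: e / total for token_id, e in zip(action_token_ids, exps)}

# Hypothetical 4-token vocabulary in which ids 2 and 3 are action tokens.
probs = localized_softmax([5.0, 0.0, 1.0, 1.0], [2, 3])
```

Token 0 has the largest logit in the full vocabulary, yet it receives no probability mass because it lies outside the action subset; the two action tokens split the mass evenly.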
[178] F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Qi Lv,Weijie Kong,Hao Li,Jia Zeng,Zherui Qiu,Delin Qu,Haoming Song,Qizhi Chen,Xiang Deng,Jiangmiao Pang
Main category: cs.RO
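TL;DR: F1 is a pretrained Vision-Language-Action (VLA) model that integrates visual foresight generation into the decision-making pipeline to execute language-conditioned tasks in dynamic visual environments. A Mixture-of-Transformer architecture and multi-stage training substantially improve task success rate and generalization.
Details
Motivation: Existing VLA models are mostly reactive state-to-action mappings, which leads to short-sighted behavior and poor robustness in dynamic scenes. By introducing visual foresight generation, F1 aims to plan actions that serve longer-term goals.
Contribution: 1. The F1 framework, which combines visual foresight generation with decision-making; 2. a Mixture-of-Transformer architecture with a modular design; 3. a three-stage training recipe that strengthens generalization.
Method: F1 uses a Mixture-of-Transformer architecture with perception, foresight generation, and control modules. Its core is a goal-conditioned visual foresight mechanism that recasts action generation as a foresight-guided inverse dynamics problem. Training uses an extensive dataset of over 330k trajectories and proceeds in three stages.
Result: On real-world tasks and simulation benchmarks, F1 consistently outperforms existing methods, with substantial gains in task success rate and generalization.
Insight: Visual foresight generation improves long-horizon planning and robustness in dynamic environments; modular design and multi-stage training are key to generalization.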
Abstract: Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
[179] Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments
Jiahui Yang,Jason Jingzhou Liu,Yulong Li,Youssef Khaky,Kenneth Shaw,Deepak Pathak
Main category: cs.RO
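TL;DR: DRP is a visuo-motor neural motion policy for dynamic environments. Its core is the pretrained IMPACT module, combined with iterative student-teacher finetuning and DCP-RMP, a locally reactive goal-proposal module. It performs well in complex dynamic scenes.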
Details
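Motivation: Classical motion planners require full knowledge of the environment and are slow, while neural motion policies generalize poorly in dynamic scenes. DRP targets both problems with fast reactive motion planning directly on point cloud input.
Contribution: Proposes the DRP framework, which combines the pretrained IMPACT policy, student-teacher finetuning for better static obstacle avoidance, and DCP-RMP for stronger dynamic obstacle avoidance, yielding efficient and generalizable motion planning in dynamic environments.
Method: 1) IMPACT: a transformer-based neural motion policy pretrained on 10 million expert trajectories; 2) iterative student-teacher finetuning to improve static obstacle avoidance; 3) the DCP-RMP module, applied at inference time to improve dynamic obstacle avoidance.
Result: On tasks with cluttered scenes, dynamic obstacles, and goal obstructions, DRP exceeds prior classical and neural methods in success rate and generalization, in both simulation and real-world experiments.
Insight: Combining large-scale pretraining with a locally reactive module improves motion planning in dynamic environments, and operating directly on point clouds simplifies sensor fusion.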
Abstract: Generating collision-free motion in dynamic, partially observable environments is a fundamental challenge for robotic manipulators. Classical motion planners can compute globally optimal trajectories but require full environment knowledge and are typically too slow for dynamic scenes. Neural motion policies offer a promising alternative by operating in closed-loop directly on raw sensory inputs but often struggle to generalize in complex or dynamic settings. We propose Deep Reactive Policy (DRP), a visuo-motor neural motion policy designed for reactive motion generation in diverse dynamic environments, operating directly on point cloud sensory input. At its core is IMPACT, a transformer-based neural motion policy pretrained on 10 million generated expert trajectories across diverse simulation scenarios. We further improve IMPACT’s static obstacle avoidance through iterative student-teacher finetuning. We additionally enhance the policy’s dynamic obstacle avoidance at inference time using DCP-RMP, a locally reactive goal-proposal module. We evaluate DRP on challenging tasks featuring cluttered scenes, dynamic moving obstacles, and goal obstructions. DRP achieves strong generalization, outperforming prior classical and neural methods in success rate across both simulated and real-world settings. Video results and code available at https://deep-reactive-policy.com
cs.CR [Back]
[180] Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
Waris Gill,Natalie Isak,Matthew Dressman
Main category: cs.CR
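TL;DR: BinaryShield is a privacy-preserving threat intelligence sharing system that generates non-invertible attack fingerprints so that LLM services can share threat information across compliance boundaries.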
Details
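Motivation: The widespread deployment of LLM services has made security a pressing concern, but privacy regulations restrict sharing threat intelligence across services. BinaryShield aims to close this gap.
Contribution: Presents BinaryShield, described as the first privacy-preserving threat intelligence sharing system, which combines PII redaction, semantic embedding, and related techniques to generate non-invertible attack fingerprints.
Method: BinaryShield generates privacy-preserving attack fingerprints through PII redaction, semantic embedding, binary quantization, and a randomized response mechanism.
Result: BinaryShield reaches an F1-score of 0.94, well above the SimHash baseline (0.77), with a 64x storage reduction and 38x faster similarity search compared to dense embeddings.
Insight: Privacy protection and threat detection can coexist; BinaryShield offers a workable approach to cross-service security collaboration.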
Abstract: The widespread deployment of LLMs across enterprise services has created a critical security blind spot. Organizations operate multiple LLM services handling billions of queries daily, yet regulatory compliance boundaries prevent these services from sharing threat intelligence about prompt injection attacks, the top security risk for LLMs. When an attack is detected in one service, the same threat may persist undetected in others for months, as privacy regulations prohibit sharing user prompts across compliance boundaries. We present BinaryShield, the first privacy-preserving threat intelligence system that enables secure sharing of attack fingerprints across compliance boundaries. BinaryShield transforms suspicious prompts through a unique pipeline combining PII redaction, semantic embedding, binary quantization, and randomized response mechanism to potentially generate non-invertible fingerprints that preserve attack patterns while providing privacy. Our evaluations demonstrate that BinaryShield achieves an F1-score of 0.94, significantly outperforming SimHash (0.77), the privacy-preserving baseline, while achieving 64x storage reduction and 38x faster similarity search compared to dense embeddings.
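The last two stages of the pipeline named in the abstract (binary quantization and a randomized response mechanism) can be sketched as follows. This is a minimal illustration under stated assumptions, not BinaryShield's implementation: sign-based quantization, the flip probability 1 / (1 + e^epsilon), and every function name here are assumptions, and PII redaction and the embedding model are left out.

```python
import numpy as np

def binary_quantize(embedding):
    """Keep one bit per dimension: 1 where the value is positive, else 0.
    (Assumed quantizer; the abstract does not specify one.)"""
    return (np.asarray(embedding, dtype=np.float64) > 0).astype(np.uint8)

def randomized_response(bits, epsilon, rng):
    """Flip each bit independently with probability 1 / (1 + e^epsilon).
    This is the classic randomized-response rule; a smaller epsilon
    means more flips and therefore more noise."""
    flip_prob = 1.0 / (1.0 + np.exp(epsilon))
    flips = rng.random(bits.shape) < flip_prob
    return np.bitwise_xor(bits, flips.astype(np.uint8))

def hamming_similarity(a, b):
    """Fraction of bit positions on which two fingerprints agree."""
    a = np.asarray(a)
    b = np.asarray(b)
    return 1.0 - np.count_nonzero(a != b) / a.size

def pack_fingerprint(bits):
    """Pack eight bits into each byte for compact storage."""
    return np.packbits(bits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embedding = rng.standard_normal(256)      # stand-in for a prompt embedding
    clean = binary_quantize(embedding)
    noisy = randomized_response(clean, epsilon=2.0, rng=rng)
    print("agreement with clean bits:", hamming_similarity(clean, noisy))
    print("stored bytes:", pack_fingerprint(noisy).nbytes)
```

Comparing fingerprints by Hamming distance needs only XOR and a bit count, which is one plausible reason a binary code can be searched faster than dense vectors.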
[181] Tell-Tale Watermarks for Explanatory Reasoning in Synthetic Media Forensics
Ching-Chun Chang,Isao Echizen
Main category: cs.CR
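TL;DR: The paper proposes an interpretable watermarking system for synthetic media that traces the generation chain by reasoning, from changes in the watermark, about the type and extent of the transformations the media has undergone.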
Details
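Motivation: As synthetic media spreads, public trust in online content declines, and the variety of editing operations complicates forensic analysis. A method is needed to trace the generation history of synthetic media and thereby help assess whether criminal intent is present.
Contribution: Develops a "tell-tale" watermarking system whose watermarks evolve under the same transformations as the carrier media, leaving interpretable traces that support reasoning about the generation chain.
Method: Watermarks are tailored to different classes of transformations and are designed to be neither strictly robust nor fragile, but interpretable. The most plausible transformation history is inferred by reasoning over the combinatorial parameter space of composite transformations.
Result: Experiments demonstrate the validity of the system with respect to fidelity, synchronicity, and traceability.
Insight: Interpretable watermarks provide a new tool for forensic analysis of synthetic media, helping to establish both authenticity and transformation history.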
Abstract: The rise of synthetic media has blurred the boundary between reality and fabrication under the evolving power of artificial intelligence, fueling an infodemic that erodes public trust in cyberspace. For digital imagery, a multitude of editing applications further complicates the forensic analysis, including semantic edits that alter content, photometric adjustments that recalibrate colour characteristics, and geometric projections that reshape viewpoints. Collectively, these transformations manipulate and control perceptual interpretation of digital imagery. This susceptibility calls for forensic enquiry into reconstructing the chain of events, thereby revealing deeper evidential insight into the presence or absence of criminal intent. This study seeks to address an inverse problem of tracing the underlying generation chain that gives rise to the observed synthetic media. A tell-tale watermarking system is developed for explanatory reasoning over the nature and extent of transformations across the lifecycle of synthetic media. Tell-tale watermarks are tailored to different classes of transformations, responding in a manner that is neither strictly robust nor fragile but instead interpretable. These watermarks function as reference clues that evolve under the same transformation dynamics as the carrier media, leaving interpretable traces when subjected to transformations. Explanatory reasoning is then performed to infer the most plausible account across the combinatorial parameter space of composite transformations. Experimental evaluations demonstrate the validity of tell-tale watermarking with respect to fidelity, synchronicity and traceability.
[182] Signal-Based Malware Classification Using 1D CNNs
Jack Wilkie,Hanan Hindy,Ivan Andonovic,Christos Tachtatzis,Robert Atkinson
Main category: cs.CR
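TL;DR: The paper proposes classifying malware from 1D signals rather than the conventional 2D images. Avoiding quantisation noise and the information loss caused by spurious 2D dependencies improves classification performance.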
Details
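Motivation: Converting binaries into 2D images, as conventional malware classification pipelines do, loses information through quantisation noise and spurious 2D dependencies, which limits classifier performance. This work avoids both problems by processing 1D signals directly.
Contribution: 1. Proposes a new representation of malware as 1D signals; 2. Shows how 2D CNN architectures can be adapted to classify 1D signals; 3. Designs a bespoke 1D CNN based on ResNet and squeeze-and-excitation (SE) layers that achieves state-of-the-art performance on the MalNet dataset.
Method: 1. Resize malware binaries directly into 1D signals stored in floating point, which avoids quantisation noise; 2. Adapt 2D CNN architectures to 1D signal classification; 3. Design a 1D CNN that combines ResNet with SE layers.
Result: On the MalNet dataset, the method reaches F1 scores of 0.874, 0.503, and 0.507 on binary, type, and family-level classification respectively, outperforming existing methods.
Insight: Processing 1D signals directly avoids the information loss of image conversion and offers a new modality for malware classification. Future work can explore the 1D signal representation further.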
Abstract: Malware classification is a contemporary and ongoing challenge in cyber-security: modern obfuscation techniques are able to evade traditional static analysis, while dynamic analysis is too resource intensive to be deployed at a large scale. One prominent line of research addresses these limitations by converting malware binaries into 2D images by heuristically reshaping them into a 2D grid before resizing using Lanczos resampling. These images can then be classified based on their textural information using computer vision approaches. While this approach can detect obfuscated malware more effectively than static analysis, the process of converting files into 2D images results in significant information loss due to both quantisation noise, caused by rounding to integer pixel values, and the introduction of 2D dependencies which do not exist in the original data. This loss of signal limits the classification performance of the downstream model. This work addresses these weaknesses by instead resizing the files into 1D signals which avoids the need for heuristic reshaping, and additionally these signals do not suffer from quantisation noise due to being stored in a floating-point format. It is shown that existing 2D CNN architectures can be readily adapted to classify these 1D signals for improved performance. Furthermore, a bespoke 1D convolutional neural network, based on the ResNet architecture and squeeze-and-excitation layers, was developed to classify these signals and evaluated on the MalNet dataset. It was found to achieve state-of-the-art performance on binary, type, and family level classification with F1 scores of 0.874, 0.503, and 0.507, respectively, paving the way for future models to operate on the proposed signal modality.
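The signal representation described above can be sketched as follows: a file's bytes are resampled to a fixed length and kept in floating point, so no rounding to integer pixel values occurs. This is a hedged illustration rather than the authors' code. Linear interpolation via `np.interp` is an assumed stand-in (the abstract does not name the 1D resampling method), the function names are hypothetical, and the squeeze-and-excitation block is the standard formulation applied along one dimension, not the paper's exact layer.

```python
import numpy as np

def bytes_to_signal(data: bytes, length: int) -> np.ndarray:
    """Resample raw bytes to a fixed-length float32 signal in [0, 1]."""
    if not data:
        raise ValueError("empty input")
    raw = np.frombuffer(data, dtype=np.uint8).astype(np.float32) / 255.0
    src = np.linspace(0.0, 1.0, num=raw.size)   # original sample positions
    dst = np.linspace(0.0, 1.0, num=length)     # target sample positions
    return np.interp(dst, src, raw).astype(np.float32)

def quantisation_error(signal: np.ndarray) -> float:
    """Largest error introduced if the signal were rounded to 8-bit pixels."""
    rounded = np.round(signal * 255.0) / 255.0
    return float(np.max(np.abs(signal - rounded)))

def se_block_1d(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-excitation over a (channels, length) feature map.
    w1 has shape (channels // r, channels); w2 has shape (channels, channels // r)."""
    squeezed = x.mean(axis=1)                     # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)       # bottleneck with ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate per channel
    return x * gate[:, None]                      # rescale each channel

if __name__ == "__main__":
    signal = bytes_to_signal(bytes(range(256)) * 4, length=100)
    print(signal.shape, signal.dtype)
    print("error if stored as 8-bit pixels:", quantisation_error(signal))
```

Keeping the resampled values as floats is the step that sidesteps quantisation noise; `quantisation_error` reports what rounding to 8-bit pixels would have discarded.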
astro-ph.IM [Back]
[183] Stereovision Image Processing for Planetary Navigation Maps with Semi-Global Matching and Superpixel Segmentation
Yan-Shan Lu,Miguel Arana-Catania,Saurabh Upadhyay,Leonard Felicetti
Main category: astro-ph.IM
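TL;DR: The paper combines Semi-Global Matching (SGM) with superpixel segmentation to generate stereo-vision navigation maps for Mars rovers. Compared with traditional local block-matching, the method handles low-texture regions, occlusion, and repetitive patterns better, producing more accurate terrain models that are better suited to autonomous navigation.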
Details
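Motivation: Martian terrain is complex and hazardous, so safe rover navigation requires high-precision terrain models. Traditional stereo matching performs poorly on low-texture images, occlusion, and repetitive patterns, which calls for a more robust way to generate depth maps.
Contribution: 1) Proposes a stereo matching method that combines SGM with superpixel segmentation to improve depth-map coherence and detail recovery; 2) Validates the method on a Mars analogue dataset, showing more consistent terrain models and fewer large gaps; 3) Provides a complete processing pipeline from feature matching to the final 2D navigation map.
Method: SGM computes the base disparity map, and superpixel segmentation is applied as a post-processing refinement that uses scene context to reduce block artefacts and recover detail.
Result: On a Mars analogue and two other datasets, the method produces more precise disparity maps and terrain models, performs better in sloped and occlusion-prone regions in particular, and improves the quality of the navigation maps.
Insight: Combining global optimisation with local segmentation markedly improves the robustness of stereo matching, especially in complex terrain, and offers a practical route for future planetary exploration missions.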
Abstract: Mars exploration requires precise and reliable terrain models to ensure safe rover navigation across its unpredictable and often hazardous landscapes. Stereoscopic vision serves a critical role in the rover’s perception, allowing scene reconstruction by generating precise depth maps through stereo matching. State-of-the-art Martian planetary exploration uses traditional local block-matching, aggregates cost over square windows, and refines disparities via smoothness constraints. However, this method often struggles with low-texture images, occlusion, and repetitive patterns because it considers only limited neighbouring pixels and lacks a wider understanding of scene context. This paper uses Semi-Global Matching (SGM) with superpixel-based refinement to mitigate the inherent block artefacts and recover lost details. The approach balances the efficiency and accuracy of SGM and adds context-aware segmentation to support more coherent depth inference. The proposed method has been evaluated in three datasets with successful results: In a Mars analogue, the terrain maps obtained show improved structural consistency, particularly in sloped or occlusion-prone regions. Large gaps behind rocks, which are common in raw disparity outputs, are reduced, and surface details like small rocks and edges are captured more accurately. Another two datasets, evaluated to test the method’s general robustness and adaptability, show more precise disparity maps and more consistent terrain models, better suited for the demands of autonomous navigation on Mars, and competitive accuracy across both non-occluded and full-image error metrics. This paper outlines the entire terrain modelling process, from finding corresponding features to generating the final 2D navigation maps, offering a complete pipeline suitable for integration in future planetary exploration missions.
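The two ingredients named in the abstract can be sketched as follows. The first function is the standard SGM path-cost recurrence (Hirschmüller's formulation) along a single left-to-right path of one scanline; a full SGM sums such costs over several path directions. The second is an assumed, deliberately simple form of segment-based refinement that fills invalid disparities with the median of their superpixel. The paper's actual refinement is not specified in the abstract, and the names `p1`, `p2`, `labels`, and `invalid` are illustrative.

```python
import numpy as np

def aggregate_left_to_right(cost, p1, p2):
    """SGM aggregation along one path of a (width, ndisp) matching-cost slice.
    p1 penalises a disparity change of 1, p2 any larger change."""
    cost = np.asarray(cost, dtype=np.float64)
    width, ndisp = cost.shape
    agg = np.empty_like(cost)
    agg[0] = cost[0]
    for x in range(1, width):
        prev = agg[x - 1]
        prev_min = prev.min()
        from_below = np.concatenate(([np.inf], prev[:-1])) + p1   # came from d - 1
        from_above = np.concatenate((prev[1:], [np.inf])) + p1    # came from d + 1
        best = np.minimum(np.minimum(prev, from_below),
                          np.minimum(from_above, prev_min + p2))
        agg[x] = cost[x] + best - prev_min   # subtracting prev_min bounds growth
    return agg

def fill_gaps_by_segment(disparity, labels, invalid=-1.0):
    """Fill invalid pixels with the median valid disparity of their segment.
    (Assumed refinement rule; segments with no valid pixel are left as-is.)"""
    out = np.asarray(disparity, dtype=np.float64).copy()
    labels = np.asarray(labels)
    for label in np.unique(labels):
        in_segment = labels == label
        valid = in_segment & (out != invalid)
        if valid.any():
            out[in_segment & ~valid] = np.median(out[valid])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slice_cost = rng.random((6, 4))          # toy cost slice: 6 pixels, 4 disparities
    agg = aggregate_left_to_right(slice_cost, p1=0.1, p2=0.5)
    print("per-pixel disparity along this path:", agg.argmin(axis=1))
```

The smoothness penalties are what let SGM carry context along a path, and the segment-level fill illustrates how a superpixel map could close the gaps behind rocks that the abstract describes.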