Table of Contents
- cs.CL [Total: 30]
- cs.CV [Total: 72]
- cs.LG [Total: 10]
- cs.HC [Total: 1]
- cs.CY [Total: 2]
- cs.AI [Total: 5]
- cs.MA [Total: 1]
- cs.IR [Total: 2]
- cs.GR [Total: 2]
- cs.SI [Total: 1]
- cs.CR [Total: 1]
- cs.IT [Total: 1]
- cs.RO [Total: 8]
cs.CL
[1] Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition
Shreya G. Upadhyay,Carlos Busso,Chi-Chun Lee
Main category: cs.CL
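TL;DR: The paper proposes a speaker-style-aware phoneme anchoring framework to improve cross-lingual speech emotion recognition (SER), using graph-based clustering and dual-space anchoring to achieve better emotion transfer across languages.
Details
Motivation: Cross-lingual SER is challenging because of phonetic variability and differences in speaker-specific expressive styles; it needs a method that aligns emotional expression across speakers and languages.
Contribution: A speaker-style-aware phoneme anchoring framework that combines graph-based clustering with dual-space anchoring and improves cross-lingual emotion recognition.
Method: Builds emotion-specific speaker communities via graph-based clustering, then applies dual-space anchoring in the speaker and phoneme spaces.
Result: Experiments on the MSP-Podcast and BIIC-Podcast datasets show that the method outperforms baseline models.
Insight: Emotional expression shares commonalities across languages; speaker style and phoneme alignment are key factors.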
Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
[2] CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
Nithin Somasekharan,Ling Yue,Yadi Cao,Weichao Li,Patrick Emami,Pochinapeddi Sai Bhargav,Anurag Acharya,Xingyu Xie,Shaowu Pan
Main category: cs.CL
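TL;DR: CFD-LLMBench is a benchmark suite for evaluating large language models (LLMs) in computational fluid dynamics (CFD), covering three areas: knowledge, numerical reasoning, and workflow implementation.
Details
Motivation: LLMs perform well on general NLP tasks, but their use in automating numerical experiments on complex physical systems remains underexplored. CFD, a core area of computational science, poses a distinctive challenge for evaluating the scientific capabilities of LLMs.
Contribution: The CFD-LLMBench benchmark suite, with three components (CFDQuery, CFDCodeBench, and FoamBench) that together evaluate LLM knowledge, reasoning, and implementation ability in CFD.
Method: An evaluation framework grounded in real-world CFD practice, combining a task taxonomy with rigorous metrics (code executability, solution accuracy, and numerical convergence).
Result: The benchmark lays a solid foundation for applying LLMs to the automation of numerical experiments on complex physical systems.
Insight: The complexity and practical nature of CFD make it a high-value evaluation setting for LLMs and can help advance their use in scientific computing.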
Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems – a critical and labor-intensive component – remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components – CFDQuery, CFDCodeBench, and FoamBench – designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
[3] SKILL-RAG: Self-Knowledge Induced Learning and Filtering for Retrieval-Augmented Generation
Tomoaki Isoda
Main category: cs.CL
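TL;DR: SKILL-RAG is a new method that uses a language model's self-knowledge to filter irrelevant content in retrieval-augmented generation, and improves performance through a reinforcement learning framework.
Details
Motivation: In retrieval-augmented generation (RAG), the retrieval system may return irrelevant content, which leads to hallucinated output. Selecting useful external knowledge is therefore a key challenge.
Contribution: 1. Proposes SKILL-RAG, which uses the model's self-knowledge to guide the selection of retrieved content; 2. Designs a reinforcement learning-based training framework that explicitly elicits the model's self-knowledge; 3. Shows experimentally that the method improves generation quality and reduces the number of input documents.
Method: 1. A reinforcement learning-based training framework that explicitly elicits the model's self-knowledge; 2. Sentence-level filtering of irrelevant content; 3. Combining internal and external knowledge to optimize retrieval selection.
Result: Evaluated with Llama2-7B and Qwen3-8B, SKILL-RAG improves generation quality and significantly reduces the number of input documents.
Insight: The model's self-knowledge (knowing which content is useful and which is not) is critical for RAG; it reduces hallucination and improves efficiency.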
Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG performance. To better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model “knows” and “does not know” (which is also called “self-knowledge”). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model’s self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employ sentence-level granularity to filter out irrelevant content while preserving useful knowledge. We evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.
[4] ShortCheck: Checkworthiness Detection of Multilingual Short-Form Videos
Henrik Vatndal,Vinay Setty
Main category: cs.CL
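TL;DR: ShortCheck is a modular, inference-only pipeline that detects checkworthy multilingual short-form videos to help human fact-checkers work more efficiently.
Details
Motivation: The multimodal, dynamic, and noisy content on short-form video platforms such as TikTok poses distinct challenges for misinformation detection, and an automated tool is needed to assist human fact-checking.
Contribution: The ShortCheck system, which integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification modules, and focuses on checkworthiness detection for multilingual short-form videos.
Method: A modular pipeline that combines multimodal feature extraction (speech, text, and visual) with automated inference; the system is validated on manually annotated datasets.
Result: On multilingual datasets of TikTok videos, ShortCheck achieves a weighted F1 score above 70%.
Insight: Misinformation detection for short-form video needs to draw on multimodal features together; a modular design can flexibly integrate multiple techniques and improve detection.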
Abstract: Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results with an F1-weighted score over 70%.
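For readers unfamiliar with the reported metric, the following is a minimal sketch of a support-weighted F1 score, assuming "F1-weighted" means the usual per-class F1 averaged by class frequency. The label names in the usage note are hypothetical and are not taken from the paper.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # Count true positives and false positives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1
    return score
```

With hypothetical labels, `weighted_f1(["check", "check", "skip", "skip"], ["check", "skip", "skip", "skip"])` averages a per-class F1 of 2/3 and 4/5 with equal weights.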
[5] MARS: toward more efficient multi-agent collaboration for LLM reasoning
Xiao Wang,Jia Wang,Yijie Wang,Pengtao Dang,Sha Cao,Chi Zhang
Main category: cs.CL
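TL;DR: MARS is a role-based multi-agent collaboration framework that improves LLM reasoning through an author, reviewer, and meta-reviewer process while reducing computational overhead.
Details
Motivation: A single LLM has limited reasoning ability, and existing Multi-Agent Debate (MAD) methods are effective but computationally expensive. MARS aims to improve reasoning quality while lowering computational cost.
Contribution: The MARS framework, which assigns roles (author, reviewer, meta-reviewer) to reduce interaction overhead between agents, substantially cutting token consumption and inference time while maintaining accuracy.
Method: MARS uses three levels of collaboration: the author generates an initial solution, reviewers comment independently, and the meta-reviewer integrates the feedback and makes the decision. Reviewers do not interact with one another directly.
Result: Experiments show that MARS matches the accuracy of MAD while reducing token usage and inference time by approximately 50%.
Insight: Role assignment and reduced interaction are key to efficient reasoning in multi-agent collaboration, suited to settings that must balance performance against cost.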
Abstract: Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
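The control flow described in the abstract can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `max_rounds` parameter, and the toy agents are all hypothetical, and real agents would be LLM calls rather than plain functions.

```python
def mars_round(question, author, reviewers, meta_reviewer, max_rounds=2):
    """Author drafts, reviewers comment independently, meta-reviewer decides."""
    answer = author(question, "")
    for _ in range(max_rounds):
        # Each reviewer sees only the question and the draft, never other reviews.
        reviews = [review(question, answer) for review in reviewers]
        accept, feedback = meta_reviewer(question, answer, reviews)
        if accept:
            break
        answer = author(question, feedback)  # revise using the merged feedback
    return answer

# Toy stand-ins for LLM agents (hypothetical).
def toy_author(question, feedback):
    # The first draft is deliberately wrong; the revision fixes it.
    return "4" if feedback else "5"

def toy_reviewer(question, answer):
    ok = answer == "4"
    return ok, ("" if ok else "recheck the arithmetic")

def toy_meta_reviewer(question, answer, reviews):
    accepts = sum(1 for ok, _ in reviews if ok)
    comments = "; ".join(comment for ok, comment in reviews if not ok)
    return accepts * 2 > len(reviews), comments
```

Because reviewers never read each other's output, the number of model calls per round grows linearly with the number of reviewers, which is the cost argument the abstract makes against round-table debate.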
[6] SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations
Ayan Sar,Pranav Singh Puri,Sumit Aich,Tanupriya Choudhury,Abhijit Kumar
Main category: cs.CL
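TL;DR: SwasthLLM is a unified cross-lingual, multi-task, meta-learning zero-shot framework that performs medical diagnosis using contrastive representations; it works across English, Hindi, and Bengali without language-specific fine-tuning.
Details
Motivation: In multilingual healthcare settings, automatic disease diagnosis from clinical text is challenging because annotated data is scarce for low-resource languages and language varies across populations.
Contribution: The SwasthLLM framework, which combines cross-lingual, multi-task, and contrastive learning, supports zero-shot diagnosis, and performs well on low-resource languages.
Method: An XLM-RoBERTa encoder with a language-aware attention mechanism, a Siamese contrastive learning module, and a translation consistency module, trained with multi-task learning and MAML meta-learning.
Result: Test accuracy reaches 97.22% in the supervised setting; in the zero-shot setting, accuracy is 92.78% on Hindi and 73.33% on Bengali.
Insight: Combining contrastive learning with multi-task learning improves the generalization of cross-lingual medical diagnosis, particularly for low-resource languages.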
Abstract: In multilingual healthcare environments, automatic disease diagnosis from clinical text remains a challenging task due to the scarcity of annotated medical data in low-resource languages and the linguistic variability across populations. This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis that operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. At its core, SwasthLLM leverages the multilingual XLM-RoBERTa encoder augmented with a language-aware attention mechanism and a disease classification head, enabling the model to extract medically relevant information regardless of the language structure. To align semantic representations across languages, a Siamese contrastive learning module is introduced, ensuring that equivalent medical texts in different languages produce similar embeddings. Further, a translation consistency module and a contrastive projection head reinforce language-invariant representation learning. SwasthLLM is trained using a multi-task learning strategy, jointly optimizing disease classification, translation alignment, and contrastive learning objectives. Additionally, we employ Model-Agnostic Meta-Learning (MAML) to equip the model with rapid adaptation capabilities for unseen languages or tasks with minimal data. Our phased training pipeline emphasizes robust representation alignment before task-specific fine-tuning. Extensive evaluation shows that SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings. Crucially, in zero-shot scenarios, it attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text, demonstrating strong generalization in low-resource contexts.
[7] Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures
Sampurna Roy,Ayan Sar,Anurag Kaushish,Kanav Gupta,Tanupriya Choudhury,Abhijit Kumar
Main category: cs.CL
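TL;DR: DS-MoE is a dynamic reasoning-chain framework that uses depth-specialized mixture-of-experts modules to select a computation path according to input complexity, improving efficiency and reasoning quality.
Details
Motivation: Conventional Transformers apply the same processing depth to every input, which wastes resources and limits reasoning quality. The authors aim to improve computational efficiency and reasoning ability through dynamic depth specialization.
Contribution: 1. Proposes the DS-MoE framework, extending the mixture-of-experts paradigm from width-based to depth-specialized computation; 2. A dynamic routing network selects expert modules according to input complexity; 3. Reports gains in efficiency, reasoning quality, and interpretability.
Method: DS-MoE contains expert modules optimized for different reasoning depths (e.g., shallow pattern recognition, logical inference); a routing network dynamically assembles reasoning chains and activates only the modules needed. Training and evaluation use The Pile dataset.
Result: Experiments show that DS-MoE achieves up to 16% computational savings and 35% faster inference, with 2.8% higher accuracy on complex reasoning tasks. Routing decisions also yield interpretable reasoning chains.
Insight: Depth-specialized modular processing can improve efficiency, reasoning quality, and interpretability at the same time, pointing to a new direction for adaptive neural architectures.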
Abstract: Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we propose Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths: shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. We train and evaluate DS-MoE on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.
[8] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions
Jungsoo Park,Ethan Mendes,Gabriel Stanovsky,Alan Ritter
Main category: cs.CL
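TL;DR: The paper proposes predicting the performance of large language models (LLMs) from task descriptions before any experiments are run, and builds a dataset called PRECOG; experiments show the task is feasible.
Details
Motivation: Progress in LLMs is constrained by an evaluation bottleneck. The paper aims to reduce experimental cost by predicting model performance from the task description and configuration alone.
Contribution: 1. Introduces the PRECOG dataset; 2. Shows that predicting model performance from text descriptions alone is feasible; 3. Analyzes how models differ in their behavior when making predictions.
Method: 1. Uses a retrieval module that excludes source papers; 2. Predicts model scores from the task description and configuration; 3. Achieves low mean absolute error at high-confidence thresholds.
Result: Prediction performance is moderate but feasible, with mean absolute error as low as 8.7 on the Accuracy subset; GPT-5 still attains nontrivial prediction accuracy on newly released datasets.
Insight: 1. Stronger reasoning models show diverse querying behavior; 2. Open-source models lag in retrieval diversity; 3. Prediction remains effective in the zero-leakage setting.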
Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model’s score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.
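The phrase "mean absolute error at high-confidence thresholds" can be made concrete with a small sketch. It assumes each forecast carries a self-reported confidence in [0, 1] and that low-confidence forecasts are simply dropped before scoring; the tuple layout and function name are hypothetical, and the numbers in the usage note are made up.

```python
def mae_at_confidence(predictions, min_confidence):
    """Score only forecasts whose confidence meets the threshold.

    predictions: list of (predicted_score, true_score, confidence) tuples.
    Returns (mean absolute error, coverage); MAE is None if nothing is kept.
    """
    kept = [(pred, true) for pred, true, conf in predictions
            if conf >= min_confidence]
    if not kept:
        return None, 0.0
    mae = sum(abs(pred - true) for pred, true in kept) / len(kept)
    coverage = len(kept) / len(predictions)
    return mae, coverage
```

For made-up forecasts `[(70.0, 65.0, 0.9), (40.0, 60.0, 0.3), (82.0, 80.0, 0.8)]`, raising the threshold from 0.0 to 0.75 drops the one poor forecast, lowering the error while reducing coverage. Reporting both numbers makes that trade-off visible.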
[9] Confidence-guided Refinement Reasoning for Zero-shot Question Answering
Youwon Jang,Woo Suk Choi,Minjoon Jung,Minsu Lee,Byoung-Tak Zhang
Main category: cs.CL
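TL;DR: The paper proposes Confidence-guided Refinement Reasoning (C2R), a training-free framework for question answering across text, image, and video. C2R constructs and refines sub-questions and their answers (sub-QAs) to derive a better confidence score for the target answer, then selects the most reliable final answer.
Details
Motivation: Existing QA systems may follow a single reasoning path or estimate confidence inaccurately on multimodal tasks; C2R addresses this through diverse sub-questions and confidence-guided refinement.
Contribution: 1. Proposes the training-free C2R framework, which integrates seamlessly with existing models; 2. Analyzes how the quantity and quality of sub-questions affect model behavior.
Method: C2R first curates a diverse subset of sub-QAs, then compares the confidence scores of the resulting answer candidates to select the best answer.
Result: C2R yields consistent performance improvements across multiple models and benchmarks.
Insight: The quantity and quality of sub-questions significantly affect model robustness and reliability, and diverse reasoning paths improve performance.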
Abstract: We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.
[10] Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching
Songze Li,Zhiqiang Liu,Zhengke Gui,Huajun Chen,Wen Zhang
Main category: cs.CL
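TL;DR: Proposes the Enrich-on-Graph (EoG) framework, which uses the prior knowledge of LLMs to enrich knowledge graphs (KGs), narrowing the semantic gap between graphs and queries and enabling efficient reasoning; it achieves state-of-the-art performance on KGQA tasks.
Details
Motivation: Knowledge graph question answering (KGQA) suffers from a semantic gap between structured KGs and unstructured queries, which causes LLMs to hallucinate and make factual errors. Existing methods do not address this gap and are resource-intensive and hard to scale.
Contribution: 1. Proposes the Enrich-on-Graph framework, which enriches KGs with LLMs to narrow the semantic gap; 2. Proposes three graph quality evaluation metrics for analyzing query-graph alignment; 3. Achieves state-of-the-art performance on two KGQA benchmark datasets.
Method: Enrich-on-Graph uses the prior knowledge of LLMs to enrich KGs and produce higher-quality evidence for reasoning, while keeping computational cost low and the approach scalable. The three proposed metrics quantify the quality of graph-query alignment.
Result: Experiments on KGQA benchmark datasets show that EoG efficiently generates high-quality KGs and reaches state-of-the-art performance.
Insight: Combining the prior knowledge of LLMs with structured KGs bridges the semantic gap and improves the accuracy and robustness of complex reasoning.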
Abstract: Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs’ prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in the KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.
[11] Leveraging What’s Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
Taehee Park,Heejin Do,Gary Geunbae Lee
Main category: cs.CL
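TL;DR: PoCO is a new method that has an LLM deliberately overcorrect to maximize recall, then uses a smaller model to refine the output and raise precision, balancing recall and precision in grammatical error correction.
Details
Motivation: Small language models (sLMs) are usually high-precision but low-recall in grammatical error correction, while large language models (LLMs) show the opposite tendency and overcorrect, which lowers precision. PoCO aims to combine the strengths of both.
Contribution: The PoCO method, in which an LLM deliberately overcorrects to raise recall and a smaller model then refines the result to raise precision, balancing grammatical error correction performance.
Method: PoCO has two steps: 1) trigger overcorrection with an LLM to maximize recall; 2) refine the overcorrected output with a fine-tuned smaller model to improve precision.
Result: Experiments show that PoCO raises recall while keeping precision competitive, improving the overall quality of grammatical error correction.
Insight: Combining the generative power of LLMs with the reliability of smaller models can improve grammatical error correction.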
Abstract: Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.
[12] Distilling Many-Shot In-Context Learning into a Cheat Sheet
Ukyo Honda,Soichiro Murakami,Peinan Zhang
Main category: cs.CL
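TL;DR: The paper proposes cheat-sheet ICL, which condenses the information in many-shot in-context learning into a concise textual summary and substantially reduces the computation needed at inference time.
Details
Motivation: Conventional many-shot in-context learning demands heavy computation because of its long input. The authors seek a more efficient way to reach similar performance.
Contribution: Cheat-sheet ICL, which distills the information from many-shot ICL into a concise summary and substantially reduces the number of tokens at inference time.
Method: Distills the information from many-shot ICL into a "cheat sheet", then uses only the short cheat sheet as context at inference time.
Result: On challenging reasoning tasks, cheat-sheet ICL performs comparably to or better than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without test-time retrieval.
Insight: Cheat-sheet ICL is a practical, efficient alternative for applying LLMs to downstream tasks.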
Abstract: Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.
[13] WeFT: Weighted Entropy-driven Fine-Tuning for dLLMs
Guowei Xu,Wenxin Xu,Jiawang Zhao,Kaisheng Ma
Main category: cs.CL
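TL;DR: WeFT is a weighted, entropy-driven fine-tuning method for diffusion language models (dLLMs). It assigns token weights based on entropy to address the difficulty of applying conventional supervised fine-tuning to diffusion models, and substantially improves results on reasoning benchmarks.
Details
Motivation: Diffusion models show promise in language modeling, but their generation process is hard to predict and they lack precise probability estimates, which limits supervised fine-tuning. A method is needed to control the key tokens that steer generation.
Contribution: WeFT, an entropy-based weighted supervised fine-tuning technique that substantially improves diffusion language models on reasoning tasks (relative improvements of 39% to 83%).
Method: Drawing on diffusion theory, assigns each token an entropy-based weight during fine-tuning, giving better control over the direction of generation.
Result: On several reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500), WeFT achieves relative improvements of 39% to 83% over standard SFT.
Insight: Entropy-based weighting captures the key tokens in the generation process and improves the controllability and performance of diffusion language models.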
Abstract: Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models, where tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: training on s1K, s1K-1.1, and 3k samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
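The general idea of entropy-based token weighting can be illustrated with a toy loss. This is not the WeFT objective: the abstract does not give the weighting formula, which the authors derive from diffusion theory. The sketch simply assumes that each token's weight is its predictive entropy, normalized over the sequence, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_nll(token_probs, target_ids):
    """Negative log-likelihood with per-token weights proportional to entropy.

    token_probs: one probability distribution per position.
    target_ids: index of the gold token at each position (must have p > 0).
    """
    entropies = [entropy(p) for p in token_probs]
    total = sum(entropies)
    n = len(entropies)
    # Assumption: fall back to uniform weights if every distribution is one-hot.
    weights = [e / total for e in entropies] if total > 0 else [1.0 / n] * n
    losses = [-math.log(p[t]) for p, t in zip(token_probs, target_ids)]
    return sum(w * loss for w, loss in zip(weights, losses))
```

Under this weighting, positions where the model is already certain contribute almost nothing, so the training signal concentrates on uncertain tokens.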
[14] Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models
Pittawat Taveekitworachai,Natpatchara Pongjirapat,Krittaphas Chaisutyakorn,Piyalitt Ittichaiwong,Tossaporn Saengja,Kunat Pipatanakul
Main category: cs.CL
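TL;DR: The paper studies how medical reasoning models (MRMs) can generate ranked lists of answers, examines two approaches (prompting and fine-tuning), and shows experimentally that reinforcement fine-tuning (RFT) is more robust across answer formats.
Details
Motivation: Clinical decision-making usually weighs multiple candidate answers rather than one, but current medical reasoning models produce only a single answer, which limits their usefulness in practice.
Contribution: 1) Investigates two approaches (prompting and fine-tuning) for getting MRMs to generate ranked lists; 2) Designs new reward functions targeted at ranked lists; 3) Demonstrates the robustness of RFT across multiple answer formats.
Method: 1) Prompting: steering the model to produce a list of answers; 2) Supervised fine-tuning (SFT): imitating annotated responses; 3) Reinforcement fine-tuning (RFT): encouraging exploration through a reward function.
Result: RFT is more robust across multiple answer formats, whereas SFT generalizes only to certain formats. MRMs can recognize valid answers but do not always select the benchmark's preferred ground truth.
Insight: Generating ranked lists is valuable for clinical decision-making, and reinforcement learning shows promise for improving model robustness.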
Abstract: This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format: ranked lists and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM’s response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark’s preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists. We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.
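The abstract does not define its reward functions, so the following is only a hypothetical stand-in showing what a reward for a ranked-list answer could look like: a reciprocal-rank score that pays more when a valid answer appears earlier in the list. The function name and the case-insensitive exact-match rule are assumptions.

```python
def reciprocal_rank_reward(ranked_answers, gold_answers):
    """Return 1/rank of the first listed answer that matches any gold answer."""
    gold = {g.strip().lower() for g in gold_answers}
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer.strip().lower() in gold:
            return 1.0 / rank
    return 0.0  # no valid answer anywhere in the list
```

A reward of this shape gives partial credit for a correct answer in second or third place, which a single-answer exact-match reward cannot express.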
[15] MemLens: Uncovering Memorization in LLMs with Activation Trajectories
Zirui He,Haiyan Zhao,Ali Payani,Mengnan Du
Main category: cs.CL
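TL;DR: MemLens is an activation-lens method that detects memorization by analyzing the probability trajectories of numeric tokens during LLM generation, separating the reasoning trajectories of contaminated and clean samples.
Details
Motivation: Existing methods detect memorization through surface-level lexical overlap and perplexity; they generalize poorly and degrade on implicitly contaminated data. MemLens addresses this with activation trajectories.
Contribution: The MemLens method, which reveals memorization by analyzing probability trajectories and shows that contaminated and clean samples follow clearly different reasoning trajectories.
Method: Analyzes the probability trajectories of numeric tokens during generation, distinguishing the "shortcut" behavior of contaminated samples from the gradual reasoning pattern of clean samples; samples injected through LoRA fine-tuning are used for validation.
Result: Contaminated samples lock onto an answer with high confidence in early layers, while clean samples accumulate evidence gradually across the model's full depth, which supports the method's validity.
Insight: Activation trajectories capture genuine signals of memorization rather than spurious correlations, offering a new perspective for detecting memorization in LLMs.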
Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit “shortcut” behaviors, locking onto an answer with high confidence in the model’s early layers, whereas clean samples show more gradual evidence accumulation across the model’s full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
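One simple way to summarize the "early lock-in" pattern is to find the first layer from which the answer token's probability stays above a threshold. The sketch below assumes the per-layer probabilities have already been extracted; the abstract does not say how MemLens obtains them or which statistic it uses, so the function, the threshold, and the trajectories in the usage note are all hypothetical.

```python
def lock_in_layer(trajectory, threshold=0.9):
    """Index of the first layer after which probability never drops below threshold.

    trajectory: probability of the answer token at each layer, shallow to deep.
    Returns None if the final layers do not stay at or above the threshold.
    """
    lock = None
    for layer, prob in enumerate(trajectory):
        if prob >= threshold:
            if lock is None:
                lock = layer  # start of a candidate lock-in run
        else:
            lock = None  # confidence dropped, so the run does not count
    return lock
```

With made-up numbers, a contaminated-looking trajectory such as `[0.2, 0.95, 0.97, 0.99, 0.99]` locks in at layer 1, while a gradual one such as `[0.05, 0.1, 0.3, 0.6, 0.92]` locks in only at the last layer.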
[16] SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials
Qixin Wan,Zilong Wang,Jingwen Zhou,Wanting Wang,Ziheng Geng,Jiachen Liu,Ran Cao,Minghui Cheng,Lu Cheng
Main category: cs.CL
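TL;DR: SoM-1K is a multimodal benchmark dataset of 1,065 problems for evaluating foundation models on strength of materials problems. Current foundation models perform poorly on the task; the best model reaches only 56.6% accuracy. Text descriptions of images (DoI) improve performance more than direct image input.
Details
Motivation: How foundation models perform on complex multimodal engineering problems is unclear, and strength of materials in particular lacks a large-scale evaluation dataset. The authors propose the SoM-1K dataset and the DoI strategy to fill this gap.
Contribution: 1. Introduces SoM-1K, the first large-scale multimodal benchmark dataset for strength of materials; 2. Proposes the DoI strategy, which replaces image input with expert-generated text descriptions to improve model performance; 3. Evaluates eight foundation models and exposes their limitations on engineering problems.
Method: 1. Builds the SoM-1K dataset, containing textual problem statements and schematic diagrams; 2. Designs the DoI strategy, which supplies text descriptions of the images as context; 3. Evaluates eight LLMs and VLMs and performs an error analysis.
Result: Current foundation models perform poorly, with the best model at 56.6% accuracy; LLMs given DoI often outperform VLMs given the images directly; DoI reduces visual misinterpretation errors.
Insight: For current foundation models, text descriptions may be more effective than image input in multimodal reasoning, which highlights the need for more robust multimodal foundation models in engineering.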
Abstract: Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-world engineering tasks by including both textual problem statements and schematic diagrams. Due to the limited capabilities of current foundation models in understanding complicated visual information, we propose a novel prompting strategy called Descriptions of Images (DoI), which provides rigorous expert-generated text descriptions of the visual diagrams as the context. We evaluate eight representative foundation models, including both large language models (LLMs) and vision language models (VLMs). Our results show that current foundation models struggle significantly with these engineering problems, with the best-performing model achieving only 56.6% accuracy. Interestingly, we found that LLMs, when provided with DoI, often outperform VLMs provided with visual diagrams. A detailed error analysis reveals that DoI plays a crucial role in mitigating visual misinterpretation errors, suggesting that accurate text-based descriptions can be more effective than direct image input for current foundation models. This work establishes a rigorous benchmark for engineering AI and highlights a critical need for developing more robust multimodal reasoning capabilities in foundation models, particularly in scientific and engineering contexts.
[17] Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Yixin Wan,Xingrun Chen,Kai-Wei Chang
Main category: cs.CL
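TL;DR: The paper reveals a cultural positioning bias in large language models (LLMs) and proposes two inference-time mitigation methods to reduce generations that default to a mainstream US perspective.
Details
Motivation: LLMs tend to generate content from the perspective of mainstream US culture and neglect the representation of other cultures, which can raise fairness concerns.
Contribution: 1. Identifies and systematically studies cultural positioning bias in LLMs; 2. Proposes the CultureLens benchmark and evaluation metrics; 3. Develops two inference-time mitigation methods (FIP and MFA) to reduce the bias.
Method: 1. The CultureLens benchmark quantifies cultural bias through an interview script generation task; 2. Two mitigation methods: (1) the prompt-based FIP method; (2) the agent-based MFA framework, in single-agent and multi-agent forms.
Result: Experiments show that LLMs adopt an insider perspective for the mainstream (US) culture (over 88%) and mainly an outsider perspective for non-mainstream cultures. The agent-based methods effectively reduce the bias.
Insight: The multi-agent system (MFA-MA), with its hierarchy of agents (planning, critique, and refinement), corrects bias more systematically and points to a new direction for LLM fairness research.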
Abstract: Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM’s default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
[18] Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction
Lei Hei,Tingjing Liao,Yingxin Pei,Yiyang Qi,Jiaqi Wang,Ruiting Li,Feiliang Ren
Main category: cs.CL
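TL;DR: The paper proposes ROC, a multimodal relation extraction framework that captures relation semantics through retrieval rather than conventional classification. It combines entity type and position information, uses a large language model to generate relation descriptions, and aligns entity-relation pairs with contrastive learning, improving both performance and interpretability.
Details
Motivation: Conventional multimodal relation extraction follows a classification paradigm that overlooks structural constraints (such as entity types and positional cues) and lacks the semantic expressiveness needed for fine-grained relation understanding, which limits performance.
Contribution: 1. Proposes the ROC framework, reformulating multimodal relation extraction as a retrieval task based on semantic similarity; 2. Integrates entity type and position information; 3. Expands relation labels into natural language descriptions; 4. Designs a semantic alignment method based on contrastive learning.
Method: 1. A multimodal encoder integrates entity type and position information; 2. A large language model generates natural language relation descriptions; 3. Contrastive learning aligns entity-relation pairs by semantic similarity.
Result: State-of-the-art performance on the MNRE and MORE benchmarks, with stronger robustness and interpretability.
Insight: A retrieval paradigm is better suited than classification to capturing relation semantics, and combining structural information with natural language descriptions improves multimodal relation extraction.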
Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose Retrieval Over Classification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
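The retrieval formulation described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the embeddings are hand-written toy vectors standing in for the outputs of a multimodal encoder and a text encoder, and the relation labels and the `retrieve_relation` helper are hypothetical names chosen for illustration. It only shows the idea of picking a relation by similarity to its description instead of by a classifier head.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_relation(pair_embedding, relation_embeddings):
    """Return (label, score) for the relation description closest to the entity pair."""
    scores = {
        label: cosine(pair_embedding, emb)
        for label, emb in relation_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy vectors (assumption): in practice these would come from trained encoders.
relation_embeddings = {
    "member_of": [0.9, 0.1, 0.0],
    "place_of_residence": [0.1, 0.9, 0.1],
    "none": [0.0, 0.1, 0.9],
}
pair_embedding = [0.8, 0.2, 0.1]

label, score = retrieve_relation(pair_embedding, relation_embeddings)
print(label, round(score, 3))
```

Because relations are represented by embedded descriptions, adding a relation in this sketch only requires embedding one more description rather than retraining an output layer.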
[19] Who’s Laughing Now? An Overview of Computational Humour Generation and Explanation
Tyler Loakman,William Thorne,Chenghua Lin
Main category: cs.CL
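TL;DR: This paper surveys research on computational humour generation and explanation, stressing its importance within natural language processing (NLP) and the limitations of current models, and proposes directions for future research.
Details
Motivation: Humour is a fundamental human trait, and its computational understanding and generation is among the most challenging tasks in NLP. Studying humour helps assess the common-sense knowledge and reasoning abilities of large language models (LLMs).
Contribution: A systematic survey of computational humour generation and explanation, together with a discussion of future research directions that accounts for the subjective and ethically ambiguous nature of humour.
Method: A literature survey of computational humour, focused on the generative tasks of creation and explanation, followed by proposed future directions.
Result: Although humour understanding bears the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, and state-of-the-art models still fall short of human capabilities.
Insight: Computational humour research needs to pay more attention to subjectivity and ethics, and to develop more creative, context-dependent models in order to approach human humour ability.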
Abstract: The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.
[20] Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
Xiangru Tang,Wanghan Xu,Yujie Wang,Zijie Guo,Daniel Shao,Jiapeng Chen,Cixuan Zhang,Ziyi Wang,Lixin Zhang,Guancheng Wan,Wenlong Zhang,Lei Bai,Zhenfei Yin,Philip Torr,Hanrui Wang,Di Jin
Main category: cs.CL
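TL;DR: The paper proposes a unified framework that combines implicit retrieval with structured collaboration, addressing the inefficiencies of explicit retrieval and multi-agent collaboration and reaching the highest accuracy reported to date on HLE Bio/Chem Gold.
Details
Motivation: In scientific reasoning, existing large language models suffer from explicit retrieval interrupting the reasoning flow and from strong solutions being diluted during multi-agent collaboration, so a more efficient approach is needed.
Contribution: 1. A Monitor-based retrieval module that integrates external knowledge at the token level; 2. Hierarchical Solution Refinement (HSR) and Quality-Aware Iterative Reasoning (QAIR).
Method: Combines implicit (Monitor-based) retrieval with structured collaboration (HSR, QAIR) to reduce disruption to reasoning and improve multi-agent collaboration.
Result: 48.3% accuracy on HLE Bio/Chem Gold, 13.4 points above the strongest agent baseline, while reducing token usage by 53.5% and agent steps by 43.7%.
Insight: Reasoning failures and knowledge gaps co-occur in over 85% of cases; retrieval tasks benefit from solution diversity, whereas reasoning tasks favor consensus.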
Abstract: Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden “tool tax” of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity’s Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy – the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
[21] CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Xinzhe Xu,Liang Zhao,Hongshen Xu,Chen Chen
Main category: cs.CL
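TL;DR: The paper introduces CLaw, a benchmark for evaluating large language models on Chinese legal knowledge and its application in reasoning, comprising a fine-grained legal corpus and case-based reasoning tasks. Experiments show that current models struggle to reproduce legal provisions accurately, underlining the need to combine accurate knowledge retrieval with reasoning ability.
Details
Motivation: Large language models are increasingly asked to analyze legal texts, but they usually lack a deep grasp of specialized legal knowledge, which limits their reliability. A dedicated benchmark is therefore needed to evaluate models in the legal domain.
Contribution: 1) A fine-grained Chinese legal corpus covering all 306 Chinese national statutes, segmented to the subparagraph level (64,849 entries); 2) 254 case-based reasoning instances that assess the practical application of legal knowledge.
Method: Builds the fine-grained legal corpus and the case-based reasoning tasks and evaluates contemporary LLMs on them; supervised fine-tuning (SFT) and retrieval-augmented generation (RAG) are suggested as possible ways to improve knowledge retrieval.
Result: Most contemporary large language models struggle to faithfully reproduce legal provisions, which undermines the reliability of their legal reasoning.
Insight: Reliable legal reasoning requires combining accurate legal knowledge retrieval with strong general reasoning ability; the work offers an important reference for developing domain-specific language models.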
Abstract: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timesteps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from China Supreme Court curated materials to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval–potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)–and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.
[22] Query-Centric Graph Retrieval Augmented Generation
Yaxiong Wu,Jianyuan Bo,Yongyue Zhang,Sheng Liang,Yong Liu
Main category: cs.CL
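TL;DR: QCG-RAG is a query-centric graph retrieval-augmented generation framework that improves multi-hop reasoning through controllable granularity and outperforms existing methods.
Details
Motivation: Existing graph-based RAG methods face a granularity dilemma: fine-grained entity-level graphs incur high costs and lose context, while coarse document-level graphs fail to capture nuanced relations.
Contribution: Proposes the QCG-RAG framework, enabling query-granular indexing and multi-hop chunk retrieval; query-centric graphs are built with Doc2Query, improving graph quality and interpretability.
Method: Combines Doc2Query with query-centric graph construction and a tailored multi-hop retrieval mechanism that selects relevant chunks.
Result: On LiHuaWorld and MultiHop-RAG, QCG-RAG outperforms prior methods in question answering.
Insight: Query-centric graphs provide flexible control over granularity, balancing context preservation against relation capture, and offer a new paradigm for multi-hop reasoning.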
Abstract: Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.
[23] Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication
Evgeny Kaskov,Elizaveta Petrova,Petr Surovtsev,Anna Kostikova,Ilya Mistiurin,Alexander Kapitanov,Alexander Nagaev
Main category: cs.CL
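TL;DR: This paper studies homonym duplication in diffusion models, proposes a method for measuring duplication rates and evaluating diffusion models, and explores prompt expansion as a mitigation.
Details
Motivation: Homonyms can cause ambiguity in generative models; diffusion models in particular may render several senses of a word in the same image. In addition, Anglocentric bias (a translation step into English) means that words that are not homonyms in the original language can become homonyms after translation.
Contribution: 1) A method for measuring homonym duplication rates; 2) A comparison of different diffusion models using automatic evaluation (with a VLM) and human evaluation; 3) A study of how effectively prompt expansion mitigates the problem.
Method: 1) Measures duplication rates with a vision-language model (VLM) and human evaluation; 2) Resolves homonym ambiguity through prompt expansion.
Result: Prompt expansion effectively reduces homonym duplication and also mitigates duplication related to Anglocentric bias.
Insight: The work shows how widespread the homonym problem is in diffusion models and releases the code for the automatic evaluation pipeline, which should ease future research.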
Abstract: Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLM) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.
[24] LLM Output Homogenization is Task Dependent
Shomik Jain,Jack Lanchantin,Maximilian Nickel,Karen Ullrich,Ashia Wilson,Jamelle Watson-Daniels
Main category: cs.CL
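TL;DR: This paper examines how homogenization of large language model (LLM) outputs depends on the task. It proposes a task taxonomy, a task-anchored functional diversity evaluation, and a task-anchored sampling technique that increases functional diversity while maintaining output quality.
Details
Motivation: Output homogenization can make an LLM less helpful, but what counts as homogenization, and whether it is a problem, varies by task category. Prior work does not adequately account for the differing diversity needs of tasks; this paper aims to fill that gap.
Contribution: 1. A taxonomy of eight task categories, each with its own notion of homogenization; 2. Task-anchored functional diversity as an evaluation method; 3. A task-anchored sampling technique; 4. A challenge to the assumed trade-off between diversity and quality.
Method: 1. Task taxonomy; 2. Task-anchored functional diversity evaluation; 3. Task-anchored sampling; 4. Experiments checking the balance between diversity and quality.
Result: Task-dependent methods better evaluate and mitigate output homogenization while maintaining response quality.
Insight: Task dependence is essential to evaluating and mitigating homogenization, and diversity and quality are not always in conflict.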
Abstract: A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. Whereas, for creative writing tasks, we may expect variation in key narrative components (e.g. plot, genre, setting, etc), beyond the vocabulary or embedding diversity produced by temperature-sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
[25] LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
Irina Tolstykh,Aleksandra Tsybina,Sergey Yakubson,Maksim Kuprashevich
Main category: cs.CL
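TL;DR: LLMTrace is a new large-scale bilingual (English and Russian) corpus for AI-generated text detection. It supports full-text binary classification (human vs. AI) and AI-generated interval detection, with character-level annotations.
Details
Motivation: Existing datasets rely on outdated models, cover mostly a single language, and lack mixed-authorship annotations, so they do not meet current needs for AI-generated text detection, especially for precisely localizing AI-generated segments in mixed-authorship text.
Contribution: Introduces LLMTrace, a large-scale bilingual corpus with character-level annotations that provides data for both binary classification and interval detection.
Method: Generates data with a diverse range of modern proprietary and open-source LLMs and adds character-level annotations to support fine-grained tasks.
Result: LLMTrace fills gaps in existing datasets and provides a resource for training and evaluating future AI detection models.
Insight: Character-level annotations make it possible to train and evaluate models that localize AI-generated segments in mixed-authorship text, laying the groundwork for more practical AI detection models.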
Abstract: The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at https://sweetdream779.github.io/LLMTrace-info/ (iitolstykh/LLMTrace).
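To make the interval detection task concrete, the sketch below collapses per-character labels into spans. The label encoding (one 0/1 flag per character, 1 meaning AI-written) and the `labels_to_intervals` helper are assumptions made for illustration; the abstract does not specify the corpus's actual annotation format.

```python
def labels_to_intervals(char_labels):
    """Collapse per-character flags (1 = AI-written) into half-open [start, end) spans."""
    intervals = []
    start = None
    for i, flag in enumerate(char_labels):
        if flag and start is None:
            start = i  # a new AI-written run begins here
        elif not flag and start is not None:
            intervals.append((start, i))  # the run ended just before index i
            start = None
    if start is not None:
        intervals.append((start, len(char_labels)))  # run reaches the end of the text
    return intervals


# Hypothetical mixed-authorship example: the last 11 characters are marked AI-written.
text = "Hello world, this is AI."
labels = [0] * 13 + [1] * 11

spans = labels_to_intervals(labels)
for start, end in spans:
    print((start, end), repr(text[start:end]))
```

Half-open spans are used so that `text[start:end]` returns exactly the flagged segment and adjacent spans never overlap.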
[26] Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng
Main category: cs.CL
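TL;DR: This paper analyzes theoretically how input perturbations affect Chain-of-Thought (CoT) outputs, relates the perturbation upper bound to the number of reasoning steps and to embedding norms, and validates the theory experimentally.
Details
Motivation: Prior work shows that CoT outputs are significantly affected by input perturbations, but there is no theoretical account of how these perturbations propagate, which limits improvements to prompt optimization methods.
Contribution: 1. Derives an upper bound on input perturbations under which the fluctuation of the CoT output stays within an acceptable range; 2. Proves that this bound is positively correlated with the number of reasoning steps; 3. Shows that even an infinitely long reasoning process cannot eliminate the impact of perturbations; 4. For the Linear Self-Attention (LSA) model, proves that the bound is negatively correlated with the norms of the input embedding and hidden state vectors.
Method: Analyzes theoretically the relationship between input perturbations and CoT outputs, derives a mathematical expression for the perturbation upper bound, and applies the conclusions to the LSA model to relate the bound to vector norms.
Result: Experiments on three datasets and four models agree with the theoretical analysis, supporting the stated effects of reasoning steps and embedding norms on the perturbation upper bound.
Insight: 1. More reasoning steps raise the tolerable input-perturbation bound, but only up to a limit; 2. Controlling embedding norms can help improve robustness; 3. Even infinitely long reasoning cannot fully remove the effect of perturbations.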
Abstract: Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.
[27] DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
Kin Ian Lo,Hala Hawashin,Mina Abbaszadeh,Tilen Limback-Stokin,Hadi Wazni,Mehrnoosh Sadrzadeh
Main category: cs.CL
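TL;DR: DisCoCLIP is a multimodal model that pairs a CLIP vision encoder with a novel tensor network text encoder; by explicitly encoding syntactic structure, it improves compositional reasoning on vision-language tasks.
Details
Motivation: Existing vision-language models excel at large-scale image-text alignment but underperform on tasks that depend on word order and predicate-argument structure.
Contribution: Proposes DisCoCLIP, which explicitly encodes the syntactic structure and distributional semantics of sentences and markedly improves sensitivity to verb semantics and word order.
Method: Parses sentences with a Combinatory Categorial Grammar parser to produce distributional word tensors and reduces parameter count with tensor decompositions. The text encoder is combined with a frozen CLIP vision encoder and trained end-to-end with a contrastive loss.
Result: Improves performance on SVO-Probes, ARO, and the newly introduced SVO-Swap benchmark; for example, SVO-Probes verb accuracy rises from 77.6% to 82.4%.
Insight: Embedding explicit linguistic structure via tensor networks yields parameter-efficient, interpretable representations that improve compositional reasoning.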
Abstract: Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
[28] The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
Pranjal A. Chitale,Varun Gumma,Sanchit Ahuja,Prashant Kodali,Manan Uppadhyay,Deepthi Sudharsan,Sunayana Sitaram
Main category: cs.CL
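TL;DR: The paper examines the role of synthetic data in multilingual, multicultural AI systems, using Indian languages as a case study. It proposes a data generation approach based on large open-source language models and evaluates its effect.
Details
Motivation: Building AI systems that are effective across languages and culturally grounded is a long-standing challenge, particularly for low-resource languages, and the potential of synthetic data there remains underexplored.
Contribution: Introduces Updesh, a high-quality synthetic instruction-following dataset of 9.5M data points across 13 Indian languages, with an emphasis on cultural context, long-context and multi-turn capabilities.
Method: A bottom-up strategy prompts large open-source language models (>= 235B parameters) to ground data generation in language-specific Wikipedia content, complementing the dominant top-down paradigm of translating synthetic data from high-resource languages.
Result: The generated data is rated as high quality, although human evaluation highlights areas for improvement. Fine-tuned models gain significantly on generative tasks, with the largest relative improvements in low- and medium-resource languages, narrowing the gap with high-resource languages.
Insight: Effective multilingual AI requires context-aware, culturally grounded data curation and generation strategies, in which synthetic data plays a key role.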
Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that generated data is high quality; though, human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
[29] RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Ellie Evans,Daniel Egert,Hoo-Chang Shin,Felipe Soares,Yi Dong,Oleksii Kuchaiev
Main category: cs.CL
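TL;DR: RLBFF combines human feedback with rule-based verification: it extracts binary principles from natural language feedback to train reward models, improving both performance and interpretability.
Details
Motivation: RLHF and RLVR each have limitations: RLHF relies on human judgments that lack explicit criteria, while RLVR is restricted to correctness-based verification. RLBFF aims to combine the strengths of both and provide more versatile and precise feedback.
Contribution: 1. Proposes the RLBFF framework, combining human preferences with rule-based verification; 2. Trains reward models on binary principles extracted from natural language feedback; 3. Demonstrates leading performance on RM-Bench and JudgeBench; 4. Provides an open-source recipe for aligning Qwen3-32B.
Method: RLBFF extracts principles that can be answered in a binary fashion (for example, accuracy of information or code readability) from natural language feedback and uses them to train the reward model as an entailment task: whether a response satisfies a given principle.
Result: Reward models trained with RLBFF reach 86.2% on RM-Bench and 81.4% on JudgeBench. The aligned Qwen3-32B matches or exceeds o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2.
Insight: Binary principles make the reward model more interpretable and let users customize its focus at inference time; the aligned model runs at less than 5% of the inference cost of the models it is compared against.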
Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).
[30] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Yizhou Wang,Chen Tang,Han Deng,Jiabei Xiao,Jiaqi Liu,Jianyu Wu,Jun Yao,Pengze Li,Encheng Su,Lintao Wang,Guohang Zhuang,Yuchen Ren,Ben Fei,Ming Hu,Xin Chen,Dongzhan Zhou,Junjun He,Xiangyu Yue,Zhenfei Yin,Jiamin Wu,Qihao Zheng,Yuhao Zhou,Huihui Xu,Chenglong Ma,Yan Lu,Wenlong Zhang,Chunfeng Song,Philip Torr,Shixiang Tang,Xinzhu Ma,Wanli Ouyang,Lei Bai
Main category: cs.CL
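TL;DR: SciReasoner is a scientific reasoning foundation model. Through pretraining on heterogeneous scientific data and reinforcement learning, it handles scientific tasks across disciplines, including translation between text and scientific formats, knowledge extraction, and property prediction.
Details
Motivation: Scientific data and tasks are highly heterogeneous and complex, and conventional methods struggle to handle them in a unified way. SciReasoner aims to build a cross-disciplinary scientific reasoning foundation model that can understand and process scientific data broadly.
Contribution: A general model that supports many scientific tasks, combining pretraining, instruction tuning, and reinforcement learning to broaden task coverage and improve cross-domain generalization; the model and datasets are open-sourced.
Method: The model is pretrained on scientific text and heterogeneous scientific representations, then aligned and optimized with instruction tuning and reinforcement learning to support a range of scientific tasks.
Result: Compared with specialist systems, SciReasoner offers broader task coverage and better cross-domain generalization while improving fidelity.
Insight: Cross-disciplinary learning strengthens transfer and improves reliability on downstream tasks, pointing to the potential of general-purpose scientific models.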
Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports five capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
cs.CV [Back]
[31] Leveraging NTPs for Efficient Hallucination Detection in VLMs
Ofir Azachi,Kfir Eliyahu,Eyal El Ani,Rom Himelstein,Roi Reichart,Yuval Pinter,Nitay Calderon
Main category: cs.CV
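TL;DR: The paper proposes an efficient hallucination detection method based on next-token probabilities (NTPs): traditional machine learning models are trained on NTP signals, cutting computational overhead and improving detection efficiency.
Details
Motivation: Hallucinations in vision-language models (VLMs), where generated content does not match the visual input, seriously undermine their reliability. Existing detection methods typically rely on the VLM itself or on another model, which is computationally expensive and adds latency.
Contribution: 1. A lightweight NTP-based hallucination detection method; 2. A dataset of 1,400 human-annotated statements; 3. Further gains from combining NTPs with linguistic NTPs.
Method: Trains traditional ML models on NTP signals produced by the VLM, exploiting the association between low NTP values (high uncertainty) and hallucinations. Linguistic NTPs and VLM prediction scores are added for further improvement.
Result: NTP features are useful predictors of hallucinations, and lightweight ML models perform comparably to strong VLMs. Adding linguistic NTPs and VLM scores improves performance further.
Insight: As an uncertainty signal, NTPs capture hallucinations efficiently; paired with simple ML models they can stand in for heavier VLM-based detectors, offering a lightweight route to more reliable VLMs.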
Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM’s next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
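As a rough illustration of what signals based on next-token probabilities might look like, the sketch below turns a list of per-token probabilities into a few summary features that a traditional classifier could consume. The specific feature set, the 0.1 threshold, and the `ntp_features` helper are assumptions for illustration; the abstract does not list the features the authors use.

```python
import math


def ntp_features(token_probs, low_threshold=0.1):
    """Summarize the next-token probabilities of one generated statement.

    token_probs: probability the model assigned to each token it actually emitted.
    low_threshold: cutoff for counting a token as "low confidence" (assumed value).
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Clamp to avoid log(0) for tokens with vanishing probability.
    log_probs = [math.log(max(p, 1e-12)) for p in token_probs]
    n = len(token_probs)
    return {
        "min_prob": min(token_probs),
        "mean_prob": sum(token_probs) / n,
        "mean_log_prob": sum(log_probs) / n,
        "frac_low": sum(1 for p in token_probs if p < low_threshold) / n,
    }


# Made-up probabilities for two statements: one confident, one uncertain.
confident = ntp_features([0.95, 0.90, 0.88, 0.97])
uncertain = ntp_features([0.90, 0.50, 0.05])
print(confident)
print(uncertain)
```

Under the paper's hypothesis, statements with a low `min_prob` or a high `frac_low` would be more likely to be hallucinated; in practice these feature vectors would be fed to a small classifier trained on labeled statements.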
[32] Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification
Elias N. Zois,Moises Diaz,Salem Said,Miguel A. Ferrer
Main category: cs.CV
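TL;DR: This paper proposes a quasi-synthetic data generation method based on Riemannian geometry for writer-independent offline signature verification; synthetic samples are generated in the space of symmetric positive definite (SPD) matrices, yielding low verification error rates.
Details
Motivation: Offline signature verification still generalizes poorly in writer-independent settings, and conventional methods depend on real datasets for training. This paper uses Riemannian geometry to generate synthetic data to address the problem.
Contribution: A quasi-synthetic data generation framework in SPD space that uses a Riemannian Gaussian mixture model to generate positive and negative samples, improving writer-independent signature verification.
Method: Generates synthetic data with a Riemannian Gaussian mixture model in SPD space, trains a metric learning framework on pairs of samples, and evaluates on real datasets.
Result: Experiments on two signature datasets show low error rates under both intra-dataset and cross-dataset evaluation.
Insight: Synthetic data generation in Riemannian spaces can support writer-independent signature verification while reducing the dependence on real data.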
Abstract: Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past or present, handcrafted or data-driven methods usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite (SPD) matrices. A small set of genuine samples in the SPD space is the seed to a Riemannian Gaussian Mixture which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework is trained on pairs of similar and dissimilar SPD points and is subsequently tested on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross-dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.
[33] Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team Seedream,Yunpeng Chen,Yu Gao,Lixue Gong,Meng Guo,Qiushan Guo,Zhiyao Guo,Xiaoxia Hou,Weilin Huang,Yixuan Huang,Xiaowen Jian,Huafeng Kuang,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yanzuo Lu,Zhengxiong Luo,Tongtong Ou,Guang Shi,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Wenxu Wu,Yonghui Wu,Xin Xia,Xuefeng Xiao,Shuang Xu,Xin Yan,Ceyuan Yang,Jianchao Yang,Zhonghua Zhai,Chenlin Zhang,Heng Zhang,Qi Zhang,Xinyu Zhang,Yuwei Zhang,Shijia Zhao,Wenliang Zhao,Wenjia Zhu
Main category: cs.CV
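TL;DR: Seedream 4.0 is an efficient multimodal image generation system that unifies text-to-image (T2I) generation, image editing, and multi-image composition. It uses a diffusion transformer and an optimized VAE for efficient training and high-resolution generation, improves performance through adversarial distillation and multimodal post-training, and reaches state-of-the-art results.
Details
Motivation: Conventional image generation systems usually handle text-to-image generation or image editing separately and lack multimodal and interactive capabilities. Seedream 4.0 aims to provide a unified framework that supports multiple generation tasks with better efficiency and performance.
Contribution: 1. An efficient diffusion transformer and an optimized VAE that considerably reduce the number of image tokens; 2. Adversarial distillation and multimodal post-training to improve performance and speed; 3. Extended generation capabilities, supporting multi-image reference and multiple output images.
Method: 1. Uses a diffusion transformer with an efficient VAE to reduce the token count; 2. Jointly trains T2I and image editing tasks through multimodal post-training; 3. Accelerates inference with adversarial distillation, distribution matching, quantization, and speculative decoding.
Result: Seedream 4.0 reaches state-of-the-art results on T2I and multimodal image editing, generates a 2K image with an inference time of up to 1.8 seconds, and shows strong multimodal and interactive capabilities.
Insight: With a unified framework and efficient training, Seedream 4.0 moves beyond the limits of conventional generation tasks and offers a more flexible tool for creative and professional applications.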
Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.
[34] A Contrastive Learning Framework for Breast Cancer Detection
Samia Saeed,Khuram Naveed
Main category: cs.CV
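TL;DR: The paper proposes a contrastive learning framework that improves breast cancer detection by combining a small labeled dataset with a large amount of unlabeled data, reaching 96.7% accuracy on benchmark datasets.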
Details
Motivation: Breast cancer is the second leading cause of cancer-related deaths worldwide, and early detection is critical for treatment. Traditional computer-aided detection systems rely on labeled data, and deep learning performs poorly when labeled data is scarce. Contrastive learning offers a way to address this. Contribution: 1. Proposes a semi-supervised contrastive learning framework that uses unlabeled data to improve model performance. 2. Applies contrastive learning to ResNet-50 together with data augmentation and transformations, markedly improving detection accuracy. 3. Achieves 96.7% accuracy on the INbreast and MIAS datasets, surpassing existing methods.
Method: 1. Train ResNet-50 with a semi-supervised contrastive learning approach, using unlabeled mammograms to optimize the model. 2. Apply a variety of augmentations and transformations to improve robustness. 3. Fine-tune on a small labeled dataset.
Result: On the INbreast and MIAS benchmark datasets, the model reaches 96.7% accuracy, outperforming existing methods.
Insight: Contrastive learning performs well when labeled data is limited; leveraging unlabeled data can markedly improve generalization and accuracy. The study offers an effective answer to annotation scarcity in medical image analysis.
Abstract: Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems, which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. We train ResNet-50 with a semi-supervised CL approach, using a similarity index, on a large amount of unlabeled mammogram data. We also use various augmentations and transformations, which help improve the performance of our approach. Finally, we fine-tune our model on a small set of labeled data, and it outperforms the existing state of the art. Specifically, we observed a 96.7% accuracy in detecting breast cancer on the benchmark datasets INbreast and MIAS.
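The abstract does not say which contrastive objective or "similarity index" is used. As a rough illustration only, the sketch below assumes cosine similarity and the widely used InfoNCE loss for a single anchor; the function names, the temperature value, and the toy vectors are all hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor (assumed objective, not the paper's).

    The loss is small when the anchor is closer to its positive (another
    augmented view of the same image) than to the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-D embeddings: the positive matches the anchor, the negative is orthogonal.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Swapping the roles makes the "positive" the dissimilar view, so the loss grows.
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss over many unlabeled image pairs pulls augmented views of the same mammogram together, which is the general idea behind pretraining before fine-tuning on the small labeled set.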
[35] Are Foundation Models Ready for Industrial Defect Recognition? A Reality Check on Real-World Data
Simon Baeuerle,Pratik Khanna,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Damir Shakirov,Andreas Steimer,Ralf Mikut
Main category: cs.CV
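TL;DR: This paper examines whether foundation models (FMs) are suitable for industrial defect recognition and finds that, despite strong performance on public datasets, they perform poorly on real-world industrial data.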
Details
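Motivation: Industrial defect recognition requires large amounts of labeled data and custom models, which is time-consuming and labor-intensive. Foundation models, with their transferability and zero-shot capability, could simplify this process. Contribution: Experimentally demonstrates the limitations of foundation models on real-world industrial data and exposes the gap relative to their performance on public datasets.
Method: Tests multiple recent foundation models on custom industrial image data and on public datasets.
Result: All tested foundation models fail on the real-world industrial data while performing well on public datasets.
Insight: Applying foundation models to industrial defect recognition still requires further work; existing models do not adapt well to real-world data.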
Abstract: Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are being evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt to describe anomalies and utilizing the same models across many products would save significant efforts during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark datasets.
[36] Shared Neural Space: Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
Jing Li,Oskar Bartosz,Chengyu Wang,Michal Wnuczynski,Dilshan Godaliyadda,Michael Polley
Main category: cs.CV
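TL;DR: The paper proposes a universal Neural Space (NS) in which an encoder-decoder framework precomputes features across vision and imaging tasks, enabling efficient feature sharing across tasks and domains.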
Details
Motivation: Most current AI models are tailored to a single task, which is inefficient in multi-task settings. The paper addresses this with a unified feature encoding approach. Contribution: Proposes a lightweight CNN encoder-decoder framework (Neural Space) that supports feature sharing across tasks and domains, reducing redundancy and improving generalization.
Method: Uses an encoder-decoder architecture to precompute a general feature representation so that multiple downstream task modules share the same feature space.
Result: Experiments show the approach efficiently supports vision tasks such as demosaicing, denoising, depth estimation, and semantic segmentation, and is broadly applicable across hardware.
Insight: A unified feature space provides efficient infrastructure for multi-task vision systems while avoiding the complexity of conventional Transformer models.
Abstract: The majority of AI models in imaging and vision are customized to perform a specific high-precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation and semantic segmentation, can be performed efficiently in the NS.
[37] InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On
Julien Han,Shuwen Qiu,Qi Li,Xingzi Xu,Mehmet Saygin Seyfioglu,Kavosh Asadi,Karim Bouyarmane
Main category: cs.CV
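TL;DR: InstructVTON is an interactive virtual try-on system guided by natural language; it addresses the limitations of conventional mask-based try-on through automatic binary mask generation and language-guided style control.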
Details
Motivation: Conventional mask-based virtual try-on requires users to draw precise masks and cannot handle some complex styling scenarios (e.g., rolled-up sleeves). InstructVTON simplifies the user experience and extends try-on capabilities through automated mask generation and natural-language interaction. Contribution: 1) Introduces a natural-language-guided interactive virtual try-on system; 2) Automates mask generation with vision language models and image segmentation models; 3) Interoperates with existing try-on models to achieve complex styling control.
Method: 1) Formulate virtual try-on as an image-guided inpainting task; 2) Combine VLMs and segmentation models to generate binary masks; 3) Guide style generation with natural-language instructions.
Result: The system achieves SOTA results in complex styling control and user interaction without manually drawn masks.
Insight: Combining natural language with vision models can markedly improve the user experience and extend the functionality of virtual try-on, especially for complex styling control.
Abstract: We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on casts the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult, requires background knowledge, might be model dependent, and is in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with “sleeves rolled up” styling on a person wearing a long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.
[38] Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition
Dana A Abdullah,Dana Rasul Hamad,Bishar Rasheed Ibrahim,Sirwan Abdulwahid Aula,Aso Khaleel Ameen,Sabat Salih Hamadamin
Main category: cs.CV
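TL;DR: The paper proposes DeepAFRNet, a VGG16-based deep learning architecture for improved altered fingerprint recognition that performs strongly across the difficulty levels of the SOCOFing dataset.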
Details
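Motivation: Altered fingerprint recognition (AFR) is challenging in applications such as border control, forensics, and fiscal admission. Existing methods struggle with real altered fingerprints and lack evaluation across difficulty levels. Contribution: 1) Proposes DeepAFRNet, which combines VGG16 with cosine similarity to recognize altered fingerprints; 2) Uses the real altered fingerprints in SOCOFing and evaluates performance per difficulty level.
Method: Uses VGG16 as the backbone to extract high-dimensional features, compares fingerprint embeddings with cosine similarity, and evaluates recognition rigorously at each difficulty level.
Result: On SOCOFing, DeepAFRNet achieves 96.7%, 98.76%, and 99.54% accuracy on the Easy, Medium, and Hard levels, respectively; a threshold-sensitivity study shows that threshold selection is critical to performance.
Insight: Recognizing real altered fingerprints requires strict threshold control, and per-level evaluation better reflects a model's practical utility, offering guidance for deployment.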
Abstract: Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
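The matching step described above (compare two embeddings by cosine similarity, then accept or reject against a threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the 4-dimensional vectors are made-up stand-ins for VGG16 features, and `is_match` is a hypothetical helper. Only the two threshold values, 0.92 and 0.72, come from the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def is_match(probe, reference, threshold=0.92):
    """Accept the pair only if the similarity reaches the threshold."""
    return cosine_similarity(probe, reference) >= threshold

# Toy embeddings (hypothetical values, not drawn from SOCOFing).
reference = [0.9, 0.1, 0.3, 0.2]
other_finger = [0.6, 0.5, 0.2, 0.3]

score = cosine_similarity(reference, other_finger)
strict = is_match(reference, other_finger, threshold=0.92)   # rejected
relaxed = is_match(reference, other_finger, threshold=0.72)  # accepted
```

In this toy case the pair scores between the two thresholds, so the strict setting rejects it and the relaxed setting accepts it, which is the kind of behavior change a threshold-sensitivity study probes.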
[39] Large Pre-Trained Models for Bimanual Manipulation in 3D
Hanna Yurchyk,Wei-Di Chang,Gregory Dudek,David Meger
Main category: cs.CV
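TL;DR: The paper proposes integrating attention maps from a pre-trained Vision Transformer (ViT) into voxel representations to improve bimanual robotic manipulation.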
Details
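Motivation: To strengthen bimanual robotic manipulation in 3D, the authors explore how attention maps from pre-trained models can supply pixel-level saliency information and thereby improve behavior cloning policies. Contribution: Converts DINOv2 attention maps into voxel-level semantic cues and integrates them into a voxel-based policy, markedly improving performance on bimanual manipulation tasks.
Method: Extract attention maps from DINOv2, treat them as pixel-level saliency scores, lift them into a 3D voxel grid, and incorporate the result into a behavior cloning policy.
Result: On the RLBench bimanual benchmark, the method yields an average absolute improvement of 8.2% and a relative gain of 21.9%.
Insight: Attention maps from pre-trained models can provide valuable semantic cues for robotic manipulation and improve performance on complex tasks.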
Abstract: We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
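The abstract does not spell out how the 2D saliency maps are lifted into the voxel grid. One plausible reading, sketched below under stated assumptions, back-projects each pixel with a depth map and pinhole intrinsics and keeps the maximum saliency per voxel. The function name, the intrinsics `fx, fy, cx, cy`, the use of depth, and the max-pooling rule are all assumptions made for illustration.

```python
import math

def lift_saliency_to_voxels(saliency, depth, fx, fy, cx, cy, voxel_size):
    """Back-project per-pixel saliency into a sparse voxel grid.

    saliency, depth: 2D lists of equal shape (rows of pixels).
    Returns a dict mapping a voxel index (i, j, k) to the maximum
    saliency of any pixel whose 3D point falls inside that voxel.
    """
    voxels = {}
    for v, (sal_row, depth_row) in enumerate(zip(saliency, depth)):
        for u, (s, z) in enumerate(zip(sal_row, depth_row)):
            if z <= 0:  # skip pixels without a valid depth reading
                continue
            # Pinhole back-projection into camera coordinates.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            key = (math.floor(x / voxel_size),
                   math.floor(y / voxel_size),
                   math.floor(z / voxel_size))
            voxels[key] = max(voxels.get(key, 0.0), s)
    return voxels

# Toy 2x2 image: one pixel has no depth; two pixels share a voxel.
saliency = [[0.1, 0.9], [0.4, 0.2]]
depth = [[1.0, 1.0], [0.0, 2.0]]
grid = lift_saliency_to_voxels(saliency, depth, fx=1.0, fy=1.0,
                               cx=0.0, cy=0.0, voxel_size=2.0)
```

A real pipeline would fuse several calibrated camera views into a shared world frame before voxelizing; the single-camera version here only shows the pixel-to-voxel bookkeeping.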
[40] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management
Xinyang Mu,Yuzhen Lu,Boyang Deng
Main category: cs.CV
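TL;DR: This paper presents a comparative benchmark of real-time object detectors for blueberry detection, evaluating 36 model variants from the YOLO and RT-DETR families on a newly curated dataset. The results show the accuracy-speed trade-off across models and explore the potential of semi-supervised learning for further gains.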
Details
Motivation: Blueberry detection in natural environments is challenged by variable lighting, occlusion, and motion blur. Deep learning object detectors need large, diverse datasets to reach high accuracy, and practical deployment must balance accuracy, speed, and memory use. Contribution: 1) A newly curated blueberry detection dataset with 85,879 labeled instances; 2) A comprehensive comparison of 36 model variants from the YOLO and RT-DETR families; 3) An exploration of how semi-supervised learning improves model performance.
Method: 1) Collect data with smartphones and a machine vision platform; 2) Evaluate the accuracy and speed of YOLO and RT-DETR models; 3) Fine-tune the models with Unbiased Mean Teacher semi-supervised learning.
Result: YOLOv12m and RT-DETRv2-X reach mAP@50 of 93.3% and 93.6%, respectively; after semi-supervised learning, RT-DETRv2-X improves further to 94.8%. Mid-sized models offer a good balance between accuracy and speed.
Insight: Semi-supervised learning still has room to improve in exploiting cross-domain unlabeled data, and mid-sized models may be the best choice in practice.
Abstract: Blueberry detection in natural environments remains challenging due to variable lighting, occlusions, and motion blur caused by environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance. To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy gains ranging from -1.4% to 2.9%, with RT-DETRv2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.
[41] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation
Farbod Bigdeli,Mohsen Mohammadagha,Ali Bigdeli
Main category: cs.CV
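TL;DR: The paper proposes a lightweight region-of-interest (ROI) augmentation strategy for mammography classification and demonstrates its effectiveness on the Mini-DDSM dataset, without requiring additional labels or architectural changes.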
Details
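Motivation: Mammography is central to early breast cancer screening, but existing deep learning models are held back by limited-resolution datasets and small sample sizes. The paper aims to improve classification with a simple ROI augmentation strategy. Contribution: Proposes a lightweight ROI augmentation method that improves mammography classification by randomly cropping and substituting ROI regions while leaving inference cost unchanged.
Method: During training, full images are replaced with a given probability by crops from a precomputed, label-free ROI bank (with optional jitter to increase variability), and performance is evaluated under patient-level cross-validation.
Result: On Mini-DDSM, ROI augmentation (best parameters: p_roi = 0.10, alpha = 0.10) yields modest ROC-AUC gains, while PR-AUC is flat to slightly lower.
Insight: Simple, data-centric ROI augmentation can improve model performance in resource-constrained settings without additional labels or architectural changes.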
Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
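The training-time replacement described above can be sketched roughly as follows. This is an illustration under assumptions rather than the authors' implementation: images are plain nested lists, boxes are `(top, left, height, width)` tuples that fit inside the image, and `alpha` is interpreted as the maximum jitter expressed as a fraction of the box size. The abstract reports `p_roi` and `alpha` but does not define `alpha`.

```python
import random

def roi_augment(image, roi_bank, p_roi=0.10, alpha=0.10, rng=random):
    """With probability p_roi, return a jittered ROI crop instead of the image.

    image: 2D list of pixel rows.
    roi_bank: precomputed list of (top, left, height, width) boxes.
    alpha: assumed meaning, maximum shift as a fraction of the box size.
    """
    if not roi_bank or rng.random() >= p_roi:
        return image  # the common case: keep the full image
    top, left, height, width = rng.choice(roi_bank)
    # Jitter the box position, then clamp it so it stays inside the image.
    dy = int(round(rng.uniform(-alpha, alpha) * height))
    dx = int(round(rng.uniform(-alpha, alpha) * width))
    rows, cols = len(image), len(image[0])
    top = min(max(top + dy, 0), rows - height)
    left = min(max(left + dx, 0), cols - width)
    return [row[left:left + width] for row in image[top:top + height]]

# Toy 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
roi_bank = [(2, 2, 4, 4)]
crop = roi_augment(image, roi_bank, p_roi=1.0, rng=random.Random(0))
```

Because the substitution happens only inside the training data loader, the model and the inference path are untouched, which matches the abstract's point that inference-time cost stays the same.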
[42] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections
Jing Wu,Zirui Wang,Iro Laina,Victor Adrian Prisacariu
Main category: cs.CV
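TL;DR: This paper proposes Reflect3r, a method for single-view 3D stereo reconstruction using mirror reflections; it treats the reflection as an auxiliary view and constructs a virtual camera, simplifying image capture and improving reconstruction.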
Details
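Motivation: Mirror reflections, common in everyday environments, provide additional stereo information that can simplify conventional multi-view stereo capture and improve the generalizability and robustness of 3D reconstruction. Contribution: 1. A single-view 3D reconstruction framework based on mirror reflections; 2. A transformation that constructs a physically valid virtual camera; 3. A symmetric-aware loss to refine pose estimation; 4. Support for efficient per-frame geometry recovery in dynamic scenes; 5. A customizable synthetic dataset for evaluation.
Method: 1. Treat the mirror reflection as an auxiliary view and construct a virtual camera; 2. Generate the virtual view directly in the pixel domain; 3. Refine pose estimation with a symmetric-aware loss; 4. Extend the framework to dynamic scenes.
Result: Extensive experiments on real-world and synthetic data demonstrate the method's effectiveness.
Insight: Mirror reflections are a strong aid for single-view 3D reconstruction; combining the virtual view with multi-view stereo methods greatly simplifies the conventional capture process.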
Abstract: Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process, making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.
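The geometric fact behind the virtual camera is that a planar mirror maps every point to its reflection across the mirror plane, so the reflected view looks as if it were taken from the mirrored camera position. The sketch below shows only this standard planar reflection; it is not the paper's transformation, and the plane parameters and helper name are hypothetical.

```python
def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane {x : normal . x = offset}.

    normal must be unit length. Uses x' = x - 2 (n . x - d) n.
    """
    signed_dist = sum(n * p for n, p in zip(normal, point)) - offset
    return [p - 2.0 * signed_dist * n for p, n in zip(point, normal)]

# Hypothetical setup: real camera at the origin, mirror on the plane z = 2.
mirror_normal = [0.0, 0.0, 1.0]
mirror_offset = 2.0
real_camera = [0.0, 0.0, 0.0]

virtual_camera = reflect_point(real_camera, mirror_normal, mirror_offset)
# Reflecting twice returns the original position.
round_trip = reflect_point(virtual_camera, mirror_normal, mirror_offset)
```

A reflection also flips handedness, so a full virtual-camera model has to account for the mirrored image orientation as well; that part is outside this sketch.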
[43] Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery
Yiming Xiao,Archit Gupta,Miguel Esparza,Yu-Hsuan Ho,Antonia Sebastian,Hannah Weas,Rose Houck,Ali Mostafavi
Main category: cs.CV
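TL;DR: The paper presents FacadeTrack, a framework that combines street-view imagery with vision-language models to assess building occupancy during post-disaster recovery. With a two-stage design, it achieves a decision strategy with high precision and high recall.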
Details
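Motivation: Post-disaster building occupancy assessment is vital for resource allocation and public safety, but existing sources such as overhead imagery miss facade and entrance details. Street-view imagery captures those details but is sparse and hard to align with parcels. Contribution: 1. Proposes FacadeTrack, which combines street-view imagery with language models to produce interpretable attributes (e.g., entry blockage, temporary coverings). 2. Provides two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from reasoning. 3. Validates the approach on real post-disaster surveys.
Method: 1. Link panoramic video to parcels and rectify views to facades. 2. Extract interpretable attributes with a vision-language model. 3. Decide with either the one-stage rule or the two-stage design, which separates perception from conservative reasoning.
Result: On post-hurricane surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F1 score of 0.848, outperforming the one-stage baseline on recall and F1 (precision 0.943, recall 0.728, F1 0.822).
Insight: 1. Intermediate attributes and spatial diagnostics reveal where errors originate, enabling targeted quality control. 2. The framework is scalable and auditable, suited to integration with geospatial and emergency-management workflows.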
Abstract: Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F-1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F-1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.
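The reported F-1 scores follow from the reported precision and recall through the harmonic mean, F1 = 2PR / (P + R). The short check below recomputes both values; the helper name is just for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
two_stage = round(f1_score(0.927, 0.781), 3)
one_stage = round(f1_score(0.943, 0.728), 3)
```

The two-stage design gives up a little precision (0.927 versus 0.943) for a larger recall gain (0.781 versus 0.728), and because the harmonic mean rewards balance, its F-1 score comes out higher.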
[44] Human Semantic Representations of Social Interactions from Moving Shapes
Yiling Yun,Hongjing Lu
Main category: cs.CV
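TL;DR: The paper examines how humans recognize social interactions from simple moving shapes and analyzes how semantic representations complement visual features. Semantic models, especially verb-based embeddings, explain human judgments better.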
Details
Motivation: Humans can recognize social interactions from simple moving shapes, but prior work has focused mostly on visual features and has not analyzed semantic representations. This paper aims to fill that gap. Contribution: Shows the role of semantic structure in perceiving social interactions, in particular the advantage of verb-based embeddings in explaining human similarity judgments.
Method: Two studies: 1) participants label animations based on their impression of the moving shapes; 2) the representational geometry of 27 social interactions is measured and compared with model predictions based on visual features, labels, and semantic embeddings.
Result: Semantic models complement visual features, and verb-based embeddings explain human judgments best.
Insight: Perceiving social interactions depends not only on visual features but also on semantic structure, particularly the categorical information carried by verbs.
Abstract: Humans are social creatures who readily recognize various social interactions from simple displays of moving shapes. While previous research has often focused on visual features, we examine what semantic representations humans employ to complement visual features. In Study 1, we directly asked human participants to label the animations based on their impression of moving shapes. We found that human responses were distributed. In Study 2, we measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings from animation descriptions. We found that semantic models provided complementary information to visual features in explaining human judgments. Among the semantic models, verb-based embeddings extracted from descriptions best account for human similarity judgments. These results suggest that social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.
[45] Enhancing Cross-View Geo-Localization Generalization via Global-Local Consistency and Geometric Equivariance
Xiaowei Wang,Di Wang,Ke Li,Yifeng Wang,Chengjian Wang,Libin Sun,Zhihong Wu,Yiming Zhang,Quan Wang
Main category: cs.CV
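TL;DR: The paper proposes EGS, a new cross-view geo-localization framework that improves cross-domain generalization through global-local consistency and geometric equivariance.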
Details
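Motivation: Address the severe appearance variations caused by differing UAV orientations and fields of view in cross-view geo-localization, and establish reliable global and local correspondences. Contribution: Introduces an E(2)-steerable CNN encoder and a graph with a virtual super-node, improving feature stability and global-local consistency and substantially outperforming existing methods.
Method: Uses an E(2)-steerable CNN to extract features that remain stable under rotation and viewpoint shifts, and a virtual super-node graph to aggregate global semantics and enforce local consistency.
Result: Performs strongly on the University-1652 and SUES-200 benchmarks, setting a new SOTA in cross-domain geo-localization.
Insight: Geometric equivariance and global-local consistency are key to improving cross-domain generalization.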
Abstract: Cross-view geo-localization (CVGL) aims to match images of the same location captured from drastically different viewpoints. Despite recent progress, existing methods still face two key challenges: (1) achieving robustness under severe appearance variations induced by diverse UAV orientations and fields of view, which hinders cross-domain generalization, and (2) establishing reliable correspondences that capture both global scene-level semantics and fine-grained local details. In this paper, we propose EGS, a novel CVGL framework designed to enhance cross-domain generalization. Specifically, we introduce an E(2)-Steerable CNN encoder to extract stable and reliable features under rotation and viewpoint shifts. Furthermore, we construct a graph with a virtual super-node that connects to all local nodes, enabling global semantics to be aggregated and redistributed to local regions, thereby enforcing global-local consistency. Extensive experiments on the University-1652 and SUES-200 benchmarks demonstrate that EGS consistently achieves substantial performance gains and establishes a new state of the art in cross-domain CVGL.
[46] DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection
Jiayi Zuo,Songwei Pei,Qian Li
Main category: cs.CV
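TL;DR: DENet is a dual-path edge network that uses global-local attention and multi-scale edge refinement to address feature misalignment and noise suppression in infrared small target detection.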
Details
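Motivation: Infrared small target detection performs poorly against complex, noisy backgrounds, mainly because targets lack distinctive texture and morphological features and because high-resolution spatial detail conflicts with semantic context. Existing methods struggle to extract target edges accurately under low contrast. Contribution: 1. A dual-path network that handles edge enhancement and semantic modeling separately; 2. A Bidirectional Interaction Module (BIM) that combines local and global self-attention; 3. A Multi-Edge Refiner (MER) that uses Taylor finite difference operators to sharpen edge detail.
Method: 1. Capture multi-scale features with the Bidirectional Interaction Module (BIM), which combines local and global self-attention; 2. Enhance edge detail and suppress noise with the Multi-Edge Refiner (MER) and an attention-driven gating mechanism.
Result: DENet performs well on infrared small target detection, achieving precise target localization and noise suppression.
Insight: Combining global semantics with local edge information, and using mathematical operators such as Taylor finite differences to improve edge detection, offers a new approach to small target detection.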
Abstract: Infrared small target detection is crucial for remote sensing applications like disaster warning and maritime surveillance. However, due to the lack of distinctive texture and morphological features, infrared small targets are highly susceptible to blending into cluttered and noisy backgrounds. A fundamental challenge in designing deep models for this task lies in the inherent conflict between capturing high-resolution spatial details for minute targets and extracting robust semantic context for larger targets, often leading to feature misalignment and suboptimal performance. Existing methods often rely on fixed gradient operators or simplistic attention mechanisms, which are inadequate for accurately extracting target edges under low contrast and high noise. In this paper, we propose a novel Dual-Path Edge Network that explicitly addresses this challenge by decoupling edge enhancement and semantic modeling into two complementary processing paths. The first path employs a Bidirectional Interaction Module, which uses both Local Self-Attention and Global Self-Attention to capture multi-scale local and global feature dependencies. The global attention mechanism, based on a Transformer architecture, integrates long-range semantic relationships and contextual information, ensuring robust scene understanding. The second path introduces the Multi-Edge Refiner, which enhances fine-grained edge details using cascaded Taylor finite difference operators at multiple scales. This mathematical approach, along with an attention-driven gating mechanism, enables precise edge localization and feature enhancement for targets of varying sizes, while effectively suppressing noise. Our method provides a promising solution for precise infrared small target detection and localization, combining structural semantics and edge refinement in a unified framework.
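The abstract does not give the exact stencils used in the Multi-Edge Refiner. To illustrate what a Taylor finite difference operator is, the sketch below applies the textbook second-order and fourth-order central-difference stencils for the first derivative to a 1D signal with unit spacing. The function name and the choice to zero the borders are assumptions for this example, not details from the paper.

```python
def central_diff(signal, order=2):
    """First derivative of a unit-spaced 1D signal via central differences.

    order=2 uses (f[i+1] - f[i-1]) / 2.
    order=4 uses (f[i-2] - 8 f[i-1] + 8 f[i+1] - f[i+2]) / 12.
    Border samples without a full stencil are left at 0.0.
    """
    if order == 2:
        stencil = {-1: -0.5, 1: 0.5}
    elif order == 4:
        stencil = {-2: 1 / 12, -1: -8 / 12, 1: 8 / 12, 2: -1 / 12}
    else:
        raise ValueError("only orders 2 and 4 are implemented")
    radius = max(stencil)
    out = [0.0] * len(signal)
    for i in range(radius, len(signal) - radius):
        out[i] = sum(w * signal[i + k] for k, w in stencil.items())
    return out

# A quadratic ramp f(x) = x^2 has derivative 2x, which both stencils recover.
ramp = [float(x * x) for x in range(7)]
d2 = central_diff(ramp, order=2)
d4 = central_diff(ramp, order=4)

# A step edge produces a localized response around the transition.
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
edge = central_diff(step, order=2)
```

Higher-order stencils draw on a wider neighborhood for a more accurate derivative estimate, which is one way to read the abstract's mention of operators applied at multiple scales.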
[47] Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset
Ruixu Zhang,Yuran Wang,Xinyi Hu,Chaoyu Mai,Wenxuan Liu,Danni Xu,Xian Zhong,Zheng Wang
Main category: cs.CV
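TL;DR: The paper introduces the Group Intention Forecasting (GIF) task and SHOT, the first large-scale dataset for it, and develops the GIFT framework, which forecasts the emergence of group intentions by analyzing individual actions and group dynamics.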
Details
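Motivation: Conventional intention recognition focuses on individual intentions and overlooks complex, dynamic collective intentions in group settings. The authors propose group intention forecasting to fill this gap. Contribution: 1. Introduces group intention and the GIF task; 2. Releases SHOT, the first large-scale dataset, supporting multi-view, multi-level intention analysis; 3. Proposes the GIFT framework, which models individual features together with group dynamics to forecast intentions.
Method: GIFT extracts fine-grained individual features and models evolving group dynamics to forecast when group intentions emerge. SHOT consists of basketball video clips annotated with individual attributes and intentions, and supports multi-view adaptability.
Result: Experiments confirm the effectiveness of SHOT and GIFT, laying a foundation for research on group intention forecasting.
Insight: Forecasting collective intentions requires considering both individual actions and the changing dynamics of group interaction; SHOT offers diverse scenarios for future research.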
Abstract: Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.
[48] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
Yu Guo,Shengfeng He,Yuxu Lu,Haonan An,Yihang Tao,Huilin Zhu,Jingxian Liu,Yuguang Fang
Main category: cs.CV
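TL;DR: Neptune-X is a framework that combines a generative model with active sampling: it synthesizes diverse maritime scenes and dynamically selects task-relevant samples, addressing the scarcity of annotated data and poor generalization in maritime object detection.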
Details
Motivation: Maritime object detection is essential for navigation safety and autonomous operations but suffers from scarce annotated data and poor generalization across diverse maritime scenes. Models trained on existing datasets underperform in some scenarios, such as the open sea. Contribution: 1) X-to-Maritime, a generative model whose Bidirectional Object-Water Attention module improves the visual fidelity of synthetic data; 2) Attribute-correlated Active Sampling, which dynamically selects task-relevant samples; 3) The first benchmark dataset for maritime generation.
Method: A multi-modality-conditioned generative model synthesizes diverse maritime scenes, with a Bidirectional Object-Water Attention module capturing boundary interactions between objects and water. Attribute-correlated Active Sampling then selects task-relevant samples for training.
Result: Experiments show the approach sets a new benchmark in maritime scene synthesis and significantly improves detection accuracy, especially in challenging scenarios.
Insight: By pairing generation with selection, Neptune-X offers an effective approach for data-scarce domains and underscores the importance of task-relevant sample selection.
Abstract: Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.
[49] AI-Enabled Crater-Based Navigation for Lunar Mapping
Sofia McLeod,Chee-Kheng Chng,Matthew Rodda,Tat-Jun Chin
Main category: cs.CV
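TL;DR: The paper presents STELLA, the first end-to-end crater-based navigation (CBN) system for long-duration lunar mapping missions; it combines Mask R-CNN with other modules to achieve accurate pose estimation across varied illumination and viewing angles.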
Details
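Motivation: Existing CBN work focuses on short, nadir-view descent and landing missions and cannot meet the needs of long-duration lunar mapping under varying illumination. Contribution: 1) The STELLA end-to-end CBN system; 2) CRESENT-365, the first dataset emulating a year-long lunar mapping mission; 3) Validation of system performance under diverse conditions.
Method: STELLA integrates a Mask R-CNN crater detector, a descriptor-less crater identification module, a robust perspective-n-crater (PnCr) pose solver, and a batch orbit determination back-end.
Result: Experiments show STELLA achieves metre-level position accuracy and sub-degree attitude accuracy across diverse viewing angles and illumination conditions.
Insight: Future lunar mapping missions must account for global coverage and changing illumination; STELLA offers a viable solution.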
Abstract: Crater-Based Navigation (CBN) uses the ubiquitous impact craters of the Moon observed on images as natural landmarks to determine the six degrees of freedom pose of a spacecraft. To date, CBN has primarily been studied in the context of powered descent and landing. These missions are typically short in duration, with high-frequency imagery captured from a nadir viewpoint over well-lit terrain. In contrast, lunar mapping missions involve sparse, oblique imagery acquired under varying illumination conditions over potentially year-long campaigns, posing significantly greater challenges for pose estimation. We bridge this gap with STELLA - the first end-to-end CBN pipeline for long-duration lunar mapping. STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater pose solver, and a batch orbit determination back-end. To rigorously test STELLA, we introduce CRESENT-365 - the first public dataset that emulates a year-long lunar mapping mission. Each of its 15,283 images is rendered from high-resolution digital elevation models with SPICE-derived Sun angles and Moon motion, delivering realistic global coverage, illumination cycles, and viewing geometries. Experiments on CRESENT+ and CRESENT-365 show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across wide ranges of viewing angles, illumination conditions, and lunar latitudes. These results constitute the first comprehensive assessment of CBN in a true lunar mapping setting and inform operational conditions that should be considered for future missions.
[50] Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
Zoe Wanying He,Sean Trott,Meenakshi Khosla
Main category: cs.CV
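TL;DR: The paper studies deep representational alignment between vision and language models and finds that alignment emerges mainly in mid-to-late layers, is sensitive to semantic changes, and reflects human preferences in image-text matching.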
Details
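Motivation: To clarify the alignment between vision and language models in representational space: where it emerges, what cues it depends on, how it relates to human preferences, and how aggregating exemplars affects it. Contribution: Reveals the layer-wise nature and semantic sensitivity of representational alignment, and shows that alignment agrees with human preferences and is strengthened by exemplar aggregation.
Method: Systematically studies alignment via layer-wise representational analysis, semantic perturbation experiments, many-to-many image-text matching tasks, and exemplar aggregation.
Result: Alignment is strongest in mid-to-late layers, is sensitive to semantic changes, mirrors human preferences, and is further strengthened by aggregating exemplars.
Insight: Unimodal models spontaneously converge on a shared representational space consistent with human semantic judgments, and aggregating multiple exemplars strengthens this alignment.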
Abstract: Recent studies show that deep vision-only and language-only models–trained on disjoint modalities–nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice “Pick-a-Pic” task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.
[51] FreeInsert: Personalized Object Insertion with Geometric and Style Control
Yuhong Zhang,Han Wang,Yiwen Wang,Rong Xie,Li Song
Main category: cs.CV
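TL;DR: FreeInsert is a new training-free framework that uses 3D geometric information to insert personalized objects into arbitrary scenes, addressing the weak geometric control and style consistency of existing methods.
Details
Motivation: Existing image editing methods suffer from insufficient geometric control and poor style consistency in personalized image composition; FreeInsert aims to address these problems. Contribution: Proposes FreeInsert, a training-free framework that uses 3D geometric information for geometric control and style consistency, combined with diffusion adapters to generate high-quality edited images.
Method: Converts the 2D object to 3D, edits it interactively at the 3D level, re-renders it as a 2D image, and combines the result with diffusion adapters for geometric and style control.
Result: The generated images show precise geometric control and good style consistency.
Insight: 3D geometric information improves geometric control in 2D editing, and diffusion adapters effectively address style consistency.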
Abstract: Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.
[52] CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion
Maoye Ren,Praneetha Vaddamanu,Jianjin Xu,Fernando De la Torre Frade
Main category: cs.CV
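TL;DR: The paper proposes CustomEnhancer, a zero-shot enhancement method that uses ResInversion to improve existing identity customization models, supports multi-flow generation and training-free control, and markedly reduces complexity.
Details
Motivation: Existing text-to-image diffusion models face degraded scenes, insufficient control, and suboptimal perceptual identity when generating realistic human photos. Contribution: 1. Proposes the CustomEnhancer framework, which improves scene diversity and identity fidelity through triple-flow fused generation; 2. Introduces ResInversion, which markedly reduces inversion time.
Method: 1. Unifies generation and reconstruction with the triple-flow PerGeneration approach; 2. Uses ResInversion to rectify noise via a pre-diffusion mechanism, improving inversion efficiency.
Result: Experiments show that CustomEnhancer reaches SOTA in scene diversity, identity fidelity, and training-free control, and that ResInversion is 129 times faster than NTI.
Insight: 1. Multi-flow generation markedly improves generation quality; 2. The pre-diffusion mechanism is the key innovation behind efficient inversion.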
Abstract: Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches face degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization for them and eliminating the need for per-model controller retraining. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by 129 times. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.
[53] CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks
Hyomin Choi,Heeji Han,Chris Rosewarne,Fabien Racapé
Main category: cs.CV
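TL;DR: CompressAI-Vision is an open-source evaluation platform for comparing compression methods on computer vision tasks; it supports remote and split inference scenarios, incorporates standard codecs, and has been adopted in MPEG's development of the FCM standard.
Details
Motivation: As neural-network-based computer vision applications multiply, a platform is needed for evaluating compression methods optimized for downstream vision tasks, so that the effect of different compression techniques on task accuracy can be compared. Contribution: CompressAI-Vision provides a unified evaluation platform that supports comparing compression methods across tasks, datasets, and inference scenarios, and is used by MPEG for FCM standard development.
Method: The platform integrates standard codecs and evaluates how compressing the input data affects vision task accuracy, focusing on the trade-off between bit-rate and task accuracy.
Result: Demonstrates compression gains across datasets and codecs, showing the platform's practical value for evaluating how compression affects vision tasks.
Insight: Combining compression with task optimization is a direction for future research, and an open-source platform is a valuable tool for standardization and the research community.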
Abstract: With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of a vision network while retaining task accuracy in the context of two different inference scenarios: “remote” and “split” inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Picture Experts Group (MPEG) for the development of the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.
[54] Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain Generalization
Jincai Song,Haipeng Chen,Jun Qin,Na Zhao
Main category: cs.CV
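TL;DR: This paper proposes a dual-supervised asymmetric co-training (DAC) framework for cross-domain semi-supervised domain generalization (CD-SSDG), improving generalization to unseen test domains through feature-level supervision and asymmetric auxiliary tasks.
Details
Motivation: In medical image segmentation, semi-supervised domain generalization (SSDG) aims to reduce annotation cost and handle domain shift, but conventional methods assume that labeled and unlabeled data are available for every source domain, whereas in practice limited annotation and domain shift often coexist. The paper therefore studies the more practical and challenging CD-SSDG scenario. Contribution: 1. Introduces the new CD-SSDG scenario; 2. Designs the DAC framework, which integrates feature-level supervision and asymmetric auxiliary tasks; 3. Validates the framework on the Fundus, Polyp, and SCGM datasets.
Method: DAC builds on the co-training paradigm, using two sub-models that provide cross pseudo supervision, combined with feature-level supervision and asymmetric auxiliary tasks. Feature-level supervision compensates for inaccurate pseudo-labels caused by domain shift, while the auxiliary tasks strengthen the discriminability of domain-invariant features.
Result: Experiments on the Fundus, Polyp, and SCGM datasets show that DAC performs strongly on CD-SSDG and outperforms existing methods.
Insight: 1. Feature-level supervision and asymmetric auxiliary tasks markedly improve domain generalization; 2. CD-SSDG is a scenario closer to practical needs; 3. Model design must consider both pseudo-label quality and feature discriminability.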
Abstract: Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudolabels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.
[55] DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation
Ved Umrajkar
Main category: cs.CV
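TL;DR: DAC-LoRA proposes a dynamic adversarial curriculum framework that integrates adversarial training into parameter-efficient fine-tuning (PEFT), substantially improving adversarial robustness while maintaining clean accuracy.
Details
Motivation: Existing PEFT methods such as LoRA adapt efficiently to specific tasks but remain vulnerable to adversarial attacks, which can have serious consequences in safety-critical applications. Contribution: Proposes the DAC-LoRA framework, which combines a dynamic adversarial curriculum with PEFT, substantially improving robustness under adversarial attack while remaining lightweight and easy to use.
Method: Builds a dynamic adversarial curriculum that progressively increases attack difficulty, guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss.
Result: DAC-LoRA substantially improves adversarial robustness without significantly compromising clean accuracy, and integrates seamlessly into a standard PEFT pipeline.
Insight: The dynamic adversarial curriculum is a general approach that can potentially be applied to any iterative attack method, offering a new direction for adversarial training.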
Abstract: Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose DAC-LoRA (Dynamic Adversarial Curriculum), a novel framework that integrates adversarial training into PEFT. The core principle of our method, i.e., an intelligent curriculum of progressively challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method, and demonstrates that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.
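The two ingredients named in the abstract can be illustrated with a minimal pure-Python sketch. It assumes the commonly cited forms: a TRADES-style objective (clean cross-entropy plus a weighted KL term between clean and adversarial predictions) and the FOSC value for an L-infinity ball of radius eps. The function names, the `beta` default, and any curriculum schedule built on top are assumptions for illustration, not the DAC-LoRA implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def trades_style_loss(clean_logits, adv_logits, label, beta=6.0):
    """Cross-entropy on the clean input plus beta * KL(clean || adversarial).

    beta is a hypothetical trade-off weight, not a value taken from the paper.
    """
    lp_clean = log_softmax(clean_logits)
    lp_adv = log_softmax(adv_logits)
    cross_entropy = -lp_clean[label]
    kl = sum(math.exp(lc) * (lc - la) for lc, la in zip(lp_clean, lp_adv))
    return cross_entropy + beta * kl


def fosc(x_adv, x_clean, grad, eps):
    """FOSC value c(x) = eps * ||g||_1 - <x_adv - x_clean, g>.

    grad is the loss gradient with respect to the input at x_adv.
    A value of 0 means the attack has converged inside the eps-ball;
    larger values indicate a weaker adversarial example.
    """
    l1_norm = sum(abs(g) for g in grad)
    inner = sum((a - c) * g for a, c, g in zip(x_adv, x_clean, grad))
    return eps * l1_norm - inner
```

A curriculum could, for example, accept adversarial examples with a large FOSC value early in training and require smaller values later; that schedule is an assumption here, since the abstract does not spell it out.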
[56] Federated Domain Generalization with Domain-specific Soft Prompts Generation
Jianhan Wu,Xiaoyang Qu,Zhangcheng Huang,Jianzong Wang
Main category: cs.CV
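TL;DR: The paper proposes FedDSPG, a new method that addresses domain generalization in federated learning by generating domain-specific soft prompts (DSPs), outperforming existing baselines.
Details
Motivation: Existing federated domain generalization (FDG) methods are based on prompt learning, but the learned prompts have limited diversity and ignore information from unknown domains. Contribution: Proposes FedDSPG, which generates domain-specific soft prompts for FDG tasks and improves model generalization.
Method: A generative model integrates content and domain knowledge and generates DSPs used for inference on unknown domains.
Result: The method's effectiveness is validated on several public datasets, with state-of-the-art results.
Insight: Approaching FDG from a generative perspective is a novel and effective direction.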
Abstract: Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.
[57] Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning
Thanh Binh Le,Hoang Nhat Khang Vo,Tan-Ha Mai,Trong Nhan Phan
Main category: cs.CV
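TL;DR: This paper proposes LumbarCLIP, a multimodal framework that uses contrastive learning to align lumbar spine MRI scans with radiological descriptions, achieving high diagnostic accuracy.
Details
Motivation: Low back pain affects millions of people worldwide, creating a need for diagnostic models that combine medical images with text reports to improve diagnostic accuracy and efficiency. Contribution: 1. Proposes the LumbarCLIP framework, which combines vision and text encoders for cross-modal alignment; 2. Reaches 95.00% accuracy and a 94.75% F1-score on classification; 3. Shows through ablation that linear projection heads outperform non-linear variants.
Method: 1. Uses ResNet-50, Vision Transformer, and Swin Transformer as vision encoders; 2. Extracts text features with a BERT-based encoder; 3. Maps features into a shared embedding space with linear or non-linear projection heads; 4. Trains contrastively with a soft CLIP loss.
Result: The model reaches up to 95.00% accuracy and a 94.75% F1-score on the test set, and ablations confirm that linear projection heads are more effective.
Insight: LumbarCLIP offers an effective tool for automated musculoskeletal diagnosis and supports the advantage of simple linear projection heads for multimodal alignment.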
Abstract: Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.
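As a rough illustration of the contrastive step described above, the following pure-Python sketch computes a symmetric CLIP-style loss over L2-normalized image and text embeddings. The abstract does not define its "soft CLIP loss", so the optional `targets` matrix (soft row-wise target distributions) is one plausible reading and an assumption, as are the function names and the `temperature` default.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]


def cross_entropy_rows(logits, targets):
    """Mean over rows of -sum_j targets[i][j] * log_softmax(logits[i])[j]."""
    total = 0.0
    for row, target in zip(logits, targets):
        log_probs = log_softmax(row)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07, targets=None):
    """Average of image-to-text and text-to-image cross-entropy.

    With targets=None the targets are the identity matrix (pair i matches
    pair i), which is the standard hard-label CLIP objective.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    if targets is None:
        targets = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    logits_t = [list(col) for col in zip(*logits)]
    targets_t = [list(col) for col in zip(*targets)]
    return 0.5 * (cross_entropy_rows(logits, targets) + cross_entropy_rows(logits_t, targets_t))
```

In a real pipeline the embeddings would come from the projection heads on top of the vision and text encoders; here they are plain lists so the sketch stays dependency-free.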
[58] Poisoning Prompt-Guided Sampling in Video Large Language Models
Yuxin Cao,Wei Song,Jingling Xue,Jin Song Dong
Main category: cs.CV
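TL;DR: The paper proposes PoisonVID, the first black-box poisoning attack on prompt-guided sampling in video large language models (VideoLLMs); a closed-loop optimization strategy suppresses the relevance scores of harmful frames, reaching an attack success rate of 82% to 99%.
Details
Motivation: VideoLLM performance depends on the frame sampling strategy, especially recent prompt-guided sampling, whose safety has not been adequately studied. The paper aims to fill this gap and expose potential vulnerabilities. Contribution: 1. Proposes PoisonVID, the first black-box poisoning attack on prompt-guided sampling in VideoLLMs; 2. Builds a depiction set using a shadow VideoLLM and a lightweight language model, and generates a universal perturbation through closed-loop optimization; 3. Validates the attack's effectiveness across multiple sampling strategies and VideoLLMs.
Method: 1. Builds a depiction set of paraphrased harmful descriptions using a shadow VideoLLM and GPT-4o-mini; 2. Iteratively optimizes a universal perturbation in a closed loop to suppress the relevance scores of harmful frames; 3. Targets the prompt-guided sampling mechanism of VideoLLMs.
Result: Across three prompt-guided sampling strategies and three advanced VideoLLMs, PoisonVID reaches an 82% to 99% attack success rate, exposing the safety risks of these sampling strategies.
Insight: The study highlights the safety problems of prompt-guided sampling and informs the design of more robust sampling methods for future VideoLLMs.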
Abstract: Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves 82% - 99% attack success rate, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.
[59] Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer
Abdur Rehman,S M A Sharif,Md Abdur Rahaman,Mohamed Jismy Aashik Rasool,Seongwan Kim,Jaeho Lee
Main category: cs.CV
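TL;DR: The paper proposes GoR, a learnable regularization method that dynamically balances the task-specific and distillation objectives when quantization-aware training (QAT) is combined with knowledge distillation (KD), improving small models under low-bit quantization.
Details
Motivation: Existing QAT-KD methods struggle to balance task-specific (TS) and distillation losses under low-bit quantization, which degrades performance. The paper aims to address this. Contribution: Proposes the GoR method and the QAT-EKD-GoR framework; learnable regularization dynamically adjusts the loss weights, improving small quantized models and enabling efficient inference on edge devices.
Method: GoR dynamically balances the TS and KD objectives with only two trainable parameters. The QAT-EKD-GoR framework adds ensemble distillation from multiple teacher models.
Result: Experiments show that GoR outperforms existing methods on image classification, object detection, and LLM compression, and maintains full-precision accuracy on edge devices.
Insight: Combining dynamic loss weighting with ensemble distillation gives an efficient and robust solution for low-bit quantized models, with strong potential for real-world deployment.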
Abstract: Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.
[60] Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)
Herve Goeau,Pierre Bonnet,Alexis Joly
Main category: cs.CV
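TL;DR: The LifeCLEF 2017 plant identification challenge evaluated deep learning models trained on noisy web data against models trained on expert-checked data, and found the former competitive.
Details
Motivation: Although international botanical institutions provide many plant images, many species still lack high-quality labeled data. Web data is plentiful but noisily labeled; the study tests whether it can support large-scale plant identification. Contribution: Presents a plant identification challenge built on noisy web data, evaluates deep learning models trained on noisy data, and shows the potential of web data.
Method: Deep learning models are trained on two training sets (noisy web data and expert-checked data) and evaluated on an independent test set drawn from the Pl@ntNet application.
Result: Despite the noise in web data, deep learning models perform well and in some cases compete with models trained on expert-checked data.
Insight: Noisy web data can be an important resource for plant identification, and deep learning copes well with labeling errors and large-scale data.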
Abstract: The 2017 edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras, with 10,000 plant species living mainly in Europe and North America, illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts, the majority of plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers' web pages, image hosting websites and on-line plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
[61] TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting
Xiaonan Hu,Xuebing Li,Jinyu Xu,Abdulkadir Duran Adan,Letian Zhou,Xuhui Zhu,Yanan Li,Wei Guo,Shouyang Liu,Wenzhong Liu,Hao Lu
Main category: cs.CV
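TL;DR: TasselNetV4 is a vision foundation model for cross-scene, cross-scale, and cross-species plant counting; it combines the local counting idea with the extract-and-match paradigm, improving counting accuracy and efficiency.
Details
Motivation: Plants are diverse and change dynamically, and existing species-dependent counting models cannot handle new species, so a general counting method is needed. Contribution: Proposes TasselNetV4, which combines local counting with the extract-and-match paradigm and adds multi-branch box-aware local counters to improve cross-scale robustness.
Method: Builds a cross-species counting model on a plain vision transformer with multi-branch box-aware local counters.
Result: On two challenging datasets, TasselNetV4 clearly outperforms state-of-the-art class-agnostic counting models.
Insight: The core question in plant counting should shift from "what plants to count" to "how to count plants"; a general cross-species model is better suited to agricultural settings.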
Abstract: Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars, such that current CAC and open-world detection models are suboptimal for counting plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges to be a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.
[62] SD-RetinaNet: Topologically Constrained Semi-Supervised Retinal Lesion and Layer Segmentation in OCT
Botond Fazekas,Guilherme Aresta,Philipp Seeböck,Julia Mai,Ursula Schmidt-Erfurth,Hrvoje Bogunović
Main category: cs.CV
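TL;DR: SD-RetinaNet is a new semi-supervised model that introduces a fully differentiable biomarker topology engine to fix the anatomically implausible segmentations produced by existing methods on retinal OCT images; it enables bidirectional learning between layers and lesions and performs well on public and internal datasets.
Details
Motivation: Existing semi-supervised methods for retinal OCT segmentation produce anatomically implausible segmentations, model layer-lesion interactions poorly, and lack guarantees of topological correctness. Contribution: 1. Proposes a fully differentiable biomarker topology engine that enforces anatomically correct segmentation. 2. Enables bidirectional learning between layers and lesions. 3. Introduces a disentangled representation that separates spatial and style factors.
Method: The model uses a semi-supervised framework that combines the differentiable topology engine with the disentangled representation, trains on unlabeled and partially labeled data, and enforces anatomical consistency between layers and lesions.
Result: On public and internal OCT datasets, SD-RetinaNet outperforms the current state of the art in both layer and lesion segmentation and generalizes to pathological cases.
Insight: Anatomical constraints in semi-supervised learning markedly improve segmentation accuracy and robustness and make the results more trustworthy.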
Abstract: Optical coherence tomography (OCT) is widely used for diagnosing and monitoring retinal diseases, such as age-related macular degeneration (AMD). The segmentation of biomarkers such as layers and lesions is essential for patient diagnosis and follow-up. Recently, semi-supervised learning has shown promise in improving retinal segmentation performance. However, existing methods often produce anatomically implausible segmentations, fail to effectively model layer-lesion interactions, and lack guarantees on topological correctness. To address these limitations, we propose a novel semi-supervised model that introduces a fully differentiable biomarker topology engine to enforce anatomically correct segmentation of lesions and layers. This enables joint learning with bidirectional influence between layers and lesions, leveraging unlabeled and diverse partially labeled datasets. Our model learns a disentangled representation, separating spatial and style factors. This approach enables more realistic layer segmentations and improves lesion segmentation, while strictly enforcing lesion location in their anatomically plausible positions relative to the segmented layers. We evaluate the proposed model on public and internal datasets of OCT scans and show that it outperforms the current state-of-the-art in both lesion and layer segmentation, while demonstrating the ability to generalize layer segmentation to pathological cases using partially annotated training data. Our results demonstrate the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation.
[63] Plant identification in an open-world (LifeCLEF 2016)
Herve Goeau,Pierre Bonnet,Alexis Joly
Main category: cs.CV
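TL;DR: The paper presents the LifeCLEF 2016 plant identification challenge, which focuses on open-set recognition: identifying plants robustly when unknown categories are present.
Details
Motivation: To approximate a real-world biodiversity monitoring scenario using data collected through a large-scale crowdsourcing platform, and to address the practical challenge of open-set recognition. Contribution: (1) Organizes an open-set plant identification challenge based on crowdsourced data; (2) provides a benchmark and evaluation for open-set recognition.
Method: (1) Uses a crowdsourced dataset of more than 110K images covering 1000 plant species; (2) frames plant identification as an open-set classification problem.
Result: Participating systems varied considerably in how they handled unknown categories, though the submitted methods made some progress on open-set recognition.
Insight: The paper shows the practical importance of open-set recognition and the potential of crowdsourced data for biodiversity monitoring.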
Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016 edition was conducted on a set of more than 110K images illustrating 1000 plant species living in Western Europe, built through a large-scale participatory sensing platform initiated in 2011 and which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
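The rejection step described above (discarding classification hits caused by unknown classes) can be illustrated with a common baseline: thresholding the maximum softmax probability. The overview does not say which rejection rules participants used, so the function below, the `threshold` default, and the `UNKNOWN` label are assumptions for illustration only.

```python
import math

UNKNOWN = "unknown"  # hypothetical label for rejected queries


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, class_names, threshold=0.5):
    """Return (label, confidence); reject as UNKNOWN when confidence is low.

    The classifier only knows the training classes, so a query from an
    unseen species tends to spread probability mass across several classes.
    A confidence threshold is a simple, imperfect way to catch such cases.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return UNKNOWN, probs[best]
    return class_names[best], probs[best]
```

For example, `predict_open_set([5.0, 0.0, 0.0], ["oak", "ash", "elm"])` keeps the top class, while near-uniform logits such as `[0.1, 0.0, 0.05]` fall below the threshold and are rejected.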
[64] SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
Yan Zhang,Jiaqing Lin,Miao Zhang,Kui Xiao,Xiaoju Hou,Yue Zhao,Zhifei Li
Main category: cs.CV
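TL;DR: SCRA-VQA summarizes and reranks image captions to help large language models better understand VQA tasks, improving performance without expensive end-to-end training.
Details
Motivation: In conventional KB-VQA methods, large language models (LLMs) rely on noisy image captions and have limited understanding of the VQA task, which limits their reasoning. Contribution: Proposes SCRA-VQA, which generates captions with a pre-trained visual language model and removes irrelevant information by summarizing and reranking, improving LLM performance.
Method: Generates image captions with a pre-trained visual language model, generates contextual examples, and summarizes and reorders the captions to improve the LLM's task adaptability.
Result: Reaches accuracies of 38.8% on OK-VQA and 34.6% on A-OKVQA.
Insight: Improving the quality and relevance of image captions can substantially improve LLM performance on VQA without additional training.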
Abstract: Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model’s reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at https://github.com/HubuKG/SCRA-VQA.
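The reranking idea can be sketched with a deliberately simple stand-in: score each caption by token overlap (Jaccard similarity) with the question and keep the top few. The abstract does not state how SCRA-VQA scores captions, so the scoring function, `top_k`, and all names below are hypothetical; the linked repository is the place to look for the actual method.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_captions(question, captions, top_k=2):
    """Order captions by Jaccard overlap with the question and keep top_k.

    Captions that share no vocabulary with the question sink to the bottom
    and are dropped, which mimics filtering out question-irrelevant noise.
    """
    question_tokens = tokenize(question)

    def score(caption):
        caption_tokens = tokenize(caption)
        union = question_tokens | caption_tokens
        return len(question_tokens & caption_tokens) / len(union) if union else 0.0

    return sorted(captions, key=score, reverse=True)[:top_k]
```

The retained captions would then be placed in the LLM prompt ahead of the question; a learned relevance model would normally replace the token-overlap score.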
[65] The Unanticipated Asymmetry Between Perceptual Optimization and Assessment
Jiabei Zhang,Qi Wang,Siyu Wu,Du Chen,Tianhe Wu
Main category: cs.CV
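TL;DR: The paper reveals an asymmetry between perceptual optimization and assessment: fidelity metrics that excel at IQA are not necessarily effective for perceptual optimization, and discriminator design plays a decisive role in optimization.
Details
Motivation: To examine the relationship between how effective fidelity and adversarial objectives are for perceptual optimization and how capable they are as image quality assessment (IQA) metrics, which remains underexplored. Contribution: Systematically analyzes the asymmetry between perceptual optimization and assessment, finds that fidelity metrics behave differently in IQA and in optimization, and shows that discriminator design is critical to optimization.
Method: Through systematic experiments, compares different objectives (fidelity and adversarial) in optimization and in IQA, and studies how discriminator architecture (such as patch-level and convolutional designs) affects optimization.
Result: A fidelity metric's IQA performance does not predict its effectiveness for optimization, and the mismatch is more distinct under adversarial training; discriminator design is critical to optimization.
Insight: Perceptual optimization and assessment are asymmetric, and patch-level and convolutional discriminators perform better in optimization, which offers guidance for future optimization methods.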
Abstract: Perceptual optimization is primarily driven by the fidelity objective, which enforces both semantic consistency and overall visual realism, while the adversarial objective provides complementary refinement by enhancing perceptual sharpness and fine-grained detail. Despite their central role, the correlation between their effectiveness as optimization objectives and their capability as image quality assessment (IQA) metrics remains underexplored. In this work, we conduct a systematic analysis and reveal an unanticipated asymmetry between perceptual optimization and assessment: fidelity metrics that excel in IQA are not necessarily effective for perceptual optimization, with this misalignment emerging more distinctly under adversarial training. In addition, while discriminators effectively suppress artifacts during optimization, their learned representations offer only limited benefits when reused as backbone initializations for IQA models. Beyond this asymmetry, our findings further demonstrate that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives. These insights advance the understanding of loss function design and its connection to IQA transferability, paving the way for more principled approaches to perceptual optimization.
[66] Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Zhifei Li,Feng Qiu,Yiran Wang,Yujing Xia,Kui Xiao,Miao Zhang,Yan Zhang
Main category: cs.CV
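TL;DR: The paper proposes IOG-VQA, which combines object interaction self-attention with GAN-based debiasing to improve visual question answering, emphasizing the importance of handling both object interactions and data bias.
Details
Motivation: Existing VQA models are susceptible to biases in the training data, leading to over-reliance on superficial patterns and poor generalization to diverse questions and images. A method is therefore needed that addresses both complex object interactions and data bias. Contribution: 1. Proposes IOG-VQA, combining object interaction self-attention with a GAN-based debiasing framework. 2. Captures complex interactions between objects through self-attention. 3. Uses a GAN to generate unbiased data distributions, improving generalization.
Method: 1. Introduces an object interaction self-attention mechanism to better model relations between objects in an image. 2. Designs a GAN-based debiasing framework that generates balanced data distributions to reduce the model's reliance on bias.
Result: Experiments on VQA-CP v1 and VQA-CP v2 show IOG-VQA outperforming existing methods, particularly on biased and imbalanced data.
Insight: 1. Object interactions and dataset bias are key factors in VQA performance. 2. Combining self-attention with GAN-based debiasing improves robustness and generalization.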
Abstract: Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.
[67] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos
Tristan S. W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J. G. van Sloun
Main category: cs.CV
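TL;DR: The paper proposes Nuclear Diffusion, a hybrid framework that combines low-rank temporal modeling with diffusion posterior sampling for video background suppression. Applied to cardiac ultrasound dehazing, it outperforms traditional RPCA.
Details
Motivation: Videos often contain structured noise and background artifacts, and traditional methods built on a sparsity assumption struggle to capture the rich variability of real video data. The authors address this by combining low-rank modeling with a diffusion model.
Contribution: The main contribution is the Nuclear Diffusion framework, which combines low-rank temporal modeling with diffusion posterior sampling to achieve better background suppression and signal preservation.
Method: The core is a hybrid framework: 1) low-rank modeling captures the background structure; 2) diffusion posterior sampling supplies a high-fidelity prior for the dynamic content. The method is validated on cardiac ultrasound dehazing.
Result: Experiments show that Nuclear Diffusion outperforms traditional RPCA on contrast enhancement (gCNR) and signal preservation (KS statistic).
Insight: The paper highlights the potential of combining model-based methods with generative priors such as diffusion models for high-fidelity video restoration.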
Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA concerning contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.
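The low-rank side of RPCA-style methods is usually enforced through the nuclear norm, whose proximal step is singular value thresholding. The sketch below shows only that standard step on a toy matrix; the abstract does not describe how Nuclear Diffusion couples it with diffusion posterior sampling, so the toy data, the threshold tau, and the function name are assumptions.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (u * s_shrunk) @ vt

# Toy video: 20 frames of 64 pixels, flattened to a (pixels, frames) matrix.
rng = np.random.default_rng(0)
background = np.outer(rng.normal(size=64), np.ones(20))   # static, rank-1 background
video = background + 0.01 * rng.normal(size=(64, 20))      # plus small dynamic content
low_rank = singular_value_threshold(video, tau=1.0)
rank = np.linalg.matrix_rank(low_rank)
```

With this choice of tau, the small singular values produced by the dynamic content are zeroed and only the static background component survives in low_rank.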
[68] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification
Patrick Knab,Sascha Marton,Philipp J. Schubert,Drago Guggiana,Christian Bartelt
Main category: cs.CV
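TL;DR: The paper proposes MoTIF, a framework that extends concept bottleneck models to video classification. By explicitly modeling global concept importance, local concept relevance, and temporal dependencies, it makes video classification more interpretable.
Details
Motivation: Existing concept bottleneck models mainly target static image classification, whereas the temporal dependencies in video make interpretable modeling harder. The work aims to extend concept bottleneck models to video classification to better understand and explain actions and events in videos.
Contribution: 1. Proposes the MoTIF framework, which adapts concept bottleneck models to video classification. 2. Explicitly models concept relevance from three perspectives: global, local, and temporal. 3. Improves interpretability while keeping performance competitive.
Method: 1. Designs MoTIF as a transformer-inspired architecture that supports video sequences of arbitrary length. 2. Captures temporal dependencies in video by modeling concepts (such as objects, attributes, or actions) dynamically. 3. Analyzes concept importance at global, local, and temporal scales.
Result: Experiments show that MoTIF transfers the concept bottleneck paradigm to video data effectively, improving interpretability while maintaining competitive classification performance.
Insight: Concepts in video are not only static semantic entities but also have temporal dynamics; explicitly modeling these dynamics is key to interpretable video classification.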
Abstract: Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., ‘bow’, ‘mount’, ‘shoot’) that reoccur across time, forming motifs that collectively describe and explain actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code is available at github.com/patrick-knab/MoTIF.
[69] FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data
Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre
Main category: cs.CV
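TL;DR: The paper proposes FSMODNet, a framework for few-shot multispectral object detection that uses cross-modality feature integration and deformable attention to improve detection on visible and thermal imagery.
Details
Motivation: Object detection on multispectral data such as visible and thermal imagery suffers from scarce annotations, especially under complex illumination and environmental conditions. The paper aims to tackle few-shot detection through cross-modality information fusion.
Contribution: 1. Proposes FSMODNet, a framework focused on few-shot multispectral object detection. 2. Introduces deformable attention to combine the strengths of visible and thermal data. 3. Validates robustness in low-data regimes on public datasets.
Method: 1. Uses cross-modality feature integration to combine visible and thermal information. 2. Uses deformable attention to adjust feature aggregation dynamically and adapt to complex environments.
Result: Experiments on two public datasets show that FSMODNet outperforms several baselines in low-data regimes, particularly under complex illumination and environmental conditions.
Insight: Cross-modality feature integration and deformable attention are effective strategies for few-shot learning in multispectral object detection and improve adaptability to difficult conditions.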
Abstract: Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named “FSMODNet” that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.
[70] Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences
Julius Pesonen,Arno Solin,Eija Honkavaara
Main category: cs.CV
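TL;DR: The paper proposes a particle-filter-based method that localizes distant objects in 3D from noisy camera motion and semantic segmentation sequences, suited to resource-constrained or long-range scenarios.
Details
Motivation: When computational resources are limited or targets are far away, dense depth estimation and 3D scene reconstruction are infeasible, so a lightweight and flexible solution is needed.
Contribution: Proposes a particle-filter-based 3D object localization method that handles single- and multi-target scenarios and is independent of the detection method, which makes it flexible.
Method: Uses a particle filter that combines camera poses with image segmentation sequences for 3D localization, validated with a 3D simulation and a drone experiment.
Result: Experiments show that the method works in distant-object or resource-constrained scenarios where other solutions fail, and that it can be paired with an existing image segmentation model for practical tasks such as wildfire monitoring.
Insight: Particle filters offer a lightweight, general solution for 3D localization in difficult or constrained settings, and are well suited to applications such as drone-based monitoring.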
Abstract: 3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with dense depth estimation or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved using particle filters for both single and multiple target scenarios. The method was studied using a 3D simulation and a drone-based image segmentation sequence with global navigation satellite system (GNSS)-based camera pose estimates. The results showed that a particle filter can be used to solve practical localisation tasks based on camera poses and image segments in these situations where other solutions fail. The particle filter is independent of the detection method, making it flexible for new tasks. The study also demonstrates that drone-based wildfire monitoring can be conducted using the proposed method paired with a pre-existing image segmentation model.
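To make the idea of localizing from camera poses and image segments more concrete, here is a minimal particle-filter sketch. It is not the authors' implementation: the pinhole camera without rotation, the hit/miss likelihood values, the jitter scale, the multinomial resampling, and every name and number below are assumptions chosen for a self-contained toy example.

```python
import numpy as np

def project(points, cam_pos, focal=500.0, center=(320.0, 240.0)):
    """Pinhole projection for a camera at cam_pos looking along +z (no rotation, an assumption)."""
    rel = points - cam_pos
    z = np.maximum(rel[:, 2], 1e-6)                  # avoid division by zero behind the camera
    u = focal * rel[:, 0] / z + center[0]
    v = focal * rel[:, 1] / z + center[1]
    return np.stack([u, v], axis=1), rel[:, 2] > 0

def update(particles, weights, cam_pos, mask, rng, hit=0.9, miss=0.1):
    """One filter step: reweight particles by segmentation-mask membership, then resample."""
    pixels, in_front = project(particles, cam_pos)
    cols = np.round(pixels[:, 0]).astype(int)
    rows = np.round(pixels[:, 1]).astype(int)
    h, w = mask.shape
    valid = in_front & (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    inside = np.zeros(len(particles), dtype=bool)
    inside[valid] = mask[rows[valid], cols[valid]]
    weights = weights * np.where(inside, hit, miss)  # soft likelihood (assumed values)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)  # multinomial resampling
    jitter = rng.normal(scale=0.5, size=particles.shape)              # keeps particle diversity
    return particles[idx] + jitter, np.full(len(particles), 1.0 / len(particles))

rng = np.random.default_rng(0)
target = np.array([[40.0, -30.0, 150.0]])            # hypothetical ground-truth object position
n = 20000
particles = rng.uniform([-100, -100, 50], [100, 100, 400], size=(n, 3))
weights = np.full(n, 1.0 / n)
for cx in np.linspace(-40.0, 40.0, 9):               # camera slides along the x axis
    cam_pos = np.array([cx, 0.0, 0.0])
    pix, _ = project(target, cam_pos)
    u, v = np.round(pix[0]).astype(int)
    mask = np.zeros((480, 640), dtype=bool)          # synthetic segmentation of the target
    mask[v - 20:v + 21, u - 20:u + 21] = True
    particles, weights = update(particles, weights, cam_pos, mask, rng)
estimate = particles.mean(axis=0)                    # point estimate of the 3D position
```

The filter only needs a binary mask per frame, which mirrors the abstract's point that the approach is independent of the detection method; any segmentation model could supply the mask.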
[71] SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images
Qinfeng Zhu,Han Li,Liang He,Lei Fan
Main category: cs.CV
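TL;DR: Proposes SwinMamba, a new framework that combines local and global feature perception to improve semantic segmentation of remote sensing images.
Details
Motivation: Semantic segmentation of remote sensing images is challenged by high resolution, complex scenes, and multi-scale objects, and existing Vision Mamba methods are efficient but overlook local features.
Contribution: Proposes the SwinMamba framework, whose combined local-global Mamba scanning strengthens local feature capture while retaining global context.
Method: Drawing on the Swin Transformer, the first two stages use local scanning to capture detail and the last two stages use global scanning to fuse context; overlapping shifted windows improve information exchange between regions.
Result: Experiments on the LoveDA and ISPRS Potsdam datasets show that SwinMamba outperforms existing methods.
Insight: Combining local and global features is key to handling the complexity of remote sensing segmentation, and the overlapping shifted-window design improves feature fusion.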
Abstract: Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model’s perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
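The difference between global scanning and window-local scanning comes down to the order in which image tokens are fed to the sequence model. The sketch below builds both orders for a small token grid. It is a simplification under stated assumptions: windows here are non-overlapping with a cyclic shift, whereas the abstract describes overlapping shifted windows, and the function names are made up for the example.

```python
import numpy as np

def global_scan_order(height, width):
    """Row-major order over all tokens: one long sequence for the whole image."""
    return np.arange(height * width)

def local_scan_order(height, width, window, shift=0):
    """Row-major order inside each window, with windows visited row by row.

    `shift` cyclically offsets the grid first, in the spirit of Swin-style shifted windows.
    """
    ids = np.arange(height * width).reshape(height, width)
    ids = np.roll(ids, shift=(-shift, -shift), axis=(0, 1))
    blocks = ids.reshape(height // window, window, width // window, window)
    return blocks.transpose(0, 2, 1, 3).reshape(-1)

order = local_scan_order(4, 4, window=2)
shifted = local_scan_order(4, 4, window=2, shift=1)
```

In the local order, the four tokens of each 2x2 window are adjacent in the sequence, so a recurrent scan sees neighbouring pixels together; the shifted variant changes which tokens share a window.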
[72] Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework
Wenhao Tang,Heng Fang,Ge Wu,Xiang Li,Ming-Ming Cheng
Main category: cs.CV
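TL;DR: Proposes PackMIL, a pack-based multiple instance learning framework that addresses data heterogeneity and redundancy in computational pathology (CPath); sequence packing, a residual branch, and related techniques substantially improve training efficiency and accuracy.
Details
Motivation: Whole slide images (WSIs) in computational pathology have extremely long sequence lengths and pronounced heterogeneity, which conventional methods handle inefficiently.
Contribution: 1. Proposes the PackMIL framework, which packs variable-length sequences into fixed-length ones to enable batched training; 2. Introduces a residual branch and multi-slide supervision; 3. Proposes an attention-driven downsampler to reduce redundancy.
Method: 1. Packs variable-length sequences into fixed-length ones for batched training; 2. Uses a residual branch to compose discarded features into a “hyperslide”; 3. Compresses features with an attention-driven downsampler to reduce redundancy.
Result: On PANDA (UNI), accuracy improves by up to 8% while using only 12% of the training time.
Insight: Optimizations that target data challenges hold significant potential in computational pathology, especially in the era of foundation models.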
Abstract: Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time on PANDA (UNI). Extensive experiments demonstrate that focusing on data challenges in CPath holds significant potential in the era of foundation models. The code is available at https://github.com/FangHeng/PackMIL
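Packing variable-length sequences into fixed-length containers can be illustrated with a simple bin-packing heuristic. The sketch below uses greedy first-fit-decreasing, which is an assumption: the abstract does not say which packing strategy PackMIL uses, and the truncation of over-long sequences here stands in for the sampling and residual branch the abstract describes. The lengths and pack_len are made-up values.

```python
def pack_sequences(lengths, pack_len):
    """Greedy first-fit-decreasing packing of sequence indices into fixed-length packs.

    Returns (packs, free): packs[p] is a list of (sequence_index, tokens_used) and
    free[p] is the number of padding slots left in pack p.
    Sequences longer than pack_len are truncated to pack_len (a simplifying assumption).
    """
    packs, free = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        need = min(lengths[idx], pack_len)
        for p, room in enumerate(free):
            if need <= room:                 # first existing pack with enough room
                packs[p].append((idx, need))
                free[p] -= need
                break
        else:                                # no pack fits: open a new one
            packs.append([(idx, need)])
            free.append(pack_len - need)
    return packs, free

packs, free = pack_sequences([200, 1500, 700, 3000, 900, 400], pack_len=2048)
```

Every pack has the same total length once padding is counted, so the packs can be stacked into a regular batch tensor; an attention mask per pack would normally keep sequences from attending to each other.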
[73] SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation
Akihisa Watanabe,Jiawei Ren,Li Siyao,Yichen Peng,Erwin Wu,Edgar Simo-Serra
Main category: cs.CV
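TL;DR: SimDiff is a diffusion-based method for generating physically plausible motion. It integrates simulator constraints directly into the denoising process, which avoids repeated simulator calls at inference, improves efficiency, and supports fine-grained control over environment parameters.
Details
Motivation: Existing methods enforce physical plausibility with a simulator layer, which is computationally expensive and cannot be parallelized. SimDiff aims to generate physically plausible motion efficiently without simulator calls at inference.
Contribution: 1. Interprets simulator constraints as a form of guidance in the diffusion process; 2. Proposes SimDiff, which integrates environment parameters directly into the denoising process; 3. Demonstrates compositional generalization.
Method: SimDiff models simulator constraints as guidance in the diffusion process (classifier-based or classifier-free) and generates physically plausible motion by conditioning on environment parameters.
Result: SimDiff generates physically plausible motion efficiently, supports control over multiple environment parameters, and generalizes to unseen parameter combinations.
Insight: Simulator constraints can be realized through conditioning in a diffusion model, which avoids the inference-time compute bottleneck while preserving physical plausibility.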
Abstract: Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer to the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.
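For readers unfamiliar with the guidance the abstract refers to, the standard classifier-free guidance rule combines a conditional and an unconditional noise prediction at each denoising step. The sketch below shows only that generic rule; how SimDiff conditions on environment parameters such as gravity or wind is not specified in the abstract, and the toy vectors are assumptions.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 extrapolates further toward the condition.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy prediction without environment parameters
eps_cond = np.array([1.0, 3.0])     # toy prediction conditioned on environment parameters
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
```

Because both predictions come from the same network in a batched forward pass, this kind of guidance parallelizes well, unlike a sequential simulator call.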
[74] Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models
Bum Jun Kim,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo
Main category: cs.CV
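TL;DR: By analyzing 1,174 pretrained vision models, the paper identifies four architectural design patterns that are more robust to Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs rather than CLIP ViTs. It then gives a theoretical analysis of the mechanisms behind these findings.
Details
Motivation: Although the robustness of vision models is often measured, its dependence on specific architectural choices is rarely examined in depth. The paper investigates which design choices inherently improve robustness to Gaussian noise and turns the empirical findings into actionable design rules.
Contribution: 1. Identifies four design patterns that markedly improve robustness to Gaussian noise. 2. Explains the mechanisms behind these empirical findings through theoretical analysis. 3. Provides practical, plug-and-play design guidelines for building more robust vision models.
Method: 1. Evaluates 1,174 pretrained vision models to identify robust design patterns. 2. Builds mathematical models and theoretical analyses of the patterns found (low-pass stem kernels, anti-aliased downsampling, pooling type, and so on). 3. Explains the vulnerability of CLIP ViTs with tools such as Lipschitz bounds.
Result: The identified design patterns yield up to 506 rank improvements and accuracy gains of up to 21.6 percentage points. The theoretical analysis clarifies the mechanisms behind these effects, for example that the noise gain of low-pass stem kernels decreases quadratically with kernel size.
Insight: 1. Robustness can be broken down into interpretable architectural modules. 2. The choice of pooling matters for noise suppression: average pooling beats max pooling. 3. The smaller normalization standard deviations in CLIP preprocessing amplify the model's worst-case sensitivity.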
Abstract: While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
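Two of the quantitative claims in the abstract can be checked with a few lines of arithmetic. The sketch below measures how non-overlapping average pooling shrinks the variance of i.i.d. Gaussian noise, and computes the ratio between an Inception-style normalization standard deviation of 0.5 and the smallest per-channel standard deviation commonly used in CLIP preprocessing. The constants are the widely used defaults and are an assumption about what the paper compares; the image size and function name are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(scale=1.0, size=(512, 512))   # unit-variance Gaussian noise image

def average_pool(image, k):
    """Non-overlapping k x k average pooling."""
    h, w = image.shape
    return image.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

# Averaging k*k independent samples divides the noise variance by roughly k*k.
variances = {k: average_pool(noise, k).var() for k in (1, 2, 4, 8)}

# Dividing pixels by a smaller std stretches a fixed pixel-space perturbation more.
clip_std = np.array([0.26862954, 0.26130258, 0.27577711])  # commonly used CLIP preprocessing values
inception_std = 0.5                                          # Inception-style preprocessing
amplification = inception_std / clip_std.min()
```

The amplification value rounds to 1.91, which matches the factor quoted in the abstract, and the measured variances fall off with the pooling window area as the abstract states for average pooling.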
[75] Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery
Angelo Henriques,Korab Hoxha,Daniel Zapp,Peter C. Issa,Nassir Navab,M. Ali Nasseri
Main category: cs.CV
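TL;DR: This paper systematically reviews research on scene graphs (SGs) in surgery. It reveals rapid growth and a critical ‘data divide’, and surveys how SG techniques are applied to surgical analysis and generative tasks and where they are heading.
Details
Motivation: The complexity and dynamics of surgical environments call for structured relational representations, for which scene graphs are an effective tool, but a ‘data divide’ in current research limits how well results translate.
Contribution: Systematically summarizes the applications and methodological progress of SGs in surgery, exposes the ‘data divide’, and argues for SGs as a key technology for intelligent surgical systems.
Method: Conducts a PRISMA-ScR-guided scoping review that analyzes the applications and methodological development of SGs in surgery, in particular the evolution from foundational graph neural networks to specialized foundation models.
Result: The review finds that SGs have become a cornerstone technology for tasks such as surgical workflow recognition, automated safety monitoring, and controllable surgical simulation.
Insight: SGs support not only analysis of surgical scenes but also generative tasks, although data annotation and real-time implementation remain open challenges. SGs are becoming a key semantic bridge for improving surgical safety and efficiency.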
Abstract: Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical ‘data divide’: internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.
[76] A Real-Time On-Device Defect Detection Framework for Laser Power-Meter Sensors via Unsupervised Learning
Dongqi Zheng,Wenjin Fu,Guangzong Chen
Main category: cs.CV
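TL;DR: The paper proposes a real-time, on-device defect detection framework for laser power-meter sensors based on unsupervised learning, using synthetic data augmentation and multi-scale feature extraction for efficient detection.
Details
Motivation: Coating defects on laser power-meter sensors, such as thermal damage and scratches, degrade laser energy measurement accuracy, which matters especially in medical and industrial applications.
Contribution: 1. Proposes an unsupervised anomaly detection framework that trains only on “good” samples; 2. Combines Laplacian edge detection, K-means clustering, and StyleGAN2 synthetic data; 3. Uses a UFlow-based network for multi-scale feature extraction.
Method: 1. A preprocessing pipeline (Laplacian edge detection and K-means clustering to segment the area of interest); 2. Synthetic data augmentation with StyleGAN2; 3. A UFlow network architecture that generates anomaly maps.
Result: Tested on 366 real sensor images, the system reaches 93.8% accuracy on defective samples and 89.3% on good samples, with image-level and pixel-level AUROC of 0.957 and 0.961.
Insight: The unsupervised approach needs no large labeled defect dataset, which suits real industrial settings; the on-device implementation (0.5 seconds per image) shows potential for real-time use.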
Abstract: We present an automated vision-based system for defect detection and classification of laser power meter sensor coatings. Our approach addresses the critical challenge of identifying coating defects such as thermal damage and scratches that can compromise laser energy measurement accuracy in medical and industrial applications. The system employs an unsupervised anomaly detection framework that trains exclusively on “good” sensor images to learn normal coating distribution patterns, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of three key components: (1) a robust preprocessing pipeline using Laplacian edge detection and K-means clustering to segment the area of interest, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network architecture for multi-scale feature extraction and anomaly map generation. Experimental evaluation on 366 real sensor images demonstrates 93.8% accuracy on defective samples and 89.3% accuracy on good samples, with image-level AUROC of 0.957 and pixel-level AUROC of 0.961. The system provides potential annual cost savings through automated quality control and processing times of 0.5 seconds per image in on-device implementation.
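The first component, Laplacian edge detection followed by K-means clustering, can be sketched on a synthetic image. This is only an illustration of the two classical building blocks: the abstract does not say how the paper combines them, and the 4-neighbour kernel, the scalar-intensity clustering, the quantile initialization, and the toy image are all assumptions.

```python
import numpy as np

def laplacian(image):
    """4-neighbour discrete Laplacian with edge replication at the border."""
    p = np.pad(image, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * image

def kmeans_1d(values, k=2, iters=20):
    """Tiny Lloyd's k-means on scalar intensities; centers start at evenly spaced quantiles."""
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0                        # bright "sensor" region on a dark background
edges = np.abs(laplacian(image)) > 0           # nonzero response marks intensity transitions
labels, centers = kmeans_1d(image.ravel())
sensor_mask = labels.reshape(image.shape) == centers.argmax()   # brighter cluster = area of interest
```

On real sensor photographs the clustering would more plausibly run on colour or texture features, and the edge map would help delimit the coating boundary before anomaly detection.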
[77] Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
Sarmistha Das,R E Zera Marveen Lyngkhoi,Sriparna Saha,Alka Maurya
Main category: cs.CV
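TL;DR: FASTER is a multimodal summarization framework for financial advisory videos. It extracts modality-specific features, generates optimized summaries, and aligns visual keyframes with text, making financial content easier to understand and act on.
Details
Motivation: Financial advisory videos are long and multimodal (visual and textual), which makes extracting key information difficult. Existing methods fall short on modality alignment and summary quality.
Contribution: 1. Proposes the FASTER framework, which addresses modality-specific feature extraction, summary optimization, and visual-text alignment; 2. Introduces the Fin-APT dataset, filling a gap in public financial video data; 3. Uses a modified DPO-based loss function to keep summaries precise and factually consistent.
Method: Combines BLIP (semantic visual descriptions), OCR (textual pattern extraction), and Whisper (transcription with speaker diarization) with a modified DPO loss and a ranker-based keyframe retrieval mechanism.
Result: FASTER outperforms mainstream LLMs and VLMs on multimodal summarization and shows strong robustness and generalization.
Insight: Cross-modal alignment and data scarcity are the core challenges of financial video summarization; FASTER advances the area through its modular design and dataset contribution.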
Abstract: The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with Speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To acknowledge data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER’s strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
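The abstract mentions a modified DPO-based loss without giving its form. As background, the sketch below implements the standard Direct Preference Optimization loss from sequence log-probabilities; the paper's modification and its fact-checking term are not reproduced, and the example log-probabilities and beta value are assumptions.

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss from summed log-probabilities of chosen and rejected summaries.

    Each argument is an array of per-example sequence log-probabilities under the
    trained policy or the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written in a numerically stable form
    return np.mean(np.logaddexp(0.0, -beta * margin))

# A policy identical to the reference gives margin 0 and therefore loss log(2).
loss_at_reference = dpo_loss(np.array([-10.0]), np.array([-12.0]),
                             np.array([-10.0]), np.array([-12.0]))
# A policy that raises the chosen summary and lowers the rejected one gets a smaller loss.
loss_improved = dpo_loss(np.array([-8.0]), np.array([-14.0]),
                         np.array([-10.0]), np.array([-12.0]))
```

The loss only rewards the policy for widening the preference margin relative to the reference model, which is why it needs paired preferred and dispreferred summaries.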
[78] SiNGER: A Clearer Voice Distills Vision Transformers Further
Geunhyeok Yu,Sunjae Jeong,Yoonyoung Choi,Jaeseung Kim,Hyoseok Hwang
Main category: cs.CV
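TL;DR: The paper proposes SiNGER, a new distillation framework that uses nullspace-guided energy reallocation to suppress high-norm artifacts in the teacher model while preserving informative signals, improving student performance.
Details
Motivation: Vision Transformers are widely used as backbones of vision foundation models, but the high-norm artifacts they produce degrade representation quality. Conventional knowledge distillation leads students to overfit these artifacts and underweight informative signals, which hurts performance.
Contribution: Proposes the SiNGER framework, whose nullspace-guided feature refinement suppresses artifacts while preserving informative signals and improves student models.
Method: Refines teacher features with a nullspace-guided perturbation, implemented efficiently with a LoRA-based adapter that requires minimal structural modification.
Result: Experiments show that SiNGER improves student models on multiple downstream tasks, reaches state-of-the-art performance, and yields clearer, more interpretable representations.
Insight: Separating artifacts from informative signals with a nullspace perturbation gives knowledge distillation a more effective approach and resolves the trade-off faced by earlier methods.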
Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher’s features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.
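The intuition behind a nullspace-guided perturbation is that a change confined to the nullspace of a linear map cannot alter that map's output. The sketch below demonstrates this linear-algebra fact with NumPy; it does not reproduce SiNGER, whose choice of map, perturbation, and LoRA adapter the abstract does not detail. The matrix sizes and variable names are assumptions.

```python
import numpy as np

def nullspace_projector(weight, tol=1e-10):
    """Projector onto the right nullspace of `weight` (shape: d_out x d_in)."""
    _, s, vt = np.linalg.svd(weight, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    null_basis = vt[rank:].T            # columns span {x : weight @ x = 0}
    return null_basis @ null_basis.T

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8))        # hypothetical downstream linear map (rank 4)
features = rng.normal(size=(10, 8))     # hypothetical teacher token features
raw_update = rng.normal(size=(10, 8))   # e.g. a step meant to shrink outlier norms
safe_update = raw_update @ nullspace_projector(weight)
refined = features + safe_update
```

Here refined differs from features, yet refined @ weight.T equals features @ weight.T up to rounding, so whatever the downstream map reads from the features is preserved while the features themselves are reshaped.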
[79] Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors
Jan Kněžík,Jonáš Herec,Rado Pitoňák
Main category: cs.CV
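TL;DR: Fast-SEnSeI is a lightweight, sensor-independent encoder module that enables flexible on-board cloud segmentation for multispectral sensors with different band configurations.
Details
Motivation: Cloud segmentation is a key preprocessing step for Earth observation, but existing models usually depend on specific sensor configurations and require ground-based processing. A flexible and efficient solution is needed.
Contribution: Proposes the Fast-SEnSeI module, with an improved spectral descriptor and a lightweight architecture, support for arbitrary band combinations as input, and a CPU-FPGA hybrid deployment scheme.
Method: Builds on SEnSeI-v2, integrating an improved spectral descriptor and a lightweight architecture, with a quantized U-Net segmentation model; the encoder runs on embedded CPUs via Apache TVM and the segmentation model runs on FPGA.
Result: Evaluations on Sentinel-2 and Landsat 8 datasets show accurate segmentation across diverse input configurations.
Insight: Fast-SEnSeI demonstrates efficient and flexible on-board processing and offers a practical route to sensor-independent cloud segmentation for multispectral sensors.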
Abstract: Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.
[80] Vision Transformers: the threat of realistic adversarial patches
Kasper Cools,Clara Maathuis,Alexander M. van Oers,Claudia S. Hübner,Nikos Deligiannis,Marijke Vandewal,Geert De Cubber
Main category: cs.CV
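TL;DR: The paper studies realistic adversarial patch attacks on Vision Transformers (ViTs), designing lifelike patches with the Creases Transformation (CT) technique and exposing the vulnerability of ViT models in classification tasks.
Details
Motivation: With the wide adoption of machine vision systems, ViTs in particular, their security has become a critical concern. The paper examines how vulnerable ViTs are to targeted adversarial attacks, especially how their robustness compares with that of CNNs.
Contribution: 1) Proposes generating realistic adversarial patches with the Creases Transformation (CT) technique; 2) Verifies that adversarial attack techniques transfer from CNNs to ViTs; 3) Reveals significant vulnerability of ViT models to adversarial attacks.
Method: Generates realistic adversarial patches with the CT technique and measures attack success rates on four different ViT models.
Result: The experiments reveal large variation: attack success rates range from 40.04% to 99.97%, indicating that ViT robustness to adversarial attacks depends on pre-training dataset scale and methodology.
Insight: Although ViTs outperform CNNs on some tasks, they remain vulnerable to targeted attacks. Pre-training scale and methodology strongly influence robustness.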
Abstract: The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to increased 1) performance compared to Convolutional Neural Networks (CNNs) and 2) robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
[81] UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition
Guojun Lei,Rong Zhang,Chi Wang,Tianhang Liu,Hong Li,Zhiyuan Ma,Weiwei Xu
Main category: cs.CV
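TL;DR: UniTransfer is a novel architecture for video concept transfer that uses spatial and timestep decomposition to achieve precise, controllable video editing and outperforms existing baselines.
Details
Motivation: Existing video concept transfer methods fall short on precise control and editability; UniTransfer aims to address this through spatial and timestep decomposition.
Contribution: 1) Proposes a spatial decomposition (foreground subject, background, motion flow); 2) Designs a dual-to-single-stream DiT architecture; 3) Introduces a self-supervised pretraining strategy; 4) Proposes the Chain-of-Prompt mechanism for timestep decomposition; 5) Builds the OpenAnimal dataset.
Method: 1) Decomposes videos spatially into foreground, background, and motion flow; 2) Uses a DiT architecture for fine-grained control; 3) Pretrains with self-supervision based on random masking; 4) Uses the Chain-of-Prompt mechanism to guide denoising in stages.
Result: Experiments show that UniTransfer surpasses existing baselines in visual fidelity and editability and supports diverse reference images and scenes.
Insight: Spatial and timestep decomposition combined with LLM guidance offers a more controllable way to perform video concept transfer.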
Abstract: We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/
[82] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan,Xinhao Li,Yinan He,Zhengrong Yue,Xiangyu Zeng,Yali Wang,Yu Qiao,Limin Wang,Yi Wang
Main category: cs.CV
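TL;DR: The paper proposes Visual Test-Time Scaling (VTTS), a new method that strengthens the reasoning of multimodal large language models (MLLMs) through iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, and the VTTS-80K dataset supports this paradigm. Experiments show clear gains across multiple tasks and benchmarks.
Details
Motivation: Current MLLMs rely mainly on a static perception stage, which limits their reasoning. The paper aims to strengthen multimodal reasoning with a dynamic, iterative perception mechanism modeled on human attention.
Contribution: 1. Proposes Visual Test-Time Scaling (VTTS), which enhances reasoning through iterative perception; 2. Introduces the VTTS-80K dataset to support iterative perception; 3. Achieves clear performance gains across multiple tasks and benchmarks.
Method: VTTS uses an Iterative Perception (ITP) mechanism that combines reinforcement learning with spatio-temporal supervision to progressively refine the model's focus on high-confidence regions. The VTTS-80K dataset is tailored to support this paradigm.
Result: Experiments show that VideoChat-R1.5 improves by more than 5% on average across more than 15 benchmarks compared with strong baselines such as Qwen2.5VL-3B and -7B.
Insight: A dynamic, iterative perception mechanism can substantially improve the reasoning of MLLMs, supporting the value of mimicking human attention in multimodal tasks.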
Abstract: Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs’ reasoning via iterative perception during inference. VTTS mimics humans’ hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS’s effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced VideoChat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
[83] Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models
Suaiba Amina Salahuddin,Teresa Dorszewski,Marit Almenning Martiniussen,Tone Hovda,Antonio Portaluri,Solveig Thrun,Michael Kampffmeyer,Elisabeth Wetzer,Kristoffer Wickstrøm,Robert Jenssen
Main category: cs.CV
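TL;DR: Mammo-CLIP Dissect is a concept-based explainability framework for mammography models. It uses a mammography-specific vision-language model (Mammo-CLIP) to analyze the clinically relevant concepts learned by neurons, and shows how domain-specific training and task adaptation shape concept learning.
Details
Motivation: Understanding what deep learning models learn from medical images is essential for the safe clinical deployment of AI. Prior work has focused on pixel-level explanations, whereas this paper examines the textual concepts models learn, which are closer to how clinicians reason.
Contribution: Proposes Mammo-CLIP Dissect, the first concept-based explainability framework for mammography. By quantifying how well neurons align with domain knowledge, it reveals how concept learning differs with mammography-specific training and task fine-tuning.
Method: Uses the mammography-specific Mammo-CLIP as a “dissector” to label neurons at specified layers with human-interpretable concepts, then analyzes how training data and fine-tuning strategy affect concept coverage.
Result: Models trained on mammography data learn more clinically relevant concepts and align more closely with radiologists' workflows. Task fine-tuning strengthens the capture of some concepts (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features).
Insight: Domain-specific training and task fine-tuning involve a trade-off in concept learning. The analysis shows how CNNs capture mammography knowledge and offers a new angle on model explanation.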
Abstract: Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a “dissector,” our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists’ workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.
[84] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Sicheng Tao,Jungang Li,Yibo Yan,Junyan Zhang,Yubo Gao,Hanqian Li,ShuHang Xun,Yuxuan Fan,Hong Chen,Jianxiang He,Xuming Hu
Main category: cs.CV
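TL;DR: MOSS-ChatV is a reinforcement learning framework that uses a Dynamic Time Warping (DTW)-based process reward to address process inconsistency in video reasoning, improving the video temporal reasoning of multimodal large language models (MLLMs).
Details
Motivation: Existing MLLMs often show process inconsistency in video reasoning: intermediate reasoning drifts from the video dynamics even when the final answer is correct, which undermines interpretability and robustness.
Contribution: 1. Proposes a DTW-based process reward; 2. Builds MOSS-Video, a benchmark with annotated reasoning traces; 3. Shows that the framework applies across different architectures.
Method: A reinforcement learning framework with a rule-based DTW reward supervises the reasoning process without an auxiliary reward model. The paper also identifies dynamic state prediction as a key measure of video reasoning.
Result: Reaches 87.2% on the MOSS-Video test split and improves on general video benchmarks such as MVBench and MMVU. Evaluation with GPT-4o as judge shows more consistent and stable reasoning traces.
Insight: A process reward can improve the coherence and robustness of video reasoning, and the approach transfers across model architectures.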
Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
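The DTW-based process reward described above can be illustrated with a minimal sketch. It assumes reasoning traces are encoded as sequences of numeric state labels and that the reward decreases with the length-normalized DTW distance. The names `dtw_distance` and `process_reward`, the absolute-difference cost, and the `1 / (1 + d)` mapping are illustrative assumptions, not the paper's formulation.

```python
def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # dp[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            dp[i][j] = step + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


def process_reward(trace, reference):
    """Map DTW distance into (0, 1]; the mapping is an assumed choice."""
    if not trace or not reference:
        return 0.0
    d = dtw_distance(trace, reference) / max(len(trace), len(reference))
    return 1.0 / (1.0 + d)
```

Because DTW allows one element to align with several, a trace that dwells longer on a state than the reference does is not penalized, whereas a trace that visits states in the wrong order is.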
[85] MotionFlow: Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation
Guojun Lei,Chi Wang,Yikai Wang,Hong Li,Ying Song,Weiwei Xu
Main category: cs.CV
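TL;DR: The paper proposes MotionFlow, which converts camera and object motion into pixel motion and uses a stable diffusion network to learn reference motion maps, enabling control of complex camera trajectories in video generation.
Details
Motivation: Existing video generation methods usually learn camera and object motion separately, which can confuse the relative motion between the two. The paper proposes an approach that integrates both.
Contribution: A method that handles camera and object motion jointly, using implicitly learned motion maps and a semantic object prior to generate videos that follow complex camera trajectories.
Method: 1) Convert camera and object motion into pixel motion; 2) learn reference motion maps with a stable diffusion network; 3) feed the maps, together with a semantic object prior, into an image-to-video network to generate the final video.
Result: Experiments show the method outperforms state-of-the-art methods by a large margin.
Insight: Modeling camera and object motion jointly through implicitly learned motion maps allows control of complex camera trajectories while keeping object motion consistent.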
Abstract: Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.
[86] The Unwinnable Arms Race of AI Image Detection
Till Aczel,Lorenzo Vettor,Andreas Plesner,Roger Wattenhofer
Main category: cs.CV
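TL;DR: The paper examines the “arms race” between AI image generation and detection and shows that data dimensionality and complexity are key factors in detector effectiveness.
Details
Motivation: Rapid progress in generative image models has blurred the line between synthetic and real images, fueling a contest between generators and detectors.
Contribution: Analyzes how data dimensionality and complexity affect detectors and shows that intermediate-complexity datasets are the most favorable for detection.
Method: Uses Kolmogorov complexity as a measure of intrinsic dataset structure and analyzes detector performance across complexity levels.
Result: Very simple and highly complex datasets both reduce the detectability of synthetic images, while intermediate-complexity datasets best expose generator errors.
Insight: Detectability is non-monotonic in dataset complexity, which points to directions for more robust detection methods.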
Abstract: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminator's ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.
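Kolmogorov complexity is not computable, so any practical measurement has to rely on a proxy. The sketch below uses compressed size relative to raw size, a common stand-in. The abstract does not say which estimator the authors use, so the choice of `zlib` and the helper name `compression_ratio` are assumptions made for illustration only.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by raw size; lower means more regular data."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, level)) / len(data)


# A constant buffer stands in for a very simple dataset,
# random bytes for a maximally diverse one.
simple = bytes(4096)
noisy = os.urandom(4096)
print(compression_ratio(simple), compression_ratio(noisy))
```

Under this proxy, the paper's intermediate-complexity regime would correspond to ratios well between the two extremes printed here.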
[87] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy
Aymen Bouguerra,Daniel Montoya,Alexandra Gomez-Villa,Fabio Arnez,Chokri Mraidha
Main category: cs.CV
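TL;DR: The paper systematically evaluates how quantization affects CLIP models on reliability metrics, reports several counterintuitive findings, and identifies quantization-aware training (QAT) methods that improve zero-shot accuracy, calibration, and OOD robustness at the same time.
Details
Motivation: Vision-language models such as CLIP generalize well zero-shot, but the effect of quantization on their reliability metrics is underexplored. The paper aims to fill this gap and examine quantization's role in efficient, reliable deployment.
Contribution: 1) A systematic evaluation of quantization's effect on CLIP calibration and robustness; 2) evidence that quantization affects calibration differently depending on the pre-training source; 3) identification of QAT methods that jointly improve zero-shot accuracy, calibration, and OOD detection.
Method: A large-scale evaluation of quantized CLIP models across a suite of reliability metrics (such as calibration and OOD detection), including specific QAT methods.
Result: Quantization consistently improves calibration for typically underconfident pre-trained models but often degrades it for overconfident ones. Even so, OOD detection can still improve for those poorly calibrated models. Certain QAT methods improve several objectives at once.
Insight: Quantization is not only an efficiency tool; in some cases it improves reliability and robustness. Pre-trained models respond differently to quantization, so the choice of quantization method matters.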
Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP’s performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
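Calibration in studies like this is commonly summarized with the expected calibration error (ECE). The abstract does not name the exact metric, so the following is a generic sketch of the standard equal-width-bin ECE rather than the paper's evaluation code; the function name and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - confidence| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

An overconfident model has average confidence above accuracy within its bins and an underconfident one the reverse. ECE reports only the magnitude of the gap, so the direction has to be read from the per-bin values.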
[88] TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Iñigo Alonso,Imanol Miranda,Eneko Agirre,Mirella Lapata
Main category: cs.CV
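TL;DR: TABLET is a large-scale visual table understanding (VTU) dataset with 4 million examples grounded in 2 million unique tables, 88% of which preserve original visualizations. It provides image-HTML pairs, metadata, and provenance information, supports 20 tasks, and improves model robustness on real-world tables.
Details
Motivation: Current table understanding datasets mostly rely on synthetic renderings that lack the complexity and diversity of real-world tables. TABLET aims to fill this gap with a large-scale dataset that preserves original visualizations and covers diverse tasks.
Contribution: (1) A large-scale dataset with 4 million examples and 2 million unique tables; (2) original visualizations preserved for 88% of tables; (3) paired image-HTML representations and provenance information; (4) coverage of 20 tasks, improving robustness and generalization.
Method: Dataset construction gathers tables from existing source datasets, preserves original visualizations, produces image-HTML pairs, and attaches metadata and provenance information. Vision-language models (e.g., Qwen2.5-VL-7B) are then fine-tuned on the dataset.
Result: Models fine-tuned on TABLET perform better on both seen and unseen VTU tasks and are more robust on real-world table visualizations.
Insight: (1) Original visualizations matter for training; (2) large, diverse datasets improve generalization; (3) provenance information makes future evaluation extensible.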
Abstract: While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
[89] Learning Conformal Explainers for Image Classifiers
Amr Alkhatib,Stephanie Lowry
Main category: cs.CV
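TL;DR: The paper proposes a conformal prediction-based approach for generating image-classifier explanations with user-controlled fidelity, using four conformity functions to quantify how well explanations conform to model predictions. FastSHAP performs best in the experiments.
Details
Motivation: Feature attribution methods for explaining image classifiers are often limited by robustness and faithfulness issues. The authors address this with a conformal prediction approach that lets users control fidelity.
Contribution: 1. A novel conformal prediction approach for explanations with controllable fidelity. 2. Four conformity functions that quantify explanation faithfulness. 3. Experiments showing that FastSHAP outperforms competing explainers in fidelity and informational efficiency.
Method: Conformal prediction identifies a subset of salient features sufficient to preserve the model's prediction, without requiring ground-truth explanations for calibration. Four conformity functions quantify how well explanations conform to predictions.
Result: Evaluated with five explainers on six image datasets. FastSHAP performs best in fidelity and explanation-region size, and super-pixel-based conformity measures are more effective than pixel-wise ones.
Insight: 1. Conformal prediction can control explanation fidelity. 2. Super-pixel-based conformity measures are better suited to image explanation.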
Abstract: Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model’s prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model’s predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
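The calibration step behind conformal methods of this kind can be sketched with the standard split-conformal quantile. This is the generic textbook procedure, not the paper's four conformity functions, which the abstract does not specify. The interpretation of `cal_scores` (for example, a per-image score describing how much of the attribution map is needed to preserve the prediction) is a hypothetical choice.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    if n == 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need calibration scores and alpha in (0, 1)")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to guarantee this coverage level
        return float("inf")
    return sorted(cal_scores)[rank - 1]
```

Lowering `alpha` asks for higher fidelity and therefore pushes the threshold toward the largest calibration scores, which in the explanation setting would translate into larger explanation regions.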
[90] SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology
Shakib Khan,Fariba Dambandkhameneh,Nazim Shaikh,Yao Nie,Raghavan Venugopal,Xiao Li
Main category: cs.CV
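TL;DR: SlideMamba is a deep learning framework that combines the Mamba architecture with graph neural networks (GNNs) and uses an entropy-based adaptive fusion strategy to improve representation learning in digital pathology. It performs well on predicting gene fusion and mutation status.
Details
Motivation: Analyzing whole slide images (WSIs) requires capturing both local spatial relationships and long-range contextual dependencies. Existing methods such as MIL or GNNs are limited here, which calls for a more flexible architecture that combines both strengths.
Contribution: 1) SlideMamba, combining Mamba (long-range dependencies) and GNNs (local spatial relationships); 2) an entropy-based adaptive fusion strategy that dynamically balances the two branches; 3) improvements over baselines on predicting gene fusion and mutation status.
Method: 1) Mamba modules capture global long-range dependencies; 2) GNN modules handle local spatial interactions; 3) an entropy-based weighting mechanism adjusts the fusion ratio, giving more weight to the branch with more confident (lower-entropy) predictions.
Result: SlideMamba reaches a PRAUC of 0.751 ± 0.05, outperforming baselines such as MIL and Trans-MIL, with competitive ROC AUC, sensitivity, and specificity.
Insight: Entropy-based fusion offers a general way to assign dynamic weights in multimodal or multi-branch models, especially when a task needs both local and global information.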
Abstract: Advances in computational pathology increasingly rely on extracting meaningful representations from Whole Slide Images (WSIs) to support various clinical and biological tasks. In this study, we propose a generalizable deep learning framework that integrates the Mamba architecture with Graph Neural Networks (GNNs) for enhanced WSI analysis. Our method is designed to capture both local spatial relationships and long-range contextual dependencies, offering a flexible architecture for digital pathology analysis. Mamba modules excel in capturing long-range global dependencies, while GNNs emphasize fine-grained short-range spatial interactions. To effectively combine these complementary signals, we introduce an adaptive fusion strategy that uses an entropy-based confidence weighting mechanism. This approach dynamically balances contributions from both branches by assigning higher weight to the branch with more confident (lower-entropy) predictions, depending on the contextual importance of local versus global information for different downstream tasks. We demonstrate the utility of our approach on a representative task: predicting gene fusion and mutation status from WSIs. Our framework, SlideMamba, achieves an area under the precision recall curve (PRAUC) of 0.751 ± 0.05, outperforming MIL (0.491 ± 0.042), Trans-MIL (0.39 ± 0.017), Mamba-only (0.664 ± 0.063), GNN-only (0.748 ± 0.091), and a prior similar work GAT-Mamba (0.703 ± 0.075). SlideMamba also achieves competitive results across ROC AUC (0.738 ± 0.055), sensitivity (0.662 ± 0.083), and specificity (0.725 ± 0.094). These results highlight the strength of the integrated architecture, enhanced by the proposed entropy-based adaptive fusion strategy, and suggest promising potential for application of spatially-resolved predictive modeling tasks in computational pathology.
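The entropy-based confidence weighting can be sketched as follows. The sketch assumes each branch outputs a class-probability vector and that weights come from a softmax over negative entropies. Both choices, and the function names, are illustrative assumptions; the abstract states only that the lower-entropy branch receives the higher weight.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(p_global, p_local):
    """Fuse two branch predictions, favoring the lower-entropy branch."""
    h_g, h_l = entropy(p_global), entropy(p_local)
    # Softmax over negative entropies: lower entropy gives a larger weight
    e_g, e_l = math.exp(-h_g), math.exp(-h_l)
    w_g = e_g / (e_g + e_l)
    w_l = 1.0 - w_g
    fused = [w_g * g + w_l * l for g, l in zip(p_global, p_local)]
    return fused, (w_g, w_l)
```

Since the weights sum to one and each input is a probability vector, the fused output is also a valid probability vector.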
[91] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
Team Hunyuan3D,Bowen Zhang,Chunchao Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jingwei Huang,Junlin Yu,Kunhong Li,Linus,Penghao Wang,Qingxiang Lin,Sicong Liu,Xianghui Yang,Yixuan Tang,Yunfei Zhao,Zeqiang Lai,Zhihao Liang,Zibo Zhao
Main category: cs.CV
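TL;DR: Hunyuan3D-Omni is a unified framework for fine-grained, controllable 3D asset generation from multiple conditioning inputs (point clouds, voxels, bounding boxes, skeletal poses), improving generation accuracy and robustness in production workflows.
Details
Motivation: Existing 3D generative models rely mainly on image or text conditioning and lack fine-grained cross-modal controls, limiting controllability and practical adoption. The authors aim to close this gap.
Contribution: Hunyuan3D-Omni, a unified framework built on Hunyuan3D 2.1 that accepts multiple conditioning signals (e.g., point clouds, skeletal poses) for precise control over geometry, topology, and pose.
Method: A single cross-modal architecture unifies all conditioning signals. Training uses a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals, encouraging robust multi-modal fusion.
Result: The additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
Insight: A unified architecture with difficulty-aware sampling can handle complex multi-modal inputs and gives practitioners more flexible control.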
Abstract: Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
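The difficulty-aware sampling idea (one control modality per training example, with harder signals drawn more often) can be sketched with a weighted draw. The weights in `MODALITY_WEIGHTS` are hypothetical values chosen for illustration, since the abstract gives no numbers, and the progressive schedule is not modeled here.

```python
import random

# Hypothetical sampling weights: harder signals (skeletal pose) are drawn
# more often than easier ones (point clouds). Not values from the paper.
MODALITY_WEIGHTS = {
    "skeleton": 0.40,
    "bbox": 0.25,
    "voxel": 0.20,
    "point_cloud": 0.15,
}


def sample_control_modality(rng, weights=MODALITY_WEIGHTS):
    """Pick exactly one control modality for a training example."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
batch = [sample_control_modality(rng) for _ in range(8)]
print(batch)
```

Selecting a single modality per example also means the model regularly sees examples where the other controls are absent, which is consistent with the abstract's remark about handling missing inputs.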
[92] Learning to Look: Cognitive Attention Alignment with Vision-Language Models
Ryan L. Yang,Dipkamal Bhusal,Nidhi Rastogi
Main category: cs.CV
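TL;DR: The paper uses vision-language models to automatically generate semantic attention maps and adds an auxiliary loss that aligns CNN attention with these language-guided maps, improving reliability and cognitive plausibility.
Details
Motivation: CNNs often “cheat” by exploiting superficial correlations, and attention supervision that depends on expert annotations does not scale. Inspired by cognitive science, the authors propose an approach that needs no manual annotation.
Contribution: A scalable framework that uses vision-language models to generate semantic attention maps automatically, and an attention alignment loss that improves cognitive plausibility and generalization.
Method: Generate semantic attention maps from natural language prompts, then align CNN attention with these language-guided maps through an auxiliary loss.
Result: State-of-the-art performance on ColoredMNIST and competitive results with annotation-heavy baselines on DecoyMNIST, with reduced shortcut reliance and attention that better matches human intuition.
Insight: Vision-language models can automate attention supervision, reducing reliance on manual annotation while improving interpretability and generalization.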
Abstract: Convolutional Neural Networks (CNNs) frequently “cheat” by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
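An auxiliary alignment loss of the kind described can be sketched in a few lines. This version normalizes both maps to sum to one and takes the mean squared difference. The abstract does not state the loss form, so the MSE choice, the weighting factor `lam`, and all function names are assumptions.

```python
def normalize(attn):
    """Flatten a 2D map, clip negatives, and scale to sum to 1."""
    flat = [max(float(v), 0.0) for row in attn for v in row]
    total = sum(flat)
    if total == 0.0:
        return [1.0 / len(flat)] * len(flat)  # fall back to uniform
    return [v / total for v in flat]


def attention_alignment_loss(cnn_map, vlm_map):
    """Mean squared difference between the two normalized attention maps."""
    p, q = normalize(cnn_map), normalize(vlm_map)
    if len(p) != len(q):
        raise ValueError("attention maps must have the same shape")
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(task_loss, cnn_map, vlm_map, lam=1.0):
    """Task loss plus the weighted auxiliary alignment term."""
    return task_loss + lam * attention_alignment_loss(cnn_map, vlm_map)
```

In a real training loop the same computation would be written with differentiable tensor operations so that gradients flow back into the CNN's attention.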
[93] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations
Zhijian Yang,Noel DSouza,Istvan Megyeri,Xiaojian Xu,Amin Honarmandi Shandiz,Farzin Haddadpour,Krisztian Koos,Laszlo Rusko,Emanuele Valeriano,Bharadwaj Swaninathan,Lei Wu,Parminder Bhatia,Taha Kass-Hout,Erhan Bas
Main category: cs.CV
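TL;DR: Decipher-MR is a vision-language foundation model designed for 3D MRI. It combines self-supervised vision learning with report-guided text supervision to build robust, generalizable representations that support a range of clinical tasks.
Details
Motivation: The complexity and heterogeneity of MRI make automated analysis difficult, and foundation models have seen limited application to MRI. Decipher-MR aims to fill this gap.
Contribution: Decipher-MR, a large-scale vision-language foundation model specific to 3D MRI, with a modular design that supports lightweight task-specific decoders.
Method: Trains with self-supervised vision learning plus report-guided text supervision, and adapts to tasks by attaching lightweight decoders to a frozen pretrained encoder.
Result: Outperforms existing foundation models and task-specific approaches on disease classification, demographic prediction, anatomical localization, and cross-modal retrieval.
Insight: Decipher-MR offers a scalable, general foundation for MRI-based AI with low computational overhead for adaptation.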
Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.
[94] Instruction-tuned Self-Questioning Framework for Multimodal Reasoning
You-Won Jang,Yu-Jung Heo,Jaeseok Kim,Minsu Lee,Du-Seong Chang,Byoung-Tak Zhang
Main category: cs.CV
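TL;DR: The paper proposes SQ-InstructBLIP, a framework that improves reasoning in vision-language understanding through multi-step self-questioning.
Details
Motivation: LLM-based approaches to multi-step reasoning have two problems: LLMs that cannot read visual information have no access to fine-grained image content, and black-box LLMs have inaccessible internals that are hard to reproduce. The paper addresses both.
Contribution: SQ-InstructBLIP, in which a Questioner, Answerer, and Reasoner that share the same architecture iteratively generate image-aware sub-questions and sub-answers to improve multimodal reasoning accuracy.
Method: SQ-InstructBLIP has three components: the Questioner generates sub-questions, the Answerer generates sub-answers, and the Reasoner answers the main question using this information. Sub-questions are generated iteratively.
Result: On VQA tasks, SQ-InstructBLIP uses the generated sub-question information to reason more accurately than previous work.
Insight: A modular design with iterative self-questioning makes effective use of visual information and improves the interpretability of reasoning.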
Abstract: The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, the field still struggles with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages such as 1) the fine-grained visual contents of images are not available using LLMs that cannot read visual information, 2) internal mechanisms are inaccessible and difficult to reproduce by using black-box LLMs. To solve these problems, we propose the SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. SQ-InstructBLIP consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.
[95] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation
Seyed Amir Kasaei,Mohammad Hossein Rohban
Main category: cs.CV
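TL;DR: The paper offers a new perspective on text-to-image (T2I) evaluation: it defines hallucination as bias-driven deviation and proposes a taxonomy of three categories (attribute, relation, and object), giving a richer framework for evaluating T2I models.
Details
Motivation: Existing T2I evaluation focuses on alignment (whether prompt-specified elements appear) and overlooks what the model generates beyond the prompt, i.e., hallucination. The authors aim to fill this gap.
Contribution: 1. Defines hallucination in T2I as bias-driven deviation; 2. Proposes a three-category taxonomy (attribute, relation, object); 3. Frames hallucination as an upper bound for evaluation, surfacing hidden model biases.
Method: Proposes a taxonomy of T2I hallucination covering attribute, relation, and object hallucinations.
Result: The framing adds a dimension to T2I evaluation and surfaces hidden biases in models.
Insight: Hallucination is both a new lens for evaluating T2I models and a way to identify and address model biases.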
Abstract: In language and vision-language models, hallucination is broadly understood as content generated from a model’s prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.
[96] Dense Semantic Matching with VGGT Prior
Songlin Yang,Tianyi Wei,Yushi Lan,Zeqi Xiao,Anyi Rao,Xingang Pan
Main category: cs.CV
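TL;DR: The paper proposes a dense semantic matching method built on a VGGT prior. It addresses geometric ambiguity and the limits of nearest-neighbor matching in existing methods by adapting VGGT to semantic matching.
Details
Motivation: Existing semantic matching methods suffer from geometric ambiguity and the limitations of the nearest-neighbor rule, which calls for geometry-aware pixel descriptors and holistic dense correspondence. VGGT provides geometry-grounded features and dense matching, but it was designed for matching across views of a single instance, so it must be adapted to cross-instance semantic matching under scarce annotations.
Contribution: 1. An adaptation of VGGT that reuses its early feature stages, fine-tunes later ones, and adds a semantic head for bidirectional correspondences; 2. A cycle-consistent training strategy, synthetic data augmentation, and a progressive training recipe to cope with data scarcity; 3. Experiments showing gains over baselines in geometry awareness, matching reliability, and manifold preservation.
Method: 1. Reuse VGGT's early feature stages, fine-tune the later stages, and add a semantic head; 2. Train with cycle consistency, synthetic data augmentation, and a progressive recipe; 3. Mitigate aliasing artifacts.
Result: The method outperforms previous baselines in geometry awareness, matching reliability, and manifold preservation.
Insight: Transferring and adapting features from geometric foundation models such as VGGT is key; data scarcity can be eased with synthetic data and progressive training; cycle-consistent training helps cross-instance matching.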
Abstract: Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking generalization; (ii) Nearest-Neighbor Rule: Their pixel-wise matching ignores cross-image invisibility and neglects manifold preservation. These challenges call for geometry-aware pixel descriptors and holistic dense correspondence mechanisms. Inspired by recent advances in 3D geometric foundation models, we turn to VGGT, which provides geometry-grounded features and holistic dense matching capabilities well aligned with these needs. However, directly transferring VGGT is challenging, as it was originally designed for geometry matching within cross views of a single instance, misaligned with cross-instance semantic matching, and further hindered by the scarcity of dense semantic annotations. To address this, we propose an approach that (i) retains VGGT’s intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; and (ii) adapts VGGT to the semantic matching scenario under data scarcity through cycle-consistent training strategy, synthetic data augmentation, and progressive training recipe with aliasing artifact mitigation. Extensive experiments demonstrate that our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
[97] MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
Xinyu Liu,Guolei Sun,Cheng Wang,Yixuan Yuan,Ender Konukoglu
Main category: cs.CV
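TL;DR: MedVSR is a framework tailored to medical video super-resolution. Cross State-Space Propagation (CSSP) addresses alignment, and an Inner State-Space Reconstruction (ISSR) module enhances tissue structures and reduces artifacts, outperforming existing VSR models.
Details
Motivation: High-resolution medical videos are vital for diagnosis but hard to acquire because of hardware limitations and physiological constraints. Existing VSR models struggle with alignment and introduce artifacts on low-resolution medical videos, which can mislead diagnosis.
Contribution: 1. MedVSR, a framework designed for medical video; 2. The CSSP module, which uses state-space models for alignment; 3. The ISSR module, which enhances tissue structures and reduces artifacts; 4. Validation on multiple medical datasets.
Method: 1. CSSP projects distant frames as control matrices within state-space models and selectively propagates consistent, informative features to improve alignment; 2. ISSR combines long-range spatial feature learning with large-kernel short-range information aggregation.
Result: Across four datasets covering scenarios such as endoscopy and cataract surgery, MedVSR outperforms existing VSR models in reconstruction performance and efficiency.
Insight: Designing for the characteristics of medical video, such as continuous tissue structures, addresses alignment and artifact problems and makes super-resolution more useful for diagnosis.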
Abstract: High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.
[98] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Sicong Leng,Jing Wang,Jiaxi Li,Hao Zhang,Zhiqiang Hu,Boqiang Zhang,Yuming Jiang,Hang Zhang,Xin Li,Lidong Bing,Deli Zhao,Wei Lu,Yu Rong,Aixin Sun,Shijian Lu
Main category: cs.CV
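TL;DR: The paper improves the stability and performance of multimodal reasoning models with Variance-Aware Sampling (VAS) and open-sources long chain-of-thought data and RL QA pairs.
Details
Motivation: Progress in multimodal reasoning models is constrained by the lack of open, high-quality long chain-of-thought (CoT) data and by instability in RL post-training, such as gradient vanishing when reward variance is low.
Contribution: 1) VAS, a data selection strategy guided by the Variance Promotion Score (VPS) that promotes reward variance; 2) release of ~1.6M long CoT cold-start examples and ~15k RL QA pairs; 3) an open-source family of multimodal reasoning models at multiple scales.
Method: VAS selects data using VPS, which combines outcome variance and trajectory diversity, to stabilize policy optimization under GRPO.
Result: Experiments on mathematical reasoning benchmarks show that both VAS and the curated data are effective. The paper also proves that reward variance lower-bounds the expected policy gradient magnitude.
Insight: Reward variance and data diversity are central to stable RL training.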
Abstract: Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
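The link between low reward variance and vanishing gradients under GRPO can be illustrated with a short sketch. It uses the commonly cited group-normalized advantage, (r - mean) / (std + eps), and the variance of a binary pass/fail reward. It covers only the outcome-variance half of VPS; the trajectory-diversity term and the paper's exact scoring are not reproduced, and the function names are assumptions.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def bernoulli_reward_variance(pass_rate):
    """Variance of a binary reward with success probability pass_rate."""
    return pass_rate * (1.0 - pass_rate)


# All rollouts agree, so every advantage is zero and the prompt
# contributes no learning signal.
print(group_advantages([1, 1, 1, 1]))
# Mixed outcomes yield non-zero advantages.
print(group_advantages([1, 0, 1, 0]))
```

A prompt the model always solves or always fails has zero reward variance, while a pass rate near one half maximizes it, which is why sampling toward higher-variance prompts can keep the optimization signal alive.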
[99] A Sentinel-3 foundation model for ocean colour
Geoffrey Dawson,Remy Vandaele,Andrew Taylor,David Moffat,Helen Tamura-Wicks,Sarah Jackson,Rosie Lickorish,Paolo Fraccaro,Hywel Williams,Chunbo Luo,Anne Jones
Main category: cs.CV
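TL;DR: The paper presents an ocean colour foundation model based on the Prithvi-EO Vision Transformer architecture, pre-trained in a self-supervised way to reconstruct Sentinel-3 OLCI data, and validates it on downstream tasks such as chlorophyll concentration estimation and ocean primary production estimation.
Details
Motivation: Labelled data in ocean science are sparse and expensive to collect; pre-trained foundation models (FMs) can exploit large amounts of unlabelled data to improve marine monitoring tasks. Contribution: 1) Proposes a Sentinel-3 foundation model for ocean colour; 2) validates the model's advantages on chlorophyll concentration and ocean primary production estimation.
Method: Uses the Prithvi-EO Vision Transformer architecture with self-supervised pre-training, learns feature representations by reconstructing Sentinel-3 OLCI data, and fine-tunes on downstream tasks.
Result: The model performs well with small amounts of high-quality labelled data, captures detailed spatial patterns of ocean colour, and matches point observations.
Insight: The study shows the potential of foundation models in ocean science to provide more robust, data-driven analysis of ocean ecosystems and global climate processes.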
Abstract: Artificial Intelligence (AI) foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture, which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FM's ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high-quality labelled data and for capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.
[100] Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Shilin Lu,Zhuming Lian,Zihan Zhou,Shaocong Zhang,Chen Zhao,Adams Wai-Kin Kong
Main category: cs.CV
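TL;DR: The paper proposes SHINE, a training-free framework for physically plausible image composition under complex lighting and high-resolution inputs, using pretrained adapters and a new loss function to improve performance.
Details
Motivation: Existing image composition methods struggle with complex lighting (e.g., shadows, reflections) and high-resolution inputs; existing diffusion models (e.g., FLUX) already hold the relevant priors but lack a suitable framework to exploit them. Contribution: 1. Proposes SHINE, a training-free framework for high-quality image composition; 2. introduces manifold-steered anchor loss and adaptive background blending; 3. proposes the ComplexCompo benchmark, covering challenging scenes and diverse resolutions.
Method: 1. Uses pretrained adapters (e.g., IP-Adapter) to guide latents; 2. proposes degradation-suppression guidance and adaptive background blending; 3. preserves subject and background integrity through manifold-steered anchor loss.
Result: Achieves SOTA performance on ComplexCompo and DreamEditBench, with strong scores on metrics such as DINOv2 and DreamSim.
Insight: SHINE shows the potential of pretrained diffusion models for image composition: their physical priors can be used without additional training, offering a new approach to composition in complex scenes.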
Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
[101] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
Yu Yuan,Xijun Wang,Tharindu Wickremasinghe,Zeeshan Nadir,Bole Ma,Stanley H. Chan
Main category: cs.CV
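TL;DR: NewtonGen is a text-to-video generation framework that integrates data-driven synthesis with learnable physical principles through Neural Newtonian Dynamics (NND), enabling physically consistent and highly controllable video generation.
Details
Motivation: Current text-to-video models are bottlenecked by physical consistency (e.g., implausible object motion) and parameter controllability, mainly because they lack an understanding of the underlying dynamics. Contribution: Proposes the NewtonGen framework, which for the first time brings trainable Neural Newtonian Dynamics (NND) into video generation, enabling physically consistent motion and precise parameter control.
Method: Combines data priors with dynamical guidance, using NND to model Newtonian dynamics and inject it as a latent constraint into the video generation process.
Result: NewtonGen generates physically consistent videos with controllable motion and significantly outperforms existing methods.
Insight: Explicitly modeling physical laws is key to improving generative models in dynamic scenes.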
Abstract: A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
[102] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
Hmrishav Bandyopadhyay,Rahim Entezari,Jim Scott,Reshinth Adithyan,Yi-Zhe Song,Varun Jampani
Main category: cs.CV
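TL;DR: SD3.5-Flash is an efficient few-step distillation framework that uses a distribution matching objective and new techniques (timestep sharing and split-timestep fine-tuning) to bring fast, high-quality image generation to consumer devices.
Details
Motivation: Current generative models are computationally expensive and hard to run efficiently on consumer devices. The authors use distillation to make the model suitable for hardware ranging from mobile phones to desktop computers. Contribution: 1. Proposes a distribution matching objective tailored to few-step generation; 2. introduces timestep sharing and split-timestep fine-tuning; 3. optimizes the generation pipeline through text encoder restructuring and quantization.
Method: Distills rectified flow models using the distribution matching objective, timestep sharing, and split-timestep fine-tuning, together with comprehensive pipeline optimizations.
Result: Experiments and user studies show that SD3.5-Flash outperforms existing few-step methods and enables efficient generation and deployment.
Insight: With new distillation and optimization techniques, high-quality generative models can run efficiently on resource-constrained devices, supporting broader practical adoption of generative AI.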
Abstract: We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: “timestep sharing” to reduce gradient noise and “split-timestep fine-tuning” to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.
cs.LG [Back]
[103] CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Zhenpeng Su,Leiyu Pan,Minxuan Lv,Yuntao Li,Wenping Hu,Fuzheng Zhang,Kun Gai,Guorui Zhou
Main category: cs.LG
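TL;DR: CE-GPPO is a new reinforcement learning optimization algorithm that reintroduces gradients from clipped tokens in PPO in a gentle and bounded manner, stabilizing policy entropy and significantly improving performance on mathematical reasoning tasks.
Details
Motivation: Existing methods such as PPO discard gradient signals from low-probability tokens because of the clipping mechanism, yet these signals are critical for regulating policy entropy. The work proposes a method that preserves these gradients. Contribution: Proposes the CE-GPPO algorithm, which reintroduces gradients from clipped tokens while controlling their magnitude, achieves an exploration-exploitation trade-off, and is supported by both theory and experiments.
Method: Building on standard PPO, CE-GPPO restores gradients from clipped tokens in a gentle and bounded manner to regulate the dynamics of policy entropy and improve training.
Result: On multiple mathematical reasoning benchmarks, CE-GPPO significantly outperforms baselines across model scales, with stable entropy control and performance gains.
Insight: Gradients from low-probability tokens play an important role in regulating policy entropy dynamics; controlling them appropriately can significantly improve RL training.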
Abstract: Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
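To make the clipping issue concrete, here is a minimal sketch of the standard PPO clipped surrogate for a single token, together with a finite-difference check of its gradient with respect to the probability ratio. It shows only the baseline behavior the abstract criticizes (zero gradient outside the clipping interval); it does not reproduce how CE-GPPO reintroduces those gradients, and the function names and the `eps=0.2` default are illustrative assumptions.

```python
def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard per-token PPO objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def surrogate_grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite difference of the surrogate with respect to the ratio."""
    up = ppo_clip_surrogate(ratio + h, advantage, eps)
    down = ppo_clip_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)


# Inside the clipping interval the token contributes a gradient.
inside = surrogate_grad_wrt_ratio(1.0, advantage=1.0)

# Ratio above 1 + eps with positive advantage: the gradient is cut to zero.
clipped_high = surrogate_grad_wrt_ratio(1.5, advantage=1.0)

# Ratio below 1 - eps with negative advantage: the gradient is cut to zero.
clipped_low = surrogate_grad_wrt_ratio(0.5, advantage=-1.0)

print(inside, clipped_high, clipped_low)
```

Tokens in the two clipped cases carry no learning signal under plain PPO, which is the gradient information the abstract describes as discarded.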
[104] StyleBench: Evaluating thinking styles in Large Language Models
Junyu Guo,Shangding Gu,Ming Jin,Costas Spanos,Javad Lavaei
Main category: cs.LG
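TL;DR: StyleBench is a benchmark for evaluating how different reasoning styles perform in large language models. By analyzing five reasoning styles across 15 open-source models, it shows that the effectiveness of a style depends closely on model scale and task type.
Details
Motivation: Reasoning style strongly affects LLM task performance, but the relationship among style, model architecture, and task type is unclear. StyleBench aims to evaluate this relationship systematically. Contribution: 1. Proposes the StyleBench benchmark, evaluating five reasoning styles; 2. shows that style effectiveness is tied to model scale and task type; 3. finds that smaller models are more prone to output failures.
Method: Evaluates five reasoning styles (CoT, ToT, AoT, SoT, CoD) on five reasoning tasks using 15 open-source models (270M to 120B parameters) in a large-scale analysis.
Result: 1. No single reasoning style is universally optimal; 2. search-based methods (AoT, ToT) suit open-ended problems but require large models; 3. concise styles (SoT, CoD) are more efficient on well-defined tasks.
Insight: Model scale is crucial for reasoning robustness, and smaller models fail more often; task type and model scale are the key factors in choosing a reasoning style.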
Abstract: The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, namely Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.
[105] DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?
Yiyou Sun,Yuhan Cao,Pohao Huang,Haoyue Bai,Hannaneh Hajishirzi,Nouha Dziri,Dawn Song
Main category: cs.LG
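TL;DR: DELTA-Code is a synthetic programming-problem benchmark for probing whether LLMs can learn and transfer new algorithmic skills through reinforcement learning (RL). Experiments reveal a "grokking" phenomenon, and the paper explores key training strategies.
Details
Motivation: Whether LLMs can learn and generalize genuinely new reasoning strategies through RL, or only rely on skills from pre-training or fine-tuning, remains an open question. DELTA-Code is designed to address it. Contribution: Proposes the DELTA-Code benchmark, focused on the learnability and transferability of algorithms; reveals a "grokking" phenomenon in RL training; explores key training strategies that improve learning.
Method: Trains LLMs with RL on problems produced by synthetic problem-generation templates; studies the effect of staged dense rewards, experience replay, curriculum training, and related methods.
Result: RL-trained models abruptly reach high accuracy after a long period of low reward; they do well within families and on recomposed skills but remain limited on transformative tasks.
Insight: DELTA-Code provides a clean testbed for RL-driven reasoning, revealing both the potential and the limits of LLMs learning beyond their existing priors.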
Abstract: It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To address this debate, we introduce DELTA-Code (Distributional Evaluation of Learnability and Transferability in Algorithmic Coding), a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability (can LLMs, through reinforcement learning (RL), solve problem families on which pretrained models fail even with a large number of attempts, i.e., pass@K=0?) and transferability (if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets?). Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.
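The abstract above uses pass@K=0 as its criterion for a problem family that a pretrained model cannot solve. The source does not say how pass@K is computed; the sketch below assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where `n` samples are drawn and `c` of them are correct. The function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    Assumption: n total samples were drawn and c of them passed the tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# No correct sample at all: the estimate is exactly zero for every k.
print(pass_at_k(n=64, c=0, k=64))

# One correct sample out of ten, single attempt.
print(pass_at_k(n=10, c=1, k=1))
```

With this estimator, pass@K=0 means that none of the drawn samples solved the problem, which matches the abstract's description of failure even with a large number of attempts.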
[106] ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
Qizhi Pei,Zhuoshi Pan,Honglin Lin,Xin Gao,Yu Li,Zinan Tang,Conghui He,Rui Yan,Lijun Wu
Main category: cs.LG
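TL;DR: ScaleDiff proposes an efficient way to generate difficult mathematical problems: an adaptive thinking model filters difficult problems, and a specialized generator is trained on them, significantly improving performance on mathematical reasoning tasks.
Details
Motivation: Existing methods for generating difficult mathematical problems suffer from high computational cost, complex prompting, and limited difficulty of the generated problems, so a more efficient solution is needed. Contribution: 1. Proposes the ScaleDiff pipeline, which uses an adaptive thinking model to identify difficult problems efficiently. 2. Trains a specialized generator, DiffGen-8B, that produces difficult problems at scale without complex prompting. 3. Significantly improves model performance on multiple mathematical reasoning benchmarks.
Method: 1. Uses an adaptive thinking model to identify difficult problems quickly. 2. Trains the DiffGen-8B generator on the filtered data. 3. Fine-tunes a model (e.g., Qwen2.5-Math-7B-Instruct) on the ScaleDiff-Math dataset.
Result: An 11.3% performance gain and a 65.9% average accuracy across several math competition benchmarks, outperforming OpenThinker3.
Insight: 1. Efficiently filtering and generating difficult problems can significantly improve model capability. 2. High performance is achievable without relying on expensive large models.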
Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between “Thinking” and “NoThinking” modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems in large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME’24, AIME’25, HMMT-Feb’25, BRUMO’25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
[107] Beyond Visual Similarity: Rule-Guided Multimodal Clustering with explicit domain rules
Kishor Datta Gupta,Mohd Ariful Haque,Marufa Kamal,Ahmed Rafi Hasan,Md. Mahfuzur Rahman,Roy George
Main category: cs.LG
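TL;DR: The paper proposes DARTVAE, a multimodal clustering method that incorporates domain rules, unifying explicit rules, semantic representations, and data-driven features in a single latent space to produce more meaningful and interpretable clusters.
Details
Motivation: Traditional clustering relies only on similarity in the input data and struggles to capture structural or semantic constraints, which are often critical in real-world domains. Contribution: Proposes the DARTVAE framework, which embeds domain rules into a variational autoencoder as learning signals, enabling rule-driven multimodal clustering.
Method: DARTVAE extends the VAE architecture, learning a unified latent representation through consistency and violation penalties in the loss function, combined with knowledge graphs built from LLM-generated rules.
Result: Experiments on aircraft and automotive datasets show that rule-guided clustering yields more operationally meaningful and interpretable clusters while improving traditional clustering metrics.
Insight: Combining rule learning with data-driven representations improves clustering quality and interpretability, but LLM-generated rules may conflict, and too many rules risk overfitting.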
Abstract: Traditional clustering techniques often rely solely on similarity in the input data, limiting their ability to capture structural or semantic constraints that are critical in many domains. We introduce the Domain Aware Rule Triggered Variational Autoencoder (DARTVAE), a rule-guided multimodal clustering framework that incorporates domain-specific constraints directly into the representation learning process. DARTVAE extends the VAE architecture by embedding explicit rules, semantic representations, and data-driven features into a unified latent space, while enforcing constraint compliance through rule consistency and violation penalties in the loss function. Unlike conventional clustering methods that rely only on visual similarity or apply rules as post hoc filters, DARTVAE treats rules as first-class learning signals. The rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties. Experiments on aircraft and automotive datasets demonstrate that rule-guided clustering produces more operationally meaningful and interpretable clusters (for example, isolating UAVs, unifying stealth aircraft, or separating SUVs from sedans) while improving traditional clustering metrics. However, the framework faces challenges: LLM-generated rules may hallucinate or conflict, excessive rules risk overfitting, and scaling to complex domains increases computational and consistency difficulties. By combining rule encodings with learned representations, DARTVAE achieves more meaningful and consistent clustering outcomes than purely data-driven models, highlighting the utility of constraint-guided multimodal clustering for complex, knowledge-intensive settings.
[108] Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport
Annabel Ma,Kaiying Hou,David Alvarez-Melis,Melanie Weber
Main category: cs.LG
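TL;DR: The paper proposes Bispectral Optimal Transport (Bispectral OT), a symmetry-aware discrete optimal transport method that represents data elements by their bispectrum to preserve signal structure while removing variation due to group actions.
Details
Motivation: In symmetry-rich settings, optimal transport (OT) based on geometric distances between raw features can ignore the intrinsic coherence structure of the data, so a method is needed that respects symmetry when comparing data. Contribution: The main contribution is Bispectral OT, which represents data using the bispectrum, a group Fourier invariant, removing variation due to group actions while preserving the signal's structural information.
Method: A symmetry-aware extension of discrete OT that represents data elements by their bispectrum, capturing semantic label structure and removing nuisance variation.
Result: Experiments show that Bispectral OT achieves higher class preservation accuracy than naive feature OT on benchmark datasets transformed with visual symmetries.
Insight: The bispectrum representation preserves the semantic structure of the data while removing symmetry-induced variation, giving a more robust and meaningful way to compare datasets.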
Abstract: Optimal transport (OT) is a widely used technique in machine learning, graphics, and vision that aligns two distributions or datasets using their relative geometry. In symmetry-rich settings, however, OT alignments based solely on pairwise geometric distances between raw features can ignore the intrinsic coherence structure of the data. We introduce Bispectral Optimal Transport, a symmetry-aware extension of discrete OT that compares elements through their bispectrum representation. The bispectrum is a group Fourier invariant that preserves all signal structure while removing only the variation due to group actions. Empirically, we demonstrate that the transport plans computed with Bispectral OT achieve greater class preservation accuracy than naive feature OT on benchmark datasets transformed with visual symmetries, improving the quality of meaningful correspondences that capture the underlying semantic label structure in the dataset while removing nuisance variation not affecting class or content.
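The invariance property mentioned in the abstract above can be checked on the simplest case. The sketch below assumes a 1-D signal acted on by cyclic shifts and uses the classical bispectrum B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)); the source does not state which groups or implementation the authors use, and the optimal transport step itself is not shown. All function names are illustrative.

```python
import cmath


def dft(x):
    """Plain discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bispectrum(x):
    """Cyclic-group bispectrum: B[k1][k2] = F[k1] * F[k2] * conj(F[k1 + k2])."""
    f = dft(x)
    n = len(f)
    return [
        [f[k1] * f[k2] * f[(k1 + k2) % n].conjugate() for k2 in range(n)]
        for k1 in range(n)
    ]


def max_abs_diff(a, b):
    """Largest entrywise difference between two bispectrum matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


signal = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0]
shifted = signal[2:] + signal[:2]  # the same signal, cyclically shifted

# The raw Fourier coefficients change under the shift, the bispectrum does not
# (up to floating-point error).
print(max(abs(p - q) for p, q in zip(dft(signal), dft(shifted))))
print(max_abs_diff(bispectrum(signal), bispectrum(shifted)))
```

Comparing elements by such invariant features instead of raw coordinates is the idea the abstract describes; a cost matrix built from these features could then be passed to any discrete OT solver.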
[109] FERD: Fairness-Enhanced Data-Free Robustness Distillation
Zhengxiao Li,Liming Lu,Xu Zheng,Siyuan Liang,Zhenghan Chen,Yongbin Zhou,Shuchao Pang
Main category: cs.LG
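TL;DR: FERD is a fairness-enhanced data-free robustness distillation framework that adjusts the proportion and distribution of adversarial examples to address the robustness disparity across classes in existing methods.
Details
Motivation: Existing data-free robustness distillation methods overlook robust fairness, leading to large robustness disparities across classes. FERD aims to solve this problem. Contribution: 1) Proposes FERD, the first fairness-enhanced data-free robustness distillation framework; 2) generates more balanced adversarial examples through a robustness-guided class reweighting strategy and a uniformity constraint on feature-level predictions.
Method: 1) Adjusts the proportion of adversarial examples with a robustness-guided class reweighting strategy; 2) generates Fairness-Aware Examples (FAEs) and Uniform-Target Adversarial Examples (UTAEs) to improve the distribution.
Result: On three public datasets, FERD substantially improves worst-class robustness under all adversarial attacks (e.g., on CIFAR-10, worst-class robustness under FGSM and AutoAttack improves by 15.1% and 6.4%, respectively).
Insight: Fairness and robustness can be optimized jointly; dynamically adjusting the proportion and distribution of adversarial examples is key to improving overall model performance.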
Abstract: Data-Free Robustness Distillation (DFRD) aims to transfer robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook robust fairness, leading to severe disparity in robustness across categories. In this paper, we identify two key problems: (1) a student model distilled with equal class-proportion data behaves significantly differently across categories; and (2) the robustness of the student model is not stable across different attack targets. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework, which adjusts the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving their robustness. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppresses the dominance of class-specific non-robust features and provides a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distributes the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attacks (e.g., the worst-class robustness under FGSM and AutoAttack is improved by 15.1% and 6.4%, respectively, using MobileNet-V2 on CIFAR-10), demonstrating superior performance in both robustness and fairness.
[110] CaTS-Bench: Can Language Models Describe Numeric Time Series?
Luca Zhou,Pratham Yashwante,Marshall Fisher,Alessio Sampieri,Zihao Zhou,Fabio Galasso,Rose Yu
Main category: cs.LG
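TL;DR: CaTS-Bench is the first large-scale, real-world benchmark for context-aware time series captioning, with diverse datasets and tasks plus new evaluation metrics and methods.
Details
Motivation: Existing time series captioning work relies on synthetic data or overly simple captions and lacks real-world context and visual representations, limiting its depth and practical value. Contribution: 1. Introduces CaTS-Bench, the first large-scale context-aware time series captioning benchmark; 2. proposes a scalable pipeline for generating reference captions; 3. provides 460 multiple-choice questions testing deeper time series reasoning; 4. proposes new evaluation metrics and benchmarks leading vision-language models (VLMs).
Method: CaTS-Bench is built from 11 diverse datasets; each sample contains a numeric series, metadata, a line-chart image, and a caption. Reference captions are mostly generated by a strong LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, and 579 human-revised test captions are also provided.
Result: CaTS-Bench comprises roughly 465k training and 105k test timestamps, and the evaluation highlights both the strengths and the limitations of VLMs on time series captioning.
Insight: 1. The complexity of real-world data and time series captioning calls for stronger contextual understanding and numerical reasoning; 2. human revision of LLM-generated captions can noticeably improve quality and style; 3. VLMs perform well on time series analysis but still need improvement.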
Abstract: Time series captioning, the task of describing numeric time series in natural language, requires numerical reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on synthetic data or overly simplistic captions, and typically neglect metadata and visual representations. To close this gap, we introduce CaTS-Bench, the first large-scale, real-world benchmark for Context-aware Time Series captioning. CaTS-Bench is derived from 11 diverse datasets reframed as captioning and Q&A tasks, comprising roughly 465k training and 105k test timestamps. Each sample includes a numeric series segment, contextual metadata, a line-chart image, and a caption. A key contribution of this work is the scalable pipeline used to generate reference captions: while most references are produced by an oracle LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, we also provide a human-revisited subset of 579 test captions, refined from LLM outputs to ensure accuracy and human-like style. Beyond captioning, CaTS-Bench offers 460 multiple-choice questions targeting deeper aspects of time series reasoning. We further propose new tailored evaluation metrics and benchmark leading VLMs, highlighting both their strengths and persistent limitations. Together, these contributions establish CaTS-Bench and its captioning pipeline as a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.
[111] FHRFormer: A Self-supervised Transformer Approach for Fetal Heart Rate Inpainting and Forecasting
Kjersti Engan,Neel Kanwal,Anita Yeconia,Ladislaus Blacy,Yuda Munyaw,Estomih Mduma,Hege Ersdal
Main category: cs.LG
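TL;DR: The paper proposes FHRFormer, a masked transformer-based autoencoder for inpainting and forecasting missing data in fetal heart rate signals, addressing the failure of traditional interpolation to preserve the signals' spectral characteristics.
Details
Motivation: Fetal heart rate monitoring is crucial in prenatal care, but signal dropouts limit effective data analysis and the use of AI algorithms. Traditional methods handle missing data poorly, so a new method is needed that captures the spatial and frequency characteristics of the signal. Contribution: 1. Proposes FHRFormer, a self-supervised transformer approach; 2. handles both signal inpainting and forecasting; 3. is robust across varying durations of missing data.
Method: Uses a masked transformer autoencoder that reconstructs missing FHR data by capturing the spatial and frequency components of the signal; the method supports both inpainting and forecasting.
Result: FHRFormer performs well on signal inpainting and forecasting, remains robust across different gap durations, and outperforms traditional interpolation.
Insight: Beyond retrospective processing of research datasets, the method could in future be integrated into wearable devices for earlier and more reliable detection of fetal risk.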
Abstract: Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropouts, resulting in gaps in the recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handle missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both spatial and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.
[112] Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
Killian Steunou,Sigurd Saue,Théo Druilhe
Main category: cs.LG
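TL;DR: The paper studies how sparse representations, via sparse principal component analysis (SPCA), improve the adversarial robustness of neural network classifiers; theory and experiments both indicate that SPCA resists adversarial attacks better than standard PCA.
Details
Motivation: Deep neural networks excel at image classification but are highly vulnerable to adversarial perturbations. The paper aims to improve adversarial robustness through sparse representations such as SPCA and examines the theoretical mechanism and practical effect. Contribution: 1) Derives robustness certificates for linear classifiers on SPCA features; 2) shows experimentally that SPCA is more robust than PCA with non-linear classifiers; 3) identifies the mechanism: sparser projections reduce adversarial leverage.
Method: 1) Uses PCA and SPCA as front-end feature extractors; 2) analyzes the robustness of classifiers on SPCA features under $\ell_\infty$ and $\ell_2$ threat models; 3) evaluates SPCA experimentally under adversarial attacks.
Result: SPCA degrades more gracefully under strong white-box and black-box attacks while maintaining competitive clean accuracy. The theory shows that sparse projections reduce adversarial leverage.
Insight: Sparse representations such as SPCA improve classifier robustness by reducing adversarial leverage in feature space, and the effect persists in the non-linear setting.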
Abstract: Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
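The dependence of the certified radius on the dual norm of $W^\top u$ can be worked through for the binary linear case. The sketch below assumes features $z = Wx$ with $W$ of shape (k, d) and a score $u^\top z + b$; by Hölder's inequality a perturbation $\delta$ changes the score by at most $\|\delta\|_p \, \|W^\top u\|_q$, so the sign cannot flip while $\|\delta\|_p < |\text{score}| / \|W^\top u\|_q$, with $q = 1$ for $\ell_\infty$ and $q = 2$ for $\ell_2$. The shape convention, function names, and toy numbers are assumptions for illustration, not the paper's code or data.

```python
import math


def effective_weights(W, u):
    """Return v = W^T u for a projection W (k rows, d columns) and head weights u."""
    d = len(W[0])
    return [sum(W[i][j] * u[i] for i in range(len(u))) for j in range(d)]


def certified_radius(W, u, b, x, threat="linf"):
    """Radius within which a binary linear head on z = W x cannot change sign."""
    v = effective_weights(W, u)
    margin = abs(sum(vj * xj for vj, xj in zip(v, x)) + b)
    if threat == "linf":
        dual = sum(abs(vj) for vj in v)  # dual of l_inf is l_1
    elif threat == "l2":
        dual = math.sqrt(sum(vj * vj for vj in v))  # l_2 is self-dual
    else:
        raise ValueError("threat must be 'linf' or 'l2'")
    return math.inf if dual == 0 else margin / dual


# Toy example: two unit-norm projection rows, one dense and one sparse.
dense_W = [[0.5, 0.5, 0.5, 0.5]]
sparse_W = [[1.0, 0.0, 0.0, 0.0]]
u, b = [1.0], 0.0
x = [2.0, 1.0, 1.0, 0.0]  # chosen so both projections give the same margin

print(certified_radius(dense_W, u, b, x, "linf"))   # 1.0
print(certified_radius(sparse_W, u, b, x, "linf"))  # 2.0
print(certified_radius(dense_W, u, b, x, "l2"))     # 2.0
print(certified_radius(sparse_W, u, b, x, "l2"))    # 2.0
```

In this toy case both rows have the same $\ell_2$ norm and give the same margin, but the sparse row has the smaller $\ell_1$ norm and therefore the larger $\ell_\infty$ radius, which is the kind of effect the abstract attributes to sparser projections.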
cs.HC [Back]
[113] Perspectra: Choosing Your Experts Enhances Critical Thinking in Multi-Agent Research Ideation
Yiren Liu,Viraj Shah,Sangho Suh,Pao Siangliulue,Tal August,Yun Huang
Main category: cs.HC
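TL;DR: Perspectra is an interactive multi-agent system (MAS) that uses visualization and structured deliberation to strengthen users' control over multi-agent collaboration and their critical thinking, significantly improving research proposal quality and interdisciplinary interaction.
Details
Motivation: Current multi-agent systems (MAS) for information search and ideation lack mechanisms for users to effectively control and critically evaluate agent collaboration. Perspectra aims to fill this gap. Contribution: Through a forum-style interface, a real-time mind map, and targeted agent invitations, Perspectra strengthens users' critical thinking and control in multi-agent collaboration.
Method: Perspectra uses visualization and structured deliberation mechanisms (such as @-mentions and threading) to support interaction between users and multiple agents. A controlled study with 18 participants compared Perspectra to a group-chat baseline.
Result: Perspectra significantly increased the frequency and depth of critical-thinking behaviors, elicited more interdisciplinary replies, and led to more proposal revisions.
Insight: Multi-agent tools designed to support user control, especially through visualization and structured interaction, can effectively improve critical thinking and decision quality in collaboration.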
Abstract: Recent advances in multi-agent systems (MAS) enable tools for information search and ideation by assigning personas to agents. However, how users can effectively control, steer, and critically evaluate collaboration among multiple domain-expert agents remains underexplored. We present Perspectra, an interactive MAS that visualizes and structures deliberation among LLM agents via a forum-style interface, supporting @-mentions to invite targeted agents, threading for parallel exploration, and a real-time mind map for visualizing arguments and rationales. In a within-subjects study with 18 participants, we compared Perspectra to a group-chat baseline as they developed research proposals. Our findings show that Perspectra significantly increased the frequency and depth of critical-thinking behaviors, elicited more interdisciplinary replies, and led to more frequent proposal revisions than the group-chat condition. We discuss implications for designing multi-agent tools that scaffold critical thinking by supporting user control over multi-agent adversarial discourse.
cs.CY [Back]
[114] Blueprints of Trust: AI System Cards for End to End Transparency and Governance
Huzaifa Sidhpurwala,Emily Fox,Garth Mollett,Florencio Cano Gabarda,Roman Zhukov
Main category: cs.CY
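TL;DR: The paper introduces the Hazard-Aware System Card (HASC), a new framework that aims to improve the transparency and accountability of AI systems by keeping a dynamic record of a system's security and safety posture and by proposing standardized identifiers.
Details
Motivation: AI systems currently lack transparency and accountability during development and deployment; traditional model cards and system cards are limited and cannot fully record a system's security and safety posture. Contribution: Proposes the HASC framework, which integrates dynamic security and safety records and introduces standardized identifiers such as the AI Safety Hazard (ASH) ID to strengthen transparency and accountability.
Method: HASC extends existing model cards and system cards, integrates security identifiers (such as CVEs), and provides a dynamic update mechanism.
Result: HASC gives developers and stakeholders a reliable source of information, supporting safety decisions across the AI system lifecycle.
Insight: HASC can complement the ISO/IEC 42001:2023 standard, providing more comprehensive support for AI system transparency and accountability.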
Abstract: This paper introduces the Hazard-Aware System Card (HASC), a novel framework designed to enhance transparency and accountability in the development and deployment of AI systems. The HASC builds upon existing model card and system card concepts by integrating a comprehensive, dynamic record of an AI system’s security and safety posture. The framework proposes a standardized system of identifiers, including a novel AI Safety Hazard (ASH) ID, to complement existing security identifiers like CVEs, allowing for clear and consistent communication of fixed flaws. By providing a single, accessible source of truth, the HASC empowers developers and stakeholders to make more informed decisions about AI system safety throughout its lifecycle. Ultimately, we also compare our proposed AI system cards with the ISO/IEC 42001:2023 standard and discuss how they can be used to complement each other, providing greater transparency and accountability for AI systems.
[115] Communication Bias in Large Language Models: A Regulatory Perspective
Adrian Kuenzler,Stefan Schmid
Main category: cs.CY
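TL;DR: This paper examines bias in large language model (LLM) outputs and its societal impact, and argues that existing regulation should be complemented by stronger attention to competition and design governance to ensure fair AI.
Details
Motivation: As LLMs spread across applications, concerns about biased outputs, fairness, and regulatory compliance are growing. The paper analyzes the problem from a regulatory perspective.
Contribution: Drawing on the EU's AI Act and Digital Services Act, the paper argues that competition and design governance need to be strengthened beyond regulation.
Method: The paper reviews the risks of biased LLM outputs and their societal impact, and discusses the limits of existing regulatory frameworks.
Result: The paper concludes that regulation alone is not enough to ensure fairness in LLMs; broader competition and design governance mechanisms are also needed.
Insight: Regulators should pay more attention to how LLMs are designed and to market dynamics in order to address bias systematically.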
Abstract: Large language models (LLMs) are increasingly central to many applications, raising concerns about bias, fairness, and regulatory compliance. This paper reviews risks of biased outputs and their societal impact, focusing on frameworks like the EU’s AI Act and the Digital Services Act. We argue that beyond constant regulation, stronger attention to competition and design governance is needed to ensure fair, trustworthy AI. This is a preprint of the Communications of the ACM article of the same title.
cs.AI [Back]
[116] CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao,Chengxiao Dai,Wei Zhuo,Yue Xiu,Dusit Niyato
Main category: cs.AI
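TL;DR: CLAUSE proposes dynamic, learnable context engineering: multi-agent reinforcement learning jointly optimizes subgraph construction, path discovery, and evidence selection in knowledge graph reasoning, enabling efficient reasoning under resource budgets.
Details
Motivation: Static expansion methods for knowledge graph reasoning over-retrieve and yield unpredictable runtime. CLAUSE uses a dynamic decision process to balance accuracy, latency, and cost while preserving provenance.
Contribution: 1) Proposes the CLAUSE framework, in which three agents (Subgraph Architect, Path Navigator, Context Curator) jointly optimize the reasoning process; 2) proposes the LC-MAPPO algorithm for coordinating the agents under resource constraints; 3) demonstrates higher EM@1 with lower latency and subgraph growth on several datasets.
Method: CLAUSE uses LC-MAPPO to coordinate the three agents, which decide dynamically what to expand, which paths to follow, and what evidence to keep, subject to user-specified resource budgets such as latency (interaction steps) and token cost.
Result: On HotpotQA, MetaQA, and FactKG, CLAUSE raises EM@1 while reducing subgraph growth and latency. For example, on MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), it achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth.
Insight: Combining dynamic context engineering with explicit resource budgets improves reasoning efficiency while preserving provenance, which suits real deployment scenarios.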
Abstract: Knowledge graphs provide structured context for multi-hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static k-hop expansions and “think-longer” prompting often over-retrieve, inflate context, and yield unpredictable runtime. We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user-specified budgets or prices, allowing per-query adaptation to trade-offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning-path discovery, and evidence selection are jointly optimized under per-query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end-to-end latency at equal or lower token budgets. On MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance-preserving, and deliver predictable performance under deployment constraints.
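The abstract describes LC-MAPPO as Lagrangian-constrained and says that budgets on edge edits, interaction steps, and selected tokens are exposed as prices. A minimal sketch of that general idea follows, assuming a standard Lagrangian relaxation with a projected dual update. The class name, resource keys, step size, and numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LagrangianBudget:
    """Hypothetical helper: one Lagrange multiplier (price) per budgeted resource."""

    budgets: Dict[str, float]  # per-query budget for each resource
    lr: float = 0.05           # step size for the dual update (assumed value)
    multipliers: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Start with zero prices, so unconstrained behavior is the default.
        self.multipliers = {name: 0.0 for name in self.budgets}

    def shaped_reward(self, task_reward: float, costs: Dict[str, float]) -> float:
        """Task reward minus the priced resource usage of one episode."""
        penalty = sum(self.multipliers[name] * costs[name] for name in self.budgets)
        return task_reward - penalty

    def dual_update(self, mean_costs: Dict[str, float]) -> None:
        """Raise a price when its budget is exceeded, lower it otherwise (floor at 0)."""
        for name, limit in self.budgets.items():
            violation = mean_costs[name] - limit
            self.multipliers[name] = max(0.0, self.multipliers[name] + self.lr * violation)


# Illustrative numbers only.
budget = LagrangianBudget({"edge_edits": 20.0, "steps": 6.0, "tokens": 512.0})
budget.dual_update({"edge_edits": 30.0, "steps": 4.0, "tokens": 512.0})
reward = budget.shaped_reward(1.0, {"edge_edits": 2.0, "steps": 3.0, "tokens": 100.0})
```

In this sketch only the resource that overshot its budget (edge edits) acquires a positive price, so later episodes are penalized for edge edits but not for steps or tokens. The policy-gradient part of the algorithm is deliberately left out.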
[117] Disagreements in Reasoning: How a Model’s Thinking Process Dictates Persuasion in Multi-Agent Systems
Haodong Zhao,Jidong Li,Zhaomin Wu,Tianjie Ju,Zhuosheng Zhang,Bingsheng He,Gongshen Liu
Main category: cs.AI
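TL;DR: Through multi-agent experiments, this paper challenges the assumption that model scale governs persuasiveness. It argues that a model's cognitive process and reasoning capacity drive persuasion dynamics, identifies a Persuasion Duality, and examines the complex dynamics of influence in transmission persuasion.
Details
Motivation: To identify what governs persuasion dynamics when Large Language Models (LLMs) and Large Reasoning Models (LRMs) interact in multi-agent systems, challenging the assumption that model scale is the main factor.
Contribution: Identifies the Persuasion Duality: stronger reasoning makes a model more resistant to persuasion, and making the reasoning process transparent makes it more persuasive. The work links a model's internal processing architecture to its external behavior.
Method: Runs multi-agent persuasion experiments, analyzes how the transparency of the reasoning process affects persuasive success, and studies influence dynamics in multi-hop transmission persuasion.
Result: Models with explicit reasoning resist persuasion more strongly, and sharing their thinking content markedly increases their ability to persuade others. In multi-hop transmission, influence propagation and decay show complex dynamics.
Insight: A model's internal reasoning architecture strongly shapes its external persuasive behavior, with implications for the safety, robustness, and design of future multi-agent systems.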
Abstract: The rapid proliferation of recent Multi-Agent Systems (MAS), where Large Language Models (LLMs) and Large Reasoning Models (LRMs) usually collaborate to solve complex problems, necessitates a deep understanding of the persuasion dynamics that govern their interactions. This paper challenges the prevailing hypothesis that persuasive efficacy is primarily a function of model scale. We propose instead that these dynamics are fundamentally dictated by a model’s underlying cognitive process, especially its capacity for explicit reasoning. Through a series of multi-agent persuasion experiments, we uncover a fundamental trade-off we term the Persuasion Duality. Our findings reveal that the reasoning process in LRMs exhibits significantly greater resistance to persuasion, maintaining their initial beliefs more robustly. Conversely, making this reasoning process transparent by sharing the “thinking content” dramatically increases their ability to persuade others. We further consider more complex transmission persuasion situations and reveal complex dynamics of influence propagation and decay within multi-hop persuasion between multiple agent networks. This research provides systematic evidence linking a model’s internal processing architecture to its external persuasive behavior, offering a novel explanation for the susceptibility of advanced models and highlighting critical implications for the safety, robustness, and design of future MAS.
[118] TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Yidong Wang,Yunze Song,Tingyuan Zhu,Xuanwang Zhang,Zhuohao Yu,Hao Chen,Chiyu Song,Qiufeng Wang,Cunxiang Wang,Zhen Wu,Xinyu Dai,Yue Zhang,Wei Ye,Shikun Zhang
Main category: cs.AI
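TL;DR: The paper proposes TrustJudge, a framework that addresses scoring inconsistencies in LLM-as-a-judge evaluation, namely Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency, and improves evaluation reliability through a probabilistic approach.
Details
Motivation: LLMs used as automated evaluators show serious scoring inconsistencies, which undermine the reliability and fairness of evaluation.
Contribution: 1. Systematically analyzes two types of inconsistency in LLM-as-a-judge evaluation;
2. Proposes the TrustJudge framework, which addresses them through distribution-sensitive scoring and likelihood-aware aggregation;
3. Substantially reduces inconsistency in experiments while maintaining higher evaluation accuracy.
Method: 1. Distribution-sensitive scoring: computes a continuous expectation from the probability distribution over discrete ratings, preserving information entropy;
2. Likelihood-aware aggregation: resolves transitivity violations using bidirectional preference probabilities or perplexity.
Result: With Llama-3.1-70B-Instruct as judge, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy.
Insight: Information loss in discrete ratings and ambiguous tie judgments cause the inconsistencies in LLM evaluation. Probabilistic scoring and aggregation mitigate them and make evaluation more reliable.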
Abstract: The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C≠A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
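To make distribution-sensitive scoring concrete, the sketch below computes a continuous expected score from a judge's probabilities over discrete ratings, and averages a pairwise preference over both presentation orders. It is an illustration of the idea as stated in the abstract, not the TrustJudge implementation: the function names and probabilities are made up, and simple averaging is an assumed stand-in for the paper's likelihood-aware aggregation.

```python
from typing import Dict


def expected_score(rating_probs: Dict[int, float]) -> float:
    """Continuous score: expectation of the rating under the judge's distribution."""
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("rating probabilities must have positive mass")
    # Normalize in case the probabilities cover only the rating tokens.
    return sum(rating * prob for rating, prob in rating_probs.items()) / total


def bidirectional_preference(p_a_shown_first: float, p_a_shown_second: float) -> float:
    """Assumed aggregation: average P(A preferred) over both presentation orders."""
    return 0.5 * (p_a_shown_first + p_a_shown_second)


# Both responses have the same most likely rating (4), so a discrete judge ties them.
response_a = {3: 0.10, 4: 0.60, 5: 0.30}
response_b = {3: 0.35, 4: 0.60, 5: 0.05}
score_a = expected_score(response_a)
score_b = expected_score(response_b)
prefers_a = bidirectional_preference(0.70, 0.40) > 0.5
```

With these made-up numbers the discrete rating is 4 for both responses, while the expected scores separate them, which is the information the abstract says discrete ratings lose.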
[119] Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
Xuemiao Zhang,Can Ren,Chengying Tu,Rongxiang Weng,Shuo Wang,Hongfei Yan,Jingang Wang,Xunliang Cai
Main category: cs.AI
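TL;DR: This paper proposes expanding a foundation model's reasoning ability with chain-of-thought (CoT) data rich in diverse reasoning patterns. It defines a reasoning potential metric and designs an efficient data selection algorithm.
Details
Motivation: Recent gains on mathematical reasoning rely on long CoT data, but there has been little study of which data most effectively improve a model's reasoning ability. The paper aims to fill this gap.
Contribution: Defines a model's reasoning potential for the first time, proposes expanding it with diverse data enriched with high-value reasoning patterns, and designs a dual-granularity algorithm for efficient data selection.
Method: Abstracts atomic reasoning patterns from CoT sequences to build a core reference set, then uses a dual-granularity algorithm based on chains of reasoning patterns and token entropy to select high-value CoT data (CoTP).
Result: With only 10B tokens of CoTP data, the 85A6B Mixture-of-Experts (MoE) model improves by 9.58% on AIME 2024 and 2025, and the upper bound of downstream RL performance rises by 7.81%.
Insight: Selecting data rich in high-value reasoning patterns substantially improves reasoning depth and performance, and suggests a new direction for data selection in reasoning models.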
Abstract: Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model’s reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
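The abstract defines reasoning potential as the inverse of the number of independent attempts required to answer a question correctly. The sketch below shows one way to compute such a quantity from sampled attempts. Scoring a question that is never solved within the sample as 0 and averaging over questions are both assumptions made here for illustration, and all function names are hypothetical.

```python
from typing import Optional, Sequence


def attempts_until_correct(outcomes: Sequence[bool]) -> Optional[int]:
    """1-based index of the first correct attempt, or None if none succeeded."""
    for attempt, correct in enumerate(outcomes, start=1):
        if correct:
            return attempt
    return None


def reasoning_potential(outcomes: Sequence[bool]) -> float:
    """Inverse of the attempts needed; 0.0 if never solved (an assumption)."""
    attempts = attempts_until_correct(outcomes)
    return 0.0 if attempts is None else 1.0 / attempts


def mean_reasoning_potential(per_question: Sequence[Sequence[bool]]) -> float:
    """Average potential over a set of questions (assumed aggregation)."""
    return sum(reasoning_potential(o) for o in per_question) / len(per_question)


# Illustrative outcomes: solved at once, solved on the second try, never solved.
samples = [[True], [False, True], [False, False]]
potential = mean_reasoning_potential(samples)
```

If each attempt succeeded independently with probability p, the expected number of attempts would be 1/p, so this quantity loosely tracks per-attempt success probability.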
[120] VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Yidan Zhang,Mutian Xu,Yiming Hao,Kun Zhou,Jiahao Chang,Xiaoqiang Liu,Pengfei Wan,Hongbo Fu,Xiaoguang Han
Main category: cs.AI
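TL;DR: VC-Agent is an interactive agent that understands users' queries and feedback and retrieves or scales up relevant video clips with minimal user input.
Details
Motivation: Internet video data is increasingly important for model training, but collecting large numbers of videos that meet specific needs is labor-intensive and time-consuming. The authors use an interactive agent to speed up this process.
Contribution: 1. Proposes VC-Agent, the first interactive agent that lets users specify requirements through textual descriptions and confirmations; 2. uses multi-modal large language models to connect user requirements with video content; 3. proposes two filtering policies that are updated as interaction continues; 4. provides a new benchmark to verify the agent's practical use.
Method: 1. Defines user-friendly ways to specify varied requirements; 2. matches user requirements to video content with multi-modal large language models; 3. updates the filtering policies as the user interacts, to improve retrieval.
Result: Experiments confirm the effectiveness and efficiency of VC-Agent for customized video dataset collection.
Insight: An interactive agent can substantially reduce user effort, and filtering policies that update with user feedback improve the precision and efficiency of video retrieval.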
Abstract: Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study the way to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users’ queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user’s requirements with the video content. More importantly, we propose two novel filtering policies that can be updated when user interaction is continually performed. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent’s usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.
cs.MA [Back]
[121] RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
Kai Zhang,Corey D Barrett,Jangwon Kim,Lichao Sun,Tara Taghavi,Krishnaram Kenthapadi
Main category: cs.MA
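TL;DR: RadAgents is a multi-agent framework for clinical interpretation of chest X-rays (CXR) that couples clinical priors with task-aware multimodal reasoning to make outputs more reliable and transparent.
Details
Motivation: Existing CXR interpretation methods lack clinical interpretability, fuse multimodal evidence insufficiently, and cannot detect or resolve inconsistencies across tools. RadAgents aims to address these problems.
Contribution: Proposes the RadAgents framework, which couples clinical priors with multimodal reasoning and adds a retrieval-augmented verification mechanism to improve the reliability and consistency of CXR interpretation.
Method: Uses a multi-agent architecture that combines clinical priors with task-aware multimodal reasoning, and integrates grounding and multimodal retrieval augmentation to resolve context conflicts.
Result: The outputs are more reliable, more transparent, and more consistent with clinical practice.
Insight: Combining multi-agent collaboration with retrieval augmentation is an effective way to improve the quality and clinical applicability of medical image interpretation.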
Abstract: Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework for CXR interpretation that couples clinical priors with task-aware multimodal reasoning. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.
cs.IR [Back]
[122] Interactive Recommendation Agent with Active User Commands
Jiakai Tang,Yujie Luo,Xunke Xi,Fei Sun,Xueyang Feng,Sunhao Dai,Chao Yi,Dian Chen,Zhujin Gao,Yang Li,Xu Chen,Wen Chen,Jian Wu,Yuning Jiang,Bo Zheng
Main category: cs.IR
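TL;DR: This paper proposes the Interactive Recommendation Feed (IRF), a paradigm that lets users actively control recommendation policies through natural language commands, addressing the limits of traditional passive-feedback systems.
Details
Motivation: Traditional recommender systems rely on passive feedback such as like and dislike, which cannot capture users' nuanced motivations and intentions. The result is inaccurate preference modeling, lower user satisfaction, and weaker system effectiveness.
Contribution: 1. Introduces the Interactive Recommendation Feed (IRF) paradigm, which supports natural language commands; 2. proposes RecBot, a dual-agent architecture (Parser Agent and Planner Agent) for dynamic policy adjustment; 3. uses simulation-augmented knowledge distillation to balance efficiency and reasoning ability.
Method: 1. The IRF paradigm accepts natural language commands; 2. in RecBot, the Parser Agent turns linguistic expressions into structured preferences and the Planner Agent adjusts the recommendation policy on the fly; 3. simulation-augmented knowledge distillation improves efficiency.
Result: In offline and long-term online experiments, RecBot significantly improves both user satisfaction and business outcomes.
Insight: Active linguistic interaction captures user intent more precisely and compensates for the shortcomings of traditional systems.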
Abstract: Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users’ nuanced behavior motivations and intentions. In turn, current systems also cannot distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive implicit behavioral influence, IRF empowers active explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture where a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes.
[123] Provenance Analysis of Archaeological Artifacts via Multimodal RAG Systems
Tuo Zhang,Yuechun Sun,Ruiliang Liu
Main category: cs.IR
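TL;DR: This paper proposes a multimodal system based on retrieval-augmented generation (RAG) for provenance analysis of archaeological artifacts. It combines multimodal retrieval with large vision-language models (VLMs) to support expert reasoning.
Details
Motivation: Provenance analysis of archaeological artifacts involves large amounts of multimodal data such as text and images. Traditional methods struggle to integrate this information efficiently, which adds to experts' cognitive load, so a system that automates retrieval and reasoning is needed.
Contribution: 1. Designs a RAG-based dual-modal knowledge base that supports raw visual, edge-enhanced, and semantic retrieval; 2. uses a VLM to generate structured inferences, including chronological, geographical, and cultural attributions; 3. evaluates the system on a set of Eastern Eurasian Bronze Age artifacts.
Method: 1. Builds a dual-modal knowledge base from reference texts and images; 2. applies multimodal retrieval (raw visual, edge-enhanced, semantic); 3. has the VLM synthesize the retrieved candidates into structured inferences.
Result: Expert evaluation shows that the system produces meaningful and interpretable outputs, offers concrete starting points for analysis, and significantly eases the burden of navigating large comparative corpora.
Insight: Multimodal RAG systems show promise in fields such as archaeology that require complex reasoning, since they can integrate heterogeneous data and provide interpretable decision support.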
Abstract: In this work, we present a retrieval-augmented generation (RAG)-based system for provenance analysis of archaeological artifacts, designed to support expert reasoning by integrating multimodal retrieval and large vision-language models (VLMs). The system constructs a dual-modal knowledge base from reference texts and images, enabling raw visual, edge-enhanced, and semantic retrieval to identify stylistically similar objects. Retrieved candidates are synthesized by the VLM to generate structured inferences, including chronological, geographical, and cultural attributions, alongside interpretive justifications. We evaluate the system on a set of Eastern Eurasian Bronze Age artifacts from the British Museum. Expert evaluation demonstrates that the system produces meaningful and interpretable outputs, offering scholars concrete starting points for analysis and significantly alleviating the cognitive burden of navigating vast comparative corpora.
cs.GR [Back]
[124] SeamCrafter: Enhancing Mesh Seam Generation for Artist UV Unwrapping via Reinforcement Learning
Duoteng Xu,Yuguang Chen,Jing Li,Xinhai Liu,Xueqi Ma,Zhuo Chen,Dongyu Zhang,Chunchao Guo
Main category: cs.GR
TL;DR: SeamCrafter提出了一种基于强化学习的自动生成网格接缝的方法,通过双分支点云编码器和偏好优化,显著降低了UV展开中的失真和碎片化问题。
Details
Motivation: 现有的网格接缝生成方法往往在高失真和碎片化之间难以平衡,影响了纹理合成和艺术家的工作流程,因此需要一种更优的解决方案。Contribution: 1. 引入SeamCrafter,一种基于GPT风格的自回归接缝生成器;2. 提出双分支点云编码器,分离并捕捉拓扑和几何特征;3. 使用DPO进行偏好优化,进一步提升接缝质量。
Method: 1. 采用自回归模型生成接缝;2. 通过双分支编码器处理点云输入;3. 使用DPO在偏好数据集上进行微调,评估指标包括失真和碎片化。
Result: 实验表明,SeamCrafter生成的接缝在失真和碎片化方面优于现有方法,同时保持了拓扑一致性和视觉保真度。
Insight: 通过强化学习和偏好优化,可以显著改善UV展开的质量,为艺术家提供更高效的辅助工具。
Abstract: Mesh seams play a pivotal role in partitioning 3D surfaces for UV parametrization and texture mapping. Poorly placed seams often result in severe UV distortion or excessive fragmentation, thereby hindering texture synthesis and disrupting artist workflows. Existing methods frequently trade one failure mode for another, producing either high distortion or many scattered islands. To address this, we introduce SeamCrafter, an autoregressive GPT-style seam generator conditioned on point cloud inputs. SeamCrafter employs a dual-branch point-cloud encoder that disentangles and captures complementary topological and geometric cues during pretraining. To further enhance seam quality, we fine-tune the model using Direct Preference Optimization (DPO) on a preference dataset derived from a novel seam-evaluation framework. This framework assesses seams primarily by UV distortion and fragmentation, and provides pairwise preference labels to guide optimization. Extensive experiments demonstrate that SeamCrafter produces seams with substantially lower distortion and fragmentation than prior approaches, while preserving topological consistency and visual fidelity.
[125] ArchGPT: Understanding the World’s Architectures with Large Multimodal Models
Yuze Wang,Luo Yang,Junyi Wang,Yue Qi
Main category: cs.GR
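TL;DR: ArchGPT is a multimodal architectural visual question answering (VQA) model trained on the large-scale Arch-300K dataset, whose construction pipeline combines 3D reconstruction and semantic segmentation. It targets the poor scalability of existing VR/MR/AR systems for architecture.
Details
Motivation: Existing VR/MR/AR systems are usually built case by case and do not generalize or scale across diverse built environments, so a more general way to understand and analyze architecture is needed.
Contribution: 1. Proposes ArchGPT, a multimodal architectural VQA model; 2. creates the Arch-300K dataset of approximately 315,000 high-quality architectural image-question-answer triplets; 3. proposes a coarse-to-fine data-construction pipeline that integrates 3D reconstruction and semantic segmentation.
Method: 1. Curates architectural scenes from Wikimedia Commons and tourist photo collections, using a coarse-to-fine strategy with 3D reconstruction and semantic segmentation to select high-quality images; 2. generates reliable question-answer pairs through LLM-guided text verification and knowledge distillation; 3. fine-tunes ShareGPT4V-7B on Arch-300K to obtain ArchGPT.
Result: ArchGPT provides architectural VQA capability and addresses the limited scalability of existing systems.
Insight: Multimodal models and large-scale datasets can substantially improve visual understanding of architecture, offering new tools for education, heritage preservation, and design practice.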
Abstract: Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
cs.SI [Back]
[126] Visual Authority and the Rhetoric of Health Misinformation: A Multimodal Analysis of Social Media Videos
Mohammad Reza Zarei,Barbara Stead-Coyle,Michael Christensen,Sarah Everts,Majid Komeili
Main category: cs.SI
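TL;DR: Using multimodal analysis, this paper studies visual authority and narrative strategies in health misinformation videos on social media, showing how authority signals, narrative techniques, and monetization are intertwined.
Details
Motivation: Short-form video platforms have become central sources of health advice, but their content mixes useful, misleading, and harmful information. Rather than adjudicating truth, the paper asks how credibility is packaged through visual authority and narrative techniques.
Contribution: Assembles a cross-platform video corpus, designs a transparent multimodal annotation pipeline, and shows how authority signals co-occur with narrative strategies and monetization.
Method: Collects 152 public videos from TikTok, Instagram, and YouTube and annotates each on 26 features covering visual authority, presenter attributes, narrative strategies, and engagement cues, using automatic speech recognition, a multimodal model, and human verification.
Result: A confident single presenter in a home or studio setting dominates, and clinical contexts are rare. Authority cues such as titles, slides and charts, and certificates frequently occur with persuasive elements (jargon, references, fear or urgency, critiques of mainstream medicine, conspiracies) and with monetization such as sales links and calls to subscribe.
Insight: References and science-like visuals tend to accompany emotive and oppositional narratives rather than neutral information, which shows how elaborate packaging makes health misinformation appear credible.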
Abstract: Short-form video platforms are central sites for health advice, where alternative narratives mix useful, misleading, and harmful content. Rather than adjudicating truth, this study examines how credibility is packaged in nutrition and supplement videos by analyzing the intersection of authority signals, narrative techniques, and monetization. We assemble a cross-platform corpus of 152 public videos from TikTok, Instagram, and YouTube and annotate each on 26 features spanning visual authority, presenter attributes, narrative strategies, and engagement cues. A transparent annotation pipeline integrates automatic speech recognition, principled frame selection, and a multimodal model, with human verification on a stratified subsample showing strong agreement. Descriptively, a confident single presenter in studio or home settings dominates, and clinical contexts are rare. Analytically, authority cues such as titles, slides and charts, and certificates frequently occur with persuasive elements including jargon, references, fear or urgency, critiques of mainstream medicine, and conspiracies, and with monetization including sales links and calls to subscribe. References and science-like visuals often travel with emotive and oppositional narratives rather than signaling restraint.
cs.CR [Back]
[127] Every Character Counts: From Vulnerability to Defense in Phishing Detection
Maria Chiper,Radu Tudor Ionescu
Main category: cs.CR
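TL;DR: The paper studies character-level deep learning models for phishing email detection, compares three models (CharCNN, CharGRU, CharBiLSTM), and analyzes their robustness under adversarial attacks and their interpretability.
Details
Motivation: Phishing attacks are increasingly sophisticated, yet automatic detection methods lack explainability and robustness to new attacks. The paper investigates whether character-level models can provide both.
Contribution: 1. Compares three character-level deep learning models for phishing detection; 2. analyzes model robustness under adversarial attacks; 3. adapts Grad-CAM to character-level inputs for interpretability.
Method: Evaluates CharCNN, CharGRU, and CharBiLSTM on a custom-built email dataset under three scenarios: standard training and testing, testing under adversarial attacks, and adversarial training.
Result: CharGRU performs best in all scenarios under limited computational resources. Adversarial training substantially improves robustness, and Grad-CAM visualizes which parts of each email influence a model's decision.
Insight: Character-level models combine performance with interpretability for phishing detection, and adversarial training is an effective way to improve robustness.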
Abstract: Phishing attacks targeting both organizations and individuals are becoming an increasingly significant threat as technology advances. Current automatic detection methods often lack explainability and robustness in detecting new phishing attacks. In this work, we investigate the effectiveness of character-level deep learning models for phishing detection, which can provide both robustness and interpretability. We evaluate three neural architectures adapted to operate at the character level, namely CharCNN, CharGRU, and CharBiLSTM, on a custom-built email dataset, which combines data from multiple sources. Their performance is analyzed under three scenarios: (i) standard training and testing, (ii) standard training and testing under adversarial attacks, and (iii) training and testing with adversarial examples. Aiming to develop a tool that operates as a browser extension, we test all models under limited computational resources. In this constrained setup, CharGRU proves to be the best-performing model across all scenarios. All models show vulnerability to adversarial attacks, but adversarial training substantially improves their robustness. In addition, by adapting the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to character-level inputs, we are able to visualize which parts of each email influence the decision of each model. Our open-source code and data are released at https://github.com/chipermaria/every-character-counts.
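Character-level models such as CharCNN or CharGRU consume sequences of character indices rather than word tokens. A minimal sketch of that input encoding follows. The alphabet, the padding and unknown ids, and the sequence length are assumptions made for illustration and are not taken from the paper's code.

```python
import string
from typing import List

# Assumed alphabet; id 0 is reserved for padding and id 1 for unknown characters.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
PAD_ID, UNK_ID = 0, 1
CHAR_TO_ID = {ch: idx + 2 for idx, ch in enumerate(ALPHABET)}


def encode_chars(text: str, max_len: int = 16) -> List[int]:
    """Lowercase, truncate to max_len, map characters to ids, then right-pad."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in text.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# A one-character substitution of the kind an adversarial attack might use.
clean = encode_chars("paypal")
perturbed = encode_chars("paypa1")
changed_positions = [i for i, (a, b) in enumerate(zip(clean, perturbed)) if a != b]
```

Because each character keeps its own position in the sequence, a single substitution changes exactly one input id, which is one reason character-level inputs lend themselves to position-wise attribution.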
cs.IT [Back]
[128] On Theoretical Interpretations of Concept-Based In-Context Learning
Huaze Tang,Tianren Peng,Shao-lun Huang
Main category: cs.IT
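TL;DR: This paper develops a theoretical account of concept-based in-context learning (CB-ICL), explains why it performs well on prompts with only a few demonstrations, and quantifies the knowledge an LLM can leverage, offering guidance for model pre-training and prompt engineering.
Details
Motivation: The theoretical mechanism behind in-context learning (ICL) remains unclear. By studying CB-ICL, the authors aim to narrow this gap and guide practical use.
Contribution: 1. Proposes a theoretical analysis of CB-ICL; 2. quantifies the knowledge that LLMs can leverage and derives a similarity measure between prompt demonstrations and the query input; 3. explores how the number of prompt demonstrations and the embedding dimension affect ICL.
Method: Analyzes the application of CB-ICL to ICL tasks theoretically, studies why and when it predicts query labels well, and validates the theory with experiments on real data.
Result: Real-data experiments support the practical usefulness of CB-ICL and the corresponding theory for predicting query labels.
Insight: The similarity between prompt demonstrations and the query input is a key factor in CB-ICL performance, which has direct implications for model pre-training and prompt design.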
Abstract: In-Context Learning (ICL) has emerged as an important new paradigm in natural language processing and large language model (LLM) applications. However, the theoretical understanding of the ICL mechanism remains limited. This paper aims to investigate this issue by studying a particular ICL approach, called concept-based ICL (CB-ICL). In particular, we propose theoretical analyses on applying CB-ICL to ICL tasks, which explain why and when CB-ICL performs well for predicting query labels in prompts with only a few demonstrations. In addition, the proposed theory quantifies the knowledge that can be leveraged by the LLMs for the prompt tasks, and leads to a similarity measure between the prompt demonstrations and the query input, which provides important insights and guidance for model pre-training and prompt engineering in ICL. Moreover, the impact of the prompt demonstration size and the dimension of the LLM embeddings in ICL is also explored based on the proposed theory. Finally, several real-data experiments are conducted to validate the practical usefulness of CB-ICL and the corresponding theory.
cs.RO [Back]
[129] Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
Junfeng Yan,Biao Wu,Meng Fang,Ling Chen
Main category: cs.RO
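TL;DR: This paper introduces Automotive-ENV, the first high-fidelity benchmark and interaction environment designed for vehicle GUIs, and ASURADA, a geo-aware multimodal agent that significantly improves success on safety-aware tasks.
Details
Motivation: In-vehicle GUI interaction has distinct characteristics (drivers' limited attention, strict safety requirements, and complex location-based interaction patterns) that remain largely unexplored, and no dedicated benchmark exists.
Contribution: 1. Proposes Automotive-ENV, the first benchmark platform designed for vehicle GUIs; 2. develops the ASURADA agent, which adjusts its behavior dynamically through geo-awareness.
Method: 1. Builds a benchmark platform with 185 parameterized tasks; 2. proposes the ASURADA agent, which integrates GPS-informed context to adjust its actions dynamically.
Result: Experiments show that geo-aware information significantly improves success on safety-aware tasks.
Insight: In vehicle GUI interaction, location-based context is important for both task completion and safety.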
Abstract: Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers’ limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
[130] RAM-NAS: Resource-aware Multiobjective Neural Architecture Search Method for Robot Vision Tasks
Shouren Mao,Minghao Qin,Wei Dong,Huajian Liu,Yongzhuo Gao
Main category: cs.RO
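TL;DR: RAM-NAS is a multi-objective neural architecture search (NAS) method for robot vision tasks. It improves supernet pretraining with subnets mutual distillation and the DKD loss, adds hardware resource awareness, speeds up search with latency predictors, and significantly reduces inference latency on robot hardware.
Details
Motivation: Existing NAS methods pay little attention to robot hardware resources, and their supernet pretraining is insufficient. RAM-NAS is proposed to improve resource awareness and model efficiency.
Contribution: 1. Proposes subnets mutual distillation and uses the DKD loss to strengthen supernet pretraining; 2. introduces latency predictors for hardware-resource-aware search; 3. validates accuracy and low latency on ImageNet and downstream tasks.
Method: 1. Improves supernet pretraining with subnets mutual distillation and the DKD loss; 2. trains latency surrogate predictors on data from three types of robotic edge hardware to speed up search; 3. runs a multi-objective evolutionary search that balances accuracy and latency.
Result: RAM-NAS models reach 76.7% to 81.4% top-1 accuracy on ImageNet with significantly lower inference latency on edge hardware. On downstream tasks, detection and segmentation inference time is lower than that of MobileNetv3-based methods on all three hardware types.
Insight: Resource-aware NAS matters for robotic edge devices. Subnets mutual distillation and hardware latency prediction are the key ideas, and they suggest a new direction for lightweight model design.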
Abstract: Neural architecture search (NAS) has shown great promise in automatically designing lightweight models. However, conventional approaches train the supernet insufficiently and pay little attention to actual robot hardware resources. To meet these challenges, we propose RAM-NAS, a resource-aware multi-objective NAS method that focuses on improving supernet pretraining and resource awareness on robot hardware devices. We introduce the concept of subnets mutual distillation, which refers to mutually distilling all subnets sampled by the sandwich rule. Additionally, we utilize the Decoupled Knowledge Distillation (DKD) loss to enhance logits distillation performance. To expedite the search process with consideration for hardware resources, we used data from three types of robotic edge hardware to train Latency Surrogate predictors. These predictors facilitated the estimation of hardware inference latency during the search phase, enabling a unified multi-objective evolutionary search to balance model accuracy and latency trade-offs. Our discovered model family, RAM-NAS models, can achieve top-1 accuracy ranging from 76.7% to 81.4% on ImageNet. In addition, the resource-aware multi-objective NAS we employ significantly reduces the model’s inference latency on edge hardware for robots. We conducted experiments on downstream tasks to verify the scalability of our methods. The inference time for detection and segmentation is reduced on all three hardware types compared to MobileNetv3-based methods. Our work fills the gap in resource-aware NAS for robot hardware.
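A multi-objective search that balances accuracy against latency typically keeps the non-dominated (Pareto-optimal) candidates at each generation. The sketch below shows that selection step only, as a generic illustration rather than the RAM-NAS search itself. The candidate names and numbers are made up and are not results from the paper.

```python
from typing import List, Tuple

# (name, top-1 accuracy in %, predicted latency in ms); hypothetical values.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on at least one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


population = [
    ("subnet_a", 76.0, 10.0),
    ("subnet_b", 78.0, 15.0),
    ("subnet_c", 77.0, 20.0),  # less accurate and slower than subnet_b
    ("subnet_d", 80.0, 30.0),
]
front_names = [name for name, _, _ in pareto_front(population)]
```

In a search loop of this kind, the latency column would come from a learned latency predictor rather than from on-device measurement, which is what makes evaluating many candidates cheap.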
[131] Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations
Xiaoxiang Dong,Matthew Johnson-Roberson,Weiming Zhi
Main category: cs.RO
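TL;DR: The paper proposes a framework called JFTO that jointly optimizes grasp pose and object trajectory to address the feasibility of learning robot motions from video demonstrations.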
Details
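Motivation: Human video demonstrations offer a scalable alternative for robot learning, but direct imitation is difficult because of embodiment differences and joint feasibility constraints. The paper aims to address these problems.
Contribution: Proposes the JFTO framework, which optimizes grasp pose selection and object trajectory generation jointly while avoiding collisions, and extends flow matching to support probabilistic modeling.
Method: Jointly optimizes grasp similarity, trajectory likelihood, and collision penalties, and uses SE(3) flow matching to model object trajectories probabilistically, enabling density-aware imitation.
Result: The method is validated on diverse manipulation tasks in simulation and in the real world.
Insight: Treating demonstrations as object-centric guidance, rather than directly imitating human motion, addresses the feasibility of robot motion generation more effectively.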
Abstract: Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to $SE(3)$ for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse manipulation tasks.
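To make the idea of a unified objective concrete, the sketch below combines the three terms named in the abstract into one scalar cost and picks the cheapest candidate. Everything specific here is an assumption for illustration: the weighted-sum form, the weights, the `Candidate` fields, and the numbers are not taken from the paper, and a real implementation would optimize a differentiable objective rather than rank a fixed list.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    grasp_similarity: float     # in [0, 1]; higher means closer to the demonstrated grasp
    traj_log_likelihood: float  # log-density of the object trajectory under the learned model
    collision_penalty: float    # >= 0; zero means collision-free


def total_cost(c: Candidate, w_grasp: float = 1.0,
               w_traj: float = 1.0, w_coll: float = 10.0) -> float:
    """Hypothetical weighted sum: low cost = similar grasp, likely trajectory, no collision."""
    return (w_grasp * (1.0 - c.grasp_similarity)
            - w_traj * c.traj_log_likelihood
            + w_coll * c.collision_penalty)


# Made-up candidates: the first has a better grasp but collides.
candidates = [
    Candidate(grasp_similarity=0.9, traj_log_likelihood=-1.0, collision_penalty=0.5),
    Candidate(grasp_similarity=0.7, traj_log_likelihood=-1.2, collision_penalty=0.0),
]
best = min(candidates, key=total_cost)
print(best)
```

With these weights the collision term dominates, so the collision-free candidate wins despite its less similar grasp.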
[132] SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning
Guoyang Zhao,Yudong Li,Weiqing Qi,Kai Zhang,Bonan Liu,Kai Chen,Haoang Li,Jun Ma
Main category: cs.RO
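TL;DR: The paper proposes a SLAM-free visual navigation framework that combines semantic reasoning with lightweight topological representations for task-driven exploration and navigation.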
Details
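Motivation: Conventional SLAM pipelines are fragile under rapid motion, calibration demands, and sensor drift, and offer little semantic reasoning. To address this, the authors propose a new navigation framework based on vision and language.
Contribution: 1. Proposes a SLAM-free visual navigation framework; 2. Designs a hierarchical vision-language perception module that combines scene-level and object-level information; 3. Introduces a semantic-probabilistic topological map that supports coarse-to-fine planning.
Method: 1. A hierarchical vision-language perception module fuses scene-level and object-level semantic information; 2. A semantic-probabilistic topological map is used for global reasoning (LLM-based subgoal selection) and local planning (vision-based obstacle avoidance); 3. Reinforcement-learning locomotion controllers make the framework deployable on a range of legged robot platforms.
Result: Experiments in simulation and real-world settings show consistent improvements in semantic accuracy, planning quality, and navigation success. Ablation studies confirm the need for both hierarchical perception and fine local planning.
Insight: The work proposes a new navigation paradigm that shifts from geometry-centric mapping to semantics-driven decision making, offering a new direction for robotic exploration.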
Abstract: Conventional SLAM pipelines for legged robot navigation are fragile under rapid motion, calibration demands, and sensor drift, while offering limited semantic reasoning for task-driven exploration. To deal with these issues, we propose a vision-only, SLAM-free navigation framework that replaces dense geometry with semantic reasoning and lightweight topological representations. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference. A semantic-probabilistic topological map supports coarse-to-fine planning: LLM-based global reasoning for subgoal selection and vision-based local planning for obstacle avoidance. Integrated with reinforcement-learning locomotion controllers, the framework is deployable across diverse legged robot platforms. Experiments in simulation and real-world settings demonstrate consistent improvements in semantic accuracy, planning quality, and navigation success, while ablation studies further showcase the necessity of both hierarchical perception and fine local planning. This work introduces a new paradigm for SLAM-free, vision-language-driven navigation, shifting robotic exploration from geometry-centric mapping to semantics-driven decision making.
[133] Autoregressive End-to-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement
Jianbo Zhao,Taiyu Ban,Xiangjie Li,Xingtai Gui,Hangning Zhou,Lei Liu,Hongwei Zhao,Bin Li
Main category: cs.RO
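TL;DR: The paper proposes an autoregressive end-to-end planning method that uses a Time-Invariant Spatial Alignment (TISA) module and multi-objective policy refinement with Direct Preference Optimization (DPO) to address spatio-temporal misalignment and multi-objective behavior optimization.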
Details
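Motivation: Autoregressive models perform well in end-to-end planning for autonomous driving, but they are limited by spatio-temporal misalignment: future actions must be conditioned on past sensory data, which caps the achievable performance.
Contribution: 1. Proposes the TISA module, which projects environmental features into a consistent ego-centric frame to resolve the spatio-temporal misalignment; 2. Introduces a kinematic action prediction head to ensure physically feasible trajectories; 3. Applies multi-objective DPO to provide a more fine-grained learning signal.
Method: 1. The TISA module learns to project initial environmental features into the ego-centric frame of each future time step; 2. A kinematic action prediction head generates physically feasible trajectories; 3. Multi-objective DPO provides feedback on specific driving behaviors.
Result: Achieves 89.8 PDMS on the NAVSIM dataset, state of the art among autoregressive models.
Insight: 1. The TISA module corrects the agent's worldview directly, without explicit future scene prediction; 2. Multi-objective DPO gives more targeted feedback on specific behaviors than a single overall objective, improving planning performance.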
Abstract: The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent’s worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models. The video document is available at https://tisa-dpo-e2e.github.io/.
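A kinematic action head predicts controls (acceleration and yaw rate) rather than waypoints, so a trajectory is obtained by integrating those controls forward. The sketch below shows one way this could look, assuming a simple unicycle model with Euler integration; the `rollout` function, the time step, and the clamp on negative speed are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

State = Tuple[float, float, float, float]  # (x, y, heading in rad, speed in m/s)


def rollout(x: float, y: float, heading: float, speed: float,
            actions: Sequence[Tuple[float, float]], dt: float = 0.5) -> List[State]:
    """Integrate (acceleration, yaw_rate) pairs into a trajectory of states."""
    traj: List[State] = []
    for accel, yaw_rate in actions:
        speed = max(0.0, speed + accel * dt)  # assumption: the vehicle never reverses
        heading += yaw_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        traj.append((x, y, heading, speed))
    return traj


# Constant 10 m/s, no steering, four steps of 0.5 s: a straight 20 m segment.
straight = rollout(0.0, 0.0, 0.0, 10.0, [(0.0, 0.0)] * 4, dt=0.5)
print(straight[-1])
```

Because every position is derived from bounded controls, the resulting path cannot jump or turn on the spot, which is the sense in which such a head keeps trajectories physically feasible.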
[134] KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models
Sibo Li,Qianyue Hao,Yu Shang,Yong Li
Main category: cs.RO
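TL;DR: KeyWorld proposes a framework that uses semantic key frame reasoning to make robotic world models more efficient and effective, substantially speeding up inference and improving the physical plausibility of generated trajectories.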
Details
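Motivation: Current world models based on frame-to-frame generation suffer from redundant computation and insufficient physical plausibility in generated trajectories, which limits their use in settings such as real-time robot control.
Contribution: Proposes the KeyWorld framework, which identifies semantic key frames, concentrates computation on generating them, and fills the intermediate frames with a lightweight interpolation model, substantially improving efficiency and generation quality.
Method: 1. Identifies key frames by simplifying motion trajectories; 2. Trains a DiT model to generate key frames from textual task descriptions; 3. Fills intermediate frames with a lightweight interpolation model.
Result: On the LIBERO benchmark, KeyWorld achieves a 5.68$\times$ speedup over the frame-to-frame generation baseline and improves the physical plausibility of generated videos, especially on complex tasks.
Insight: Key frame reasoning offers an efficient and physically plausible approach that may open a path to deploying world models in real-time robot control and similar domains.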
Abstract: Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real-world applications. This stems from the redundancy of the prevailing frame-to-frame generation approach, where the model conducts costly computation on similar frames and neglects the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text-conditioned robotic world models by concentrating transformer computation on a few semantic key frames while employing a lightweight convolutional model to fill the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot’s motion trajectories, obtaining the ground truth key frames. Then, a DiT model is trained to reason and generate these physically meaningful key frames from textual task descriptions. Finally, a lightweight interpolator efficiently reconstructs the full video by inpainting all intermediate frames. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68$\times$ acceleration compared to the frame-to-frame generation baseline, and focusing on the motion-aware key frames further contributes to the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real-time robotic control and other domains requiring both efficient and effective world models. Code is released at https://anonymous.4open.science/r/Keyworld-E43D.
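The abstract does not say which simplification procedure is used to find key frames. As one plausible illustration, the sketch below applies the Ramer-Douglas-Peucker line simplification algorithm to a trajectory and treats the retained indices as key frames. The choice of algorithm, the function names, and the tolerance are all assumptions, not the KeyWorld method.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def _point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the segment a-b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(v * v for v in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    closest = [ai + t * v for ai, v in zip(a, ab)]
    return math.dist(p, closest)


def key_frame_indices(points: List[Point], tol: float) -> List[int]:
    """Indices kept by Ramer-Douglas-Peucker simplification with tolerance tol."""
    if not points:
        return []
    keep = {0, len(points) - 1}
    stack = [(0, len(points) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue  # no interior points to test
        # Find the interior point farthest from the chord lo-hi.
        d, idx = max((_point_segment_distance(points[i], points[lo], points[hi]), i)
                     for i in range(lo + 1, hi))
        if d > tol:
            keep.add(idx)  # a significant turn: keep it and refine both halves
            stack.append((lo, idx))
            stack.append((idx, hi))
    return sorted(keep)


# Hypothetical end-effector path with one sharp corner at index 2.
path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
print(key_frame_indices(path, tol=0.1))
```

On this L-shaped path only the endpoints and the corner survive, which matches the intuition that key frames sit where the motion changes character.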
[135] Cross-Modal Instructions for Robot Motion Generation
William Barron,Xiaoxiang Dong,Matthew Johnson-Roberson,Weiming Zhi
Main category: cs.RO
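TL;DR: The paper proposes a new paradigm that guides robot motion generation with cross-modal instructions (such as text labels), avoiding the burden of traditional physical demonstrations. It combines a large vision-language model (VLM) with a smaller fine-grained pointing model to generate robot behaviors efficiently, and refines them further with reinforcement learning.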
Details
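Motivation: Traditional robot behavior learning relies on physical demonstrations (such as teleoperation or kinesthetic teaching), which are hard to collect and hard to scale. The paper proposes replacing physical demonstrations with cross-modal instructions (such as text labels) to improve data efficiency and scalability.
Contribution: 1. Proposes the CrossInstruct framework, which uses cross-modal instructions (text labels) to drive robot motion generation. 2. Combines a large VLM with a smaller fine-grained pointing model to generate executable 3D motion trajectories. 3. Demonstrates the framework in simulation and on real hardware, and shows that it provides a strong initialization for reinforcement learning.
Method: 1. Cross-modal instructions are placed in the context of a large VLM as examples. 2. The VLM iteratively queries a smaller fine-grained pointing model and synthesizes 2D motion over multiple views. 3. The multi-view results are fused into 3D motion trajectories. 4. Reinforcement learning further refines the generated behaviors.
Result: CrossInstruct performs well in simulation and on real hardware, generating executable behaviors without additional fine-tuning and providing a strong initialization for subsequent reinforcement learning.
Insight: Cross-modal instructions offer an efficient, scalable alternative for robot behavior learning; combining the strengths of large and small models enables generalization on complex tasks.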
Abstract: Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are then fused into a coherent distribution over 3D motion trajectories in the robot’s workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environment in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.
[136] Human-like Navigation in a World Built for Humans
Bhargav Chandaka,Gloria X. Wang,Haozhe Chen,Henry Che,Albert J. Zhai,Shenlong Wang
Main category: cs.RO
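TL;DR: ReasonNav is a modular navigation system that uses the reasoning capabilities of a vision-language model (VLM) to achieve human-like navigation behaviors such as reading signs and asking for directions.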
Details
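Motivation: Existing robot navigation systems lack human-like navigation behaviors, which makes them inefficient in large environments. The paper aims to improve navigation efficiency by imitating human navigation skills such as reading signs and asking for directions.
Contribution: Proposes ReasonNav, which integrates the reasoning capabilities of a VLM to achieve human-like navigation behaviors, improving robot navigation efficiency in complex buildings.
Method: Designs compact input and output abstractions based on navigation landmarks so that the VLM can focus on language understanding and reasoning. The system is modular and integrates human-like navigation skills.
Result: Evaluations on real and simulated navigation tasks show that ReasonNav uses higher-order reasoning to navigate large, complex buildings efficiently.
Insight: By interacting naturally with the environment (reading signs, asking for directions), robots can complete tasks more efficiently, which shows the potential of VLMs for navigation.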
Abstract: When navigating in a man-made environment they haven’t visited before, such as an office building, humans employ behaviors such as reading signs and asking others for directions. These behaviors help humans reach their destinations efficiently by reducing the need to search through large areas. Existing robot navigation systems lack the ability to execute such behaviors and are thus highly inefficient at navigating within large environments. We present ReasonNav, a modular navigation system which integrates these human-like navigation skills by leveraging the reasoning capabilities of a vision-language model (VLM). We design compact input and output abstractions based on navigation landmarks, allowing the VLM to focus on language understanding and reasoning. We evaluate ReasonNav on real and simulated navigation tasks and show that the agent successfully employs higher-order reasoning to navigate efficiently in large, complex buildings.