Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 53]
- cs.RO [Total: 3]
- q-bio.QM [Total: 1]
- eess.IV [Total: 8]
- cs.CR [Total: 1]
- cs.DB [Total: 1]
- cs.MM [Total: 2]
- cs.AI [Total: 4]
- eess.AS [Total: 1]
- cs.HC [Total: 1]
- cs.AR [Total: 1]
- cs.LG [Total: 3]
cs.CL
[1] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si,Lin Sun,Zhewen Tan,Xiangzheng Zhang
Main category: cs.CL
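TL;DR: The paper proposes a magic-token-guided co-training framework that unifies several safety behaviors (positive, negative, and rejective) and provides efficient, fine-grained, dynamically switchable safety control for LLMs, simplifying the training pipeline and reducing deployment cost.
Details
Motivation: Current content-safety methods for LLMs, such as SFT and RLHF, rely on multi-stage training pipelines and lack fine-grained controllability after deployment. The paper aims to remove these limitations with a more flexible and efficient approach.
Contribution: 1. A co-training framework that integrates multiple safety behaviors within a single SFT stage. 2. Dynamic behavior switching through a system-level instruction (the magic token), supporting diverse deployment scenarios. 3. Experiments showing that the method matches the safety alignment quality of SFT+DPO while significantly reducing complexity.
Method: A magic-token-guided co-training framework jointly trains positive (lawful/prosocial), negative (unfiltered/risk-prone), and rejective (refusal-oriented/conservative) behaviors, each activated at inference time by a simple instruction.
Result: The 8B model surpasses DeepSeek-R1 (671B) in safety performance while significantly reducing training complexity and deployment cost.
Insight: The paper shows that co-training induces a Safety Alignment Margin in the output space, which gives empirical support for the model's robustness and controllability.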
Abstract: Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. The existence of this margin provides empirical evidence for the model’s safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.
[2] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages
Israel Abebe Azime,Tadesse Destaw Belay,Dietrich Klakow,Philipp Slusallek,Anshuman Chhabra
Main category: cs.CL
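TL;DR: The paper proposes an LLM-based framework that automatically generates socio-culturally localized math word problem datasets for low-resource languages, addressing the bias introduced by English-centric entities in existing translated datasets.
Details
Motivation: Most multilingual math word problem datasets are translated from English and retain English-centric entities (person names, organization names, currencies), so they fail to reflect the real socio-cultural context of the target language.
Contribution: An LLM-driven framework that automatically builds math word problem datasets with localized entities (for example, native names and currencies), reducing English-centric bias.
Method: LLMs automatically construct localized datasets from existing sources by replacing entities with culturally appropriate native ones.
Result: Experiments show that the framework helps mitigate English-centric entity bias and improves robustness when native entities are introduced.
Insight: Translated datasets can obscure a model's true multilingual math ability in localized settings, so localization matters when measuring model performance.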
Abstract: Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotator-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.
[3] Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Bolian Li,Yanran Wu,Xinyu Luo,Ruqi Zhang
Main category: cs.CL
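TL;DR: The paper proposes reward-Shifted Speculative Sampling (SSS), an efficient test-time weak-to-strong alignment algorithm that exploits the distribution shift between an aligned small draft model and an unaligned target model to substantially cut inference cost.
Details
Motivation: Test-time alignment techniques are hard to apply widely because of their high inference cost; the paper targets this efficiency bottleneck.
Contribution: The SSS algorithm, which exploits the distribution shift between the aligned draft model and the unaligned target model to recover the RLHF optimal solution without changing the target model.
Method: The draft model is aligned with human preferences and predicts future tokens; the acceptance criterion and the bonus token distribution are modified so that alignment is achieved efficiently.
Result: In test-time weak-to-strong alignment experiments, SSS achieves higher gold reward scores at a significantly lower inference cost.
Insight: Combining distribution shift with speculative sampling can make alignment more efficient while preserving model performance.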
Abstract: Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
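For orientation, the following is a minimal sketch of one step of standard speculative sampling, the acceleration technique the abstract builds on. It is not the SSS algorithm: the abstract does not give the modified acceptance criterion or bonus token distribution, so only the unmodified baseline rule is shown. The function name and the toy three-token vocabulary are assumptions made for illustration.

```python
import random


def speculative_step(draft_probs, target_probs, rng=random):
    """One token of standard speculative sampling over a toy vocabulary.

    draft_probs and target_probs are probability lists over the same vocabulary.
    Returns (token_id, accepted), where accepted tells whether the draft
    proposal was kept. This is the baseline rule, not the SSS variant.
    """
    vocab = range(len(draft_probs))
    # 1) The small draft model proposes a token.
    proposal = rng.choices(vocab, weights=draft_probs, k=1)[0]
    # 2) Keep it with probability min(1, p_target / p_draft).
    accept_prob = min(1.0, target_probs[proposal] / draft_probs[proposal])
    if rng.random() < accept_prob:
        return proposal, True
    # 3) Otherwise resample from the residual max(0, target - draft);
    #    random.choices normalizes the weights internally.
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    return rng.choices(vocab, weights=residual, k=1)[0], False
```

With this baseline rule the emitted token is distributed exactly according to the target model, which is why shifting the output toward an aligned distribution requires changing the acceptance criterion and the resampling distribution, as the abstract describes.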
[4] LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
MohamamdJavad Ardestani,Ehsan Kamalloo,Davood Rafiei
Main category: cs.CL
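TL;DR: LongRecall is a three-stage recall evaluation framework that decomposes answers, filters candidate matches, and verifies alignment with structured checks, substantially improving recall accuracy on long-form QA tasks.
Details
Motivation: In domains such as medicine and law, completeness of machine-generated text is crucial, yet existing recall metrics rely on lexical overlap or holistic LLM judging and are prone to misjudgments or hallucinations.
Contribution: The LongRecall framework, which uses structured decomposition and verification to reduce false positives and false negatives while accommodating diverse phrasings and contextual variations.
Method: Three stages: 1) decompose answers into self-contained facts; 2) filter candidates through lexical and semantic matching; 3) verify alignment through structured entailment checks.
Result: On three long-form QA benchmarks, LongRecall substantially outperforms lexical-overlap and LLM-as-a-Judge baselines in recall accuracy.
Insight: Structured decomposition and verification are key to robust recall evaluation, especially under complex contexts and varied phrasing.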
Abstract: The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
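The three-stage shape described above (decompose, filter, verify) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the reference answer is taken as already decomposed into facts, the filter is purely lexical (the semantic filter is omitted), and `toy_entails` is a hypothetical stand-in for the structured entailment check, which in the paper is not a token-subset test.

```python
def tokenize(text):
    """Lowercase and split on whitespace, treating commas and periods as spaces."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def lexical_candidates(fact, sentences, min_overlap=0.3):
    """Stage 2 (lexical part only): keep sentences sharing enough tokens with the fact."""
    fact_tokens = tokenize(fact)
    kept = []
    for sentence in sentences:
        overlap = len(fact_tokens & tokenize(sentence)) / max(1, len(fact_tokens))
        if overlap >= min_overlap:
            kept.append(sentence)
    return kept


def toy_entails(premise, hypothesis):
    """Hypothetical stand-in for stage 3: every hypothesis token occurs in the premise."""
    return tokenize(hypothesis) <= tokenize(premise)


def recall(reference_facts, generated_sentences, entails=toy_entails):
    """Fraction of reference facts supported by at least one filtered candidate."""
    if not reference_facts:
        return 0.0
    supported = 0
    for fact in reference_facts:
        candidates = lexical_candidates(fact, generated_sentences)
        if any(entails(candidate, fact) for candidate in candidates):
            supported += 1
    return supported / len(reference_facts)
```

Filtering before verification means the expensive check runs only on plausible pairs; in practice the `entails` argument would be an NLI model or an LLM call rather than the toy subset test.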
[5] Mapping the Course for Prompt-based Structured Prediction
Matt Pauk,Maria Leonor Pacheco
Main category: cs.CL
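TL;DR: The paper examines how combining large language models (LLMs) with combinatorial inference improves consistency and accuracy in structured prediction, and studies effective prompting strategies and the value of structured learning in the LLM era.
Details
Motivation: Although LLMs perform well on many language tasks, their autoregressive nature leaves them prone to hallucinations and weak at complex reasoning. The paper addresses these issues in structured prediction by combining LLMs with combinatorial inference.
Contribution: 1. A method that combines LLMs with symbolic inference to improve the consistency and accuracy of structured prediction; 2. A study of how prompting strategies affect LLM confidence estimation; 3. Evidence that structured learning and fine-tuning remain valuable in the LLM era.
Method: Experiments evaluate how well different prompting strategies estimate LLM confidence values, then add symbolic inference on top, together with structured learning and fine-tuning.
Result: Regardless of the prompting strategy, adding symbolic inference yields more consistent and accurate predictions; structured learning and fine-tuning further improve performance on challenging tasks.
Insight: Structured learning and symbolic inference remain effective ways to improve LLMs on complex tasks, and the choice of prompting strategy deserves attention.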
Abstract: LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, the addition of symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives lead to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.
[6] Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
Rabeeh Karimi Mahabadi,Sanjeev Satheesh,Shrimai Prabhumoye,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro
Main category: cs.CL
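TL;DR: Nemotron-CC-Math is a high-quality math pretraining dataset built with a novel, domain-agnostic pipeline that extracts scientific text, especially math, from Common Crawl and standardizes it into LaTeX.
Details
Motivation: Existing math datasets suffer from degraded quality because of brittle extraction methods and information loss during conversion, so a more robust extraction and standardization approach is needed.
Contribution: A novel pipeline that robustly extracts scientific text (especially math) from Common Crawl, and a resulting corpus that the authors report sets a new state of the art among open math pretraining datasets.
Method: Layout-aware rendering with lynx is combined with an LLM-based cleaning stage to recover math in multiple formats and standardize it into LaTeX.
Result: Pretraining a Nemotron-T 8B model on the dataset yields clear gains on math (MATH) and code (MBPP+) tasks, and also improves general-domain performance (MMLU).
Insight: Robust extraction and standardization of scientific text, math in particular, is key to improving LLM math and code reasoning.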
Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets (including MegaMath, FineMath, and OpenWebMath) but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
[7] Identifying and Answering Questions with False Assumptions: An Interpretable Approach
Zijie Wang,Eduardo Blanco
Main category: cs.CL
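TL;DR: The paper presents an interpretable approach for identifying and answering questions with false assumptions, validating the assumptions against external evidence to reduce LLM hallucinations.
Details
Motivation: LLMs often produce misleading answers due to hallucination when a question rests on a false assumption, which calls for a more reliable and interpretable solution.
Contribution: Reducing the problem to fact verification, proposing an approach that uses external evidence to mitigate hallucinations, and showing experimentally that generating and validating atomic assumptions is effective.
Method: Two steps: first verify the question's assumptions against external evidence (fact verification), then generate and validate atomic assumptions to improve answer accuracy and interpretability.
Result: Experiments show that incorporating external evidence is beneficial, and that generating and validating atomic assumptions brings further gains while giving clearer explanations.
Insight: The work highlights the importance of external evidence and atomic-assumption validation, and suggests a new way for LLMs to handle questions with false assumptions.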
Abstract: People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate reducing the problem to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.
[8] ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Seungmin Han,Haeun Kwon,Ji-jun Park,Taeyang Yoon
Main category: cs.CL
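TL;DR: The paper proposes ContextualLVLM-Agent, a framework for multi-turn visually-grounded dialogue and complex instruction following. It introduces the MMDR-Bench dataset and a memory-perception-planning-execution cycle that requires no extensive re-training of the underlying model.
Details
Motivation: Existing LLMs and vision-language models perform poorly on multi-turn, complex, visually-grounded tasks, with context loss and visual hallucinations being prominent; a more holistic framework is needed.
Contribution: The MMDR-Bench dataset and the ContextualLVLM-Agent framework, which improves reasoning depth, instruction adherence, and error suppression in multi-modal interaction.
Method: An iterative memory-perception-planning-execution cycle strengthens reasoning and instruction following without extensive re-training of the underlying models.
Result: On MMDR-Bench, ContextualLVLM-Agent attains an average score of 4.03, ahead of GPT-4o (3.92) and Gemini 1.5 Pro (3.85).
Insight: Modular design and an iterative approach are key to multi-modal interaction performance, and remain robust over extended dialogues and complex instructions.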
Abstract: Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative “memory-perception-planning-execution” cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.
[9] SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
Dong Liu,Yanxuan Yu
Main category: cs.CL
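TL;DR: SemToken is a semantic-aware tokenization framework that reduces token redundancy and dynamically adjusts token granularity, improving the efficiency of long-context language models with little or no loss in quality.
Details
Motivation: Current tokenization methods such as BPE or WordPiece rely purely on frequency statistics and ignore the semantic structure of text, causing semantic redundancy and wasted computation in long contexts. SemToken targets this problem.
Contribution: The SemToken framework, which brings semantic structure into tokenization through local semantic clustering and heterogeneous granularity allocation, reducing token count and improving computational efficiency.
Method: SemToken extracts contextual semantic embeddings with lightweight encoders, merges semantically similar tokens through local semantic clustering, and adjusts token granularity according to semantic density.
Result: On long-context benchmarks such as WikiText-103 and LongBench, SemToken achieves up to a 2.4× reduction in token count and a 1.9× speedup, with negligible impact on perplexity and downstream accuracy.
Insight: Semantic structure offers a new axis for optimizing tokenization and computation in LLMs; dynamically adjusting token granularity balances efficiency and performance.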
Abstract: Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
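The idea of merging semantically equivalent neighbouring tokens can be sketched as a greedy pass over adjacent embeddings. This is only an illustration under stated assumptions: the cosine-similarity criterion, the threshold of 0.9, the running-centroid update, and the function names are all hypothetical choices, since the abstract does not specify the clustering procedure or the encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_adjacent(tokens, embeddings, threshold=0.9):
    """Greedily merge each token into the previous span when it is similar enough.

    Returns a list of spans, each a list of the original tokens it covers.
    A span is represented by the running mean (centroid) of its embeddings.
    """
    spans = []  # each entry: (tokens_in_span, centroid)
    for token, embedding in zip(tokens, embeddings):
        if spans and cosine(spans[-1][1], embedding) >= threshold:
            span_tokens, centroid = spans[-1]
            n = len(span_tokens)
            new_centroid = [(c * n + e) / (n + 1) for c, e in zip(centroid, embedding)]
            spans[-1] = (span_tokens + [token], new_centroid)
        else:
            spans.append(([token], list(embedding)))
    return [span_tokens for span_tokens, _ in spans]
```

A repetitive run collapses into one span while dissimilar tokens stay separate, which is the sense in which low-entropy regions are compressed more coarsely than content-rich ones.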
[10] Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Yuanchen Zhou,Shuo Jiang,Jie Zhu,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang
Main category: cs.CL
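TL;DR: Fin-PRM is a process reward model specialized for finance that supervises the intermediate reasoning of LLMs to improve logical and factual accuracy on financial tasks.
Details
Motivation: Existing PRMs mainly target general or STEM domains and fall short in finance, where reasoning is more structured and symbolic and is sensitive to factual and regulatory correctness.
Contribution: Fin-PRM, a reward model tailored to financial tasks that supports both offline and online reward learning and improves the selection of reasoning trajectories.
Method: Step-level and trajectory-level reward supervision are integrated and applied in three settings: supervised fine-tuning, reinforcement learning, and Best-of-N inference at test time.
Result: On the CFLUE and FinQA benchmarks, Fin-PRM outperforms general-purpose PRMs and baseline models, with clear gains across the downstream tasks.
Insight: Domain-specialized reward models are valuable for improving LLM reasoning in professional domains.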
Abstract: Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
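Reward-informed Best-of-N inference, application (iii) above, amounts to scoring N sampled reasoning traces and keeping the highest-scoring one. The sketch below assumes the scores are already available and combines them as a weighted sum of the mean step score and the trajectory score; that combination rule, the `Candidate` type, and the default weight are hypothetical, since the abstract does not say how Fin-PRM aggregates its two reward levels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    answer: str
    step_scores: List[float]   # one reward per intermediate reasoning step
    trajectory_score: float    # one reward for the trace as a whole


def combined_reward(candidate: Candidate, step_weight: float = 0.5) -> float:
    """Assumed aggregation: weighted sum of mean step reward and trajectory reward."""
    if candidate.step_scores:
        mean_step = sum(candidate.step_scores) / len(candidate.step_scores)
    else:
        mean_step = 0.0
    return step_weight * mean_step + (1.0 - step_weight) * candidate.trajectory_score


def best_of_n(candidates: List[Candidate], step_weight: float = 0.5) -> Candidate:
    """Return the sampled candidate with the highest combined reward."""
    return max(candidates, key=lambda c: combined_reward(c, step_weight))
```

The same scoring function could in principle rank traces for distillation data selection, application (i), by keeping the top-scoring traces instead of a single winner.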
[11] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Huanxuan Liao,Yixing Xu,Shizhu He,Guanchen Li,Xuanwu Yin,Dong Li,Emad Barsoum,Jun Zhao,Kang Liu
Main category: cs.CL
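TL;DR: SparK is a training-free, plug-and-play method that prunes the KV cache at the channel level and dynamically restores pruned entries, easing the KV cache bottleneck of long-context LLM inference and improving memory and compute efficiency.
Details
Motivation: Long-context inference in LLMs is constrained by the memory and compute cost of the KV cache. Existing methods mostly compress the KV cache along the temporal axis and neglect fine-grained importance variations across feature dimensions (the channel axis), which makes it hard to balance efficiency and accuracy.
Contribution: SparK, a training-free channel-level KV pruning method that dynamically restores pruned entries, reduces KV cache storage while maintaining accuracy, and is compatible with existing KV compression and quantization techniques.
Method: Building on the observation that channel saliency varies dramatically across queries and positions, SparK selectively prunes KV cache channels and dynamically restores the pruned entries during attention score computation, reducing redundancy.
Result: SparK processes longer sequences within the same memory budget and cuts KV cache storage by over 30% compared to eviction-based methods; at an 80% pruning ratio, degradation stays below 5% relative to the baseline eviction method.
Insight: Fine-grained channel-level pruning balances memory efficiency and model performance, and the dynamic restoration mechanism limits the information lost to pruning.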
Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
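The linear growth of KV cache memory with sequence length can be made concrete with a short calculation. The sketch covers a dense, unpruned cache only and does not model SparK itself; the function name and the model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Size of a dense KV cache in bytes.

    Each token stores one key vector and one value vector (hence the factor 2)
    per KV head and per layer, each of head_dim values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token


if __name__ == "__main__":
    # Hypothetical model shape; doubling the context doubles the cache.
    for seq_len in (8192, 32768, 131072):
        gib = kv_cache_bytes(seq_len, 32, 8, 128) / 2**30
        print(f"{seq_len:>7} tokens: {gib:.1f} GiB")
```

Because the per-token cost is a product that includes `head_dim`, reducing what is stored along the channel axis shrinks the cache without dropping any tokens, which is the axis the abstract argues eviction-based methods leave untouched.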
[12] Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering
Bolei He,Xinran He,Run Shao,Shanfu Shu,Xianwei Xue,Mingquan Cheng,Haifeng Li,Zhenhua Ling
Main category: cs.CL
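TL;DR: The paper proposes Select2Know (S2K), a framework that uses an internal-external knowledge self-selection strategy to improve LLM performance on domain-specific QA at low cost, outperforming existing methods in experiments.
Details
Motivation: LLMs perform well on general QA but struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) brings in external knowledge but suffers from hallucinations and latency, while continued pretraining is costly and lacks cross-domain flexibility.
Contribution: 1. The S2K framework, which combines internal and external knowledge through self-selection to improve domain QA at low cost; 2. A structured reasoning data generation pipeline; 3. Integration of GRPO to strengthen reasoning.
Method: S2K learns domain knowledge progressively through selective supervised fine-tuning and a knowledge self-selection strategy, and uses GRPO to enhance reasoning ability.
Result: On medical, legal, and financial QA benchmarks, S2K outperforms existing methods and matches domain-pretrained LLMs at significantly lower cost.
Insight: Knowledge acquisition should be progressive (from understanding concepts to complex reasoning); under a long-tail distribution, partial internal knowledge tends to be underutilized, and self-selection can put it to use.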
Abstract: Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Select2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.
[13] WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai
Peerat Limkonchotiwat,Pume Tuchinda,Lalita Lowphansirikul,Surapon Nonesung,Panuthep Tasawong,Alham Fikri Aji,Can Udomcharoenchaikit,Sarana Nutanong
Main category: cs.CL
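TL;DR: The paper presents WangchanThaiInstruct, a dataset for evaluation and instruction tuning that targets performance on Thai language and culture-specific tasks.
Details
Motivation: LLMs excel at instruction following in English, but their ability in low-resource languages such as Thai remains underexplored; existing benchmarks often rely on translation and miss cultural and domain-specific nuances.
Contribution: A human-authored, multitask, multi-domain Thai dataset, built with human annotation and expert review, that supports culture-aware and professional-task evaluation and helps improve LLM alignment in low-resource languages.
Method: A multi-stage quality control process involving annotators, domain experts, and AI researchers produces the dataset, which is then used for a zero-shot evaluation and an instruction tuning study that compares against translated data.
Result: Models fine-tuned on native data outperform those fine-tuned on translated data on both in-domain and out-of-domain benchmarks.
Insight: Culturally and professionally grounded instruction data is essential for LLM performance in low-resource languages; translated data does not match the quality of native data.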
Abstract: Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
[14] EMNLP: Educator-role Moral and Normative Large Language Models Profiling
Yilin Jiang,Mingzi Zhang,Sheng Jin,Zengyi Yu,Xiangjie Kong,Binghao Tu
Main category: cs.CL
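TL;DR: The paper proposes EMNLP, a framework for evaluating the personality traits, moral development stage, and ethical risk of teacher-role large language models (LLMs). Teacher-role LLMs do well at moral reasoning but are vulnerable to harmful prompts.
Details
Motivation: LLMs that simulate professions (such as teachers) lack psychological and ethical evaluation, particularly in education. EMNLP aims to fill this gap with a comprehensive evaluation framework.
Contribution: 1. The EMNLP framework, which extends existing scales and constructs 88 teacher-specific moral dilemmas for assessment. 2. The first systematic evaluation of the psychological and ethical alignment of teacher-role LLMs. 3. The finding that moral reasoning capability and safety are in tension in LLMs.
Method: 1. Evaluate models with teacher-specific moral dilemmas. 2. Design a targeted soft prompt injection set to assess compliance and vulnerability. 3. Run experiments on 12 LLMs.
Result: Teacher-role LLMs show more idealized and polarized personalities, reason well about abstract moral questions, and struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompts, and hyperparameters such as temperature have limited influence.
Insight: Moral reasoning capability and safety are in tension in LLMs, which suggests that educational AI needs to balance capability against ethical risk.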
Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.
[15] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Eunseong Choi,June Park,Hyeri Lee,Jongwuk Lee
Main category: cs.CL
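TL;DR: The paper proposes Conflict-Aware REtrieval-Augmented Generation (CARE), which pairs a context assessor with a base LLM to resolve context-memory conflicts in RAG.
Details
Motivation: In retrieval-augmented generation (RAG), retrieved external context may contradict the LLM's parametric knowledge and lead to wrong outputs. The paper aims to resolve such conflicts.
Contribution: CARE, which combines a context assessor with soft prompting to identify and handle unreliable context, making RAG systems more reliable.
Method: The context assessor encodes raw context tokens into compact memory token embeddings; soft prompting trains it to discern unreliable context and to guide reasoning toward the more reliable knowledge source.
Result: CARE gains 5.0% on average on QA and fact-checking benchmarks and effectively mitigates context-memory conflicts.
Insight: Using soft prompting to adjust how much the model relies on retrieved context is an effective direction for resolving knowledge conflicts in RAG.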
Abstract: Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM’s parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
[16] TComQA: Extracting Temporal Commonsense from Text
Lekshmi R Nair,Arun Sankar,Koninika Pal
Main category: cs.CL
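TL;DR: The paper proposes a pipeline for extracting temporal commonsense from text and builds a dataset, TComQA, for training language models on temporal reasoning tasks.
Details
Motivation: Understanding events requires temporal context, but natural language rarely states temporal commonsense explicitly, so even large language models (LLMs) struggle to generate text that needs temporal reasoning. Mining temporal commonsense automatically can make language models more robust.
Contribution: 1) A pipeline that uses LLMs to automatically extract temporal commonsense from text; 2) the TComQA dataset; 3) validation of the dataset on temporal question answering.
Method: LLMs extract temporal commonsense from the SAMSum and RealNews corpora to build TComQA. Data quality is validated through crowdsourcing, and models are trained on it to assess performance.
Result: TComQA achieves over 80% precision in extracting temporal commonsense, and the model trained on it outperforms an LLM fine-tuned on an existing temporal question answering dataset.
Insight: Explicitly extracting temporal commonsense helps language models on temporal reasoning and also exposes the limitations of LLMs on this task.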
Abstract: Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but cannot take months. Recent studies indicate that even advanced large language models (LLMs) struggle in generating text that requires reasoning with temporal commonsense due to its infrequent explicit mention in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on an existing temporal question answering dataset.
[17] CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing
Abdul Rehman,Jian-Jun Zhang,Xiaosong Yang
Main category: cs.CL
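TL;DR: CUPE is a lightweight universal phoneme encoder that captures key phoneme features within 120 milliseconds, processes fixed-length windows independently, and achieves competitive cross-lingual performance.
Details
Motivation: Universal phoneme recognition usually requires analyzing long speech segments and language-specific patterns, whereas many speech processing tasks need pure phoneme representations free from contextual influence; this motivated CUPE.
Contribution: CUPE, a lightweight model that processes short windows independently with fewer parameters and learns fundamental acoustic patterns common to all languages, achieving competitive cross-lingual performance.
Method: CUPE processes short, fixed-width windows independently and is evaluated through supervised and self-supervised training on diverse languages, including the UCLA Phonetic Corpus.
Result: CUPE generalizes strongly across languages, showing that effective universal speech processing is possible by modeling basic acoustic patterns within phoneme-length windows.
Insight: Focusing on basic acoustic patterns rather than complex context makes it possible to design lightweight, efficient cross-lingual phoneme encoders.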
Abstract: Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme’s length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.
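The fixed-width windowing described above can be illustrated with a minimal sketch. The 16 kHz sample rate, the non-overlapping layout, and the function name `split_into_windows` are assumptions made for illustration; the abstract states only that windows are about 120 ms long and processed independently.

```python
def split_into_windows(samples, sample_rate=16000, window_ms=120):
    """Split a 1-D sequence of audio samples into non-overlapping fixed-width windows.

    Trailing samples that do not fill a whole window are dropped.
    sample_rate and the non-overlapping layout are illustrative assumptions.
    """
    window_len = sample_rate * window_ms // 1000  # samples per window
    n_windows = len(samples) // window_len
    return [samples[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


# One second of silence at 16 kHz: each 120 ms window holds 1920 samples.
windows = split_into_windows([0.0] * 16000)
print(len(windows), len(windows[0]))
```

Each window would then be encoded on its own, with no information shared across windows, which is what makes the representation contextless.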
[18] A Survey on Large Language Model Benchmarks
Shiwen Ni,Guhong Chen,Shuaimin Li,Xuanang Chen,Siyi Li,Bingli Wang,Qiyao Wang,Xingjian Wang,Yifan Zhang,Liyang Fan,Chengming Li,Ruifeng Xu,Le Sun,Min Yang
Main category: cs.CL
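TL;DR: The paper systematically surveys the current state and development of LLM evaluation benchmarks, categorizes 283 representative benchmarks, and identifies key problems in current benchmarks along with a design paradigm for future ones.
Details
Motivation: As LLM capabilities expand rapidly, benchmarks serve as the quantitative tool for measuring model performance and guiding technical development. Current benchmarks suffer from data contamination, cultural bias, and related problems, so a systematic review is needed.
Contribution: The first systematic survey of LLM benchmarks, dividing them into general-capability, domain-specific, and target-specific categories, and identifying salient problems in current benchmarks together with design guidance for future ones.
Method: A literature survey that categorizes 283 benchmarks, analyzes their coverage and shortcomings, and proposes directions for improvement.
Result: The survey characterizes the three benchmark categories and their problems: general-capability benchmarks cover core linguistics, knowledge, and reasoning, while target-specific benchmarks address risks, reliability, agents, and similar concerns.
Insight: Benchmark design should avoid data contamination and cultural bias. Future benchmarks should strengthen evaluation of process credibility and dynamic environments to reflect model capabilities more fully.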
Abstract: In recent years, with the rapid development of the depth and breadth of large language models’ capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.
[19] Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation
Yichi Zhang,Yao Huang,Yifan Wang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu
Main category: cs.CL
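TL;DR: The paper proposes MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating trustworthiness issues in multimodal large language models (MLLMs). It defines a three-dimensional framework covering five trustworthiness aspects, two novel risk types, and various mitigation strategies. Experiments reveal vulnerabilities in current models and limitations of existing mitigation methods, and the paper introduces a Reasoning-Enhanced Safety Alignment (RESA) approach.
Details
Motivation: MLLMs have made significant progress in capability, but their trustworthiness remains a concern. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook the risks introduced by multimodality.
Contribution: 1. The MultiTrust-X benchmark, covering five trustworthiness aspects, two novel risk types, and various mitigation strategies; 2. experiments that reveal vulnerabilities in current models and limitations of mitigation methods; 3. the RESA approach, which better balances safety and performance.
Method: Based on the three-dimensional framework (five trustworthiness aspects, two risk types, various mitigation strategies), the benchmark comprises 32 tasks and 28 datasets, evaluates more than 30 open-source and proprietary MLLMs, and analyzes 8 representative mitigation methods. The proposed RESA approach uses chain-of-thought reasoning to discover underlying risks.
Result: Current models show significant vulnerabilities, there is a gap between trustworthiness and general capabilities, and multimodal training and inference amplify the potential risks of base LLMs. Existing mitigation methods improve specific aspects, but few address overall trustworthiness and many introduce new trade-offs. RESA achieves state-of-the-art results.
Insight: Multimodality brings distinctive risks such as cross-modal impacts; trustworthiness requires comprehensive evaluation and mitigation; chain-of-thought reasoning helps balance safety and performance.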
Abstract: The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies that, while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.
[20] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Cheng Wang,Gelei Deng,Xianglin Yang,Han Qiu,Tianwei Zhang
Main category: cs.CL
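TL;DR: The paper introduces the MCR-BENCH benchmark, reveals that large audio-language models (LALMs) are biased toward text when audio and text conflict, and explores mitigation strategies.
Details
Motivation: To evaluate how LALMs behave when multimodal inputs conflict. The study finds a bias toward textual input that degrades performance on audio tasks.
Contribution: 1) MCR-BENCH, the first benchmark for evaluating LALM behavior on conflicting audio-text pairs; 2) evidence of text bias in LALMs; 3) an analysis of influencing factors and an exploration of mitigation strategies.
Method: LALMs are evaluated extensively on MCR-BENCH, their behavior under conflict is analyzed, and supervised fine-tuning is used to explore strategies for mitigating text bias.
Result: LALMs lean heavily toward text when inputs conflict, which degrades performance on audio-centric tasks, and the models remain overconfident even with contradictory inputs.
Insight: Better modality balance and more sophisticated fusion mechanisms are needed to make LALMs robust to multimodal conflicts.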
Abstract: Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors that influence text bias, explore mitigation strategies through supervised finetuning, and analyze model confidence patterns, which reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
[21] LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
Yirong Sun,Yizhong Geng,Peidong Wei,Yanjun Chen,Jinghan Yang,Rongfei Chen,Wei Zhang,Xiaoyu Shen
Main category: cs.CL
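TL;DR: LLaSO is an open-source framework that addresses the lack of transparency around data and configurations in large speech-language models (LSLMs). It provides an alignment corpus, an instruction-tuning dataset, and a standardized evaluation benchmark, and releases a 3.8B-parameter reference model.
Details
Motivation: The LSLM field lacks transparency and reproducibility: model weights are often released without the corresponding training data and configurations, which hinders systematic comparison and progress.
Contribution: LLaSO, the first fully open, end-to-end LSLM framework, with three core resources: an alignment corpus (LLaSO-Align), a multi-task instruction-tuning dataset (LLaSO-Instruct), and a standardized evaluation benchmark (LLaSO-Eval).
Method: The authors build and publicly release a 12M-instance alignment corpus, a 13.5M-instance multi-task instruction dataset, and LLaSO-Base, a 3.8B-parameter reference model trained exclusively on public data.
Result: LLaSO-Base achieves a normalized score of 0.72, surpassing comparable models, though generalization in pure audio scenarios still has room to improve.
Insight: Broader training coverage improves performance, but significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios.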
Abstract: The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.
[22] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding,Wen Sun,Dailin Li,Wei Zou,Jiaming Wang,Jiajun Chen,Shujian Huang
Main category: cs.CL
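TL;DR: SDGO is a reinforcement learning framework based on self-discrimination-guided optimization. It uses the LLM's own discrimination ability as the reward signal to iteratively improve generation safety, without additional annotated data or external models.
Details
Motivation: LLMs perform better as discriminators (identifying harmful requests) than as generators (defending against harmful content generation). This inconsistency motivates aligning the model's discrimination and generation capabilities.
Contribution: The SDGO framework, which uses the LLM's discrimination ability as a reward signal and optimizes generation safety through reinforcement learning, yielding robustness to out-of-distribution attacks.
Method: A reinforcement learning framework that takes the model's own discrimination ability as the reward signal and iteratively optimizes generation safety. No additional annotated data or external models are introduced during training.
Result: SDGO significantly improves model safety over prompt-based and training-based baselines while maintaining helpfulness on general benchmarks.
Insight: Aligning discrimination and generation strengthens robustness to out-of-distribution attacks, and generation can be further improved with only a small number of discriminative samples.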
Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model’s inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model’s own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs’ discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model’s generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
[23] Position Bias Mitigates Position Bias: Mitigate Position Bias Through Inter-Position Knowledge Distillation
Yifei Wang,Feng Xiong,Yong Wang,Linjing Li,Xiangxiang Chu,Daniel Dajun Zeng
Main category: cs.CL
TL;DR: Pos2Distill is a novel framework that mitigates positional bias (PB) in long-context tasks by transferring knowledge from advantageous positions to less favorable ones, resulting in improved uniformity and performance.
Details
Motivation: Positional bias in long-context tasks hinders comprehension and processing. Existing methods fail to fully address this issue, prompting the need for a more effective solution.
Contribution: The paper introduces Pos2Distill, a knowledge distillation framework that reduces PB by leveraging inter-position knowledge transfer. It also presents two specialized variants for retrieval and reasoning tasks.
Method: Pos2Distill employs knowledge distillation to transfer capabilities between positions. Two instantiations, Pos2Distill-R1 and Pos2Distill-R2, are designed for retrieval and reasoning tasks, respectively.
Result: The approach achieves significant performance gains and uniformity across contextual positions in long-context tasks. Both variants demonstrate cross-task generalization and superior task-specific performance.
Insight: Leveraging position-induced disparity to counteract PB is an effective strategy, and knowledge distillation can enhance uniformity in long-context tasks.
Abstract: Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. While prior work seeks to mitigate PB through modifying the architectures causing its emergence, significant PB still persists. To address PB effectively, we introduce Pos2Distill, a position-to-position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the huge performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract the PB itself. We identify distinct manifestations of PB under retrieval and reasoning paradigms, thereby designing two specialized instantiations: Pos2Distill-R1 and Pos2Distill-R2 respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong cross-task generalization mutually, while achieving superior performance on their respective tasks.
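To make the position-to-position idea concrete, here is a minimal sketch of a distillation loss. It assumes the loss is a KL divergence between the model's output distribution when the key content sits at an advantageous position (teacher) and when it sits at a less favorable position (student). The KL form and all function names are illustrative assumptions; the abstract does not specify the actual objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def position_distillation_loss(teacher_logits, student_logits):
    """Hypothetical loss: pull the unfavorable-position output distribution
    (student) toward the advantageous-position one (teacher)."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


# Identical outputs give zero loss; diverging outputs give a positive loss.
print(position_distillation_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0]))
print(position_distillation_loss([2.0, 0.0, 0.0], [0.0, 0.0, 2.0]))
```

In this framing the teacher and student are the same model under two placements of the same content, so minimizing the loss pushes performance toward uniformity across positions.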
[24] EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models
Xinyi Ling,Hanwen Du,Zhihui Zhu,Xia Ning
Main category: cs.CL
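TL;DR: The paper introduces the EcomMMMU dataset, studies the role of multimodal data in e-commerce tasks, and proposes SUMEI to make better use of visual content.
Details
Motivation: Do multimodal data (such as images) on e-commerce platforms always improve product understanding, or can they introduce redundancy or degrade performance? Existing datasets are limited in scale, which makes the question hard to study systematically.
Contribution: 1) EcomMMMU, with 406,190 samples and 8,989,510 images, supporting multi-task evaluation; 2) the finding that visual content does not always improve performance; 3) SUMEI, a method that uses visual content strategically.
Method: 1) Build the EcomMMMU dataset; 2) analyze the effect of visual content on the tasks; 3) design SUMEI, which uses images selectively by predicting their visual utility.
Result: Experiments demonstrate the effectiveness and robustness of SUMEI on e-commerce tasks.
Insight: Multimodal models may struggle to exploit rich visual content effectively, so strategies are needed to use it well.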
Abstract: E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://anonymous.4open.science/r/submission25.
[25] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Qiaoyu Zheng,Yuze Sun,Chaoyi Wu,Weike Zhao,Pengcheng Qiu,Yongguo Yu,Kun Sun,Yanfeng Wang,Ya Zhang,Weidi Xie
Main category: cs.CL
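TL;DR: The paper proposes Deep-DxSearch, an end-to-end agentic RAG system trained with reinforcement learning to address knowledge gaps and hallucinations in medical diagnosis, substantially improving diagnostic accuracy and reasoning traceability.
Details
Motivation: Medical large language models suffer from knowledge gaps and hallucinations in diagnosis, and conventional retrieval-augmented methods make weak use of external knowledge and offer poor feedback-reasoning traceability.
Contribution: 1. A large-scale medical retrieval corpus; 2. an end-to-end training framework for an agentic RAG system; 3. reinforcement learning that improves the accuracy and traceability of diagnostic reasoning.
Method: 1. Build a medical retrieval corpus; 2. treat the LLM as the core agent and the corpus as its environment; 3. design rewards on format, retrieval, reasoning structure, and diagnostic accuracy; 4. train end to end with reinforcement learning.
Result: Deep-DxSearch outperforms prompt-engineering and training-free RAG approaches across multiple data centers and surpasses baselines such as GPT-4o and DeepSeek-R1 in diagnosing both common and rare diseases.
Insight: Ablation studies confirm the critical roles of reward design and the retrieval corpus, underscoring the effectiveness of the approach.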
Abstract: Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that steers traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch’s diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
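The abstract names four reward components (format, retrieval, reasoning structure, diagnostic accuracy) without giving their form. The following is a minimal sketch of how such components might be combined, assuming a weighted sum gated on output format. The weights, the gating rule, and every name below are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical component weights; the paper's values are not given here."""
    fmt: float = 0.1
    retrieval: float = 0.2
    reasoning: float = 0.2
    diagnosis: float = 0.5


DEFAULT_WEIGHTS = RewardWeights()


def composite_reward(format_ok, retrieval_score, reasoning_score,
                     diagnosis_correct, w=DEFAULT_WEIGHTS):
    """Combine the four components into one scalar reward.

    Assumption: a malformed rollout earns nothing, so the policy must first
    learn the expected output structure. Scores are expected in [0, 1].
    """
    if not format_ok:
        return 0.0
    return (w.fmt
            + w.retrieval * retrieval_score
            + w.reasoning * reasoning_score
            + w.diagnosis * (1.0 if diagnosis_correct else 0.0))


# A well-formed rollout with partly relevant retrieval and a correct diagnosis.
print(composite_reward(True, 0.5, 1.0, True))
```

Gating on format is one common design choice in RL for tool-using agents because it keeps unparseable rollouts from collecting partial credit; whether Deep-DxSearch does this is not stated in the abstract.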
[26] Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
Yufeng Zhao,Junnan Liu,Hongwei Liu,Dongsheng Zhu,Yuan Shen,Songyang Zhang,Kai Chen
Main category: cs.CL
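TL;DR: The paper studies Tool-Integrated Reasoning (TIR) and, using the ReasonZoo benchmark and two new efficiency metrics (PAC and AUC-PCC), shows that TIR improves both the reasoning ability and the reasoning efficiency of large language models (LLMs).
Details
Motivation: LLMs fall short on reasoning tasks that require precise computation. The paper investigates whether TIR generalizes in improving reasoning ability and whether it improves the model's reasoning behavior.
Contribution: 1. The ReasonZoo benchmark, covering nine reasoning categories. 2. Two new metrics (PAC and AUC-PCC) for assessing reasoning efficiency. 3. Empirical evidence that TIR improves LLM reasoning ability and efficiency.
Method: 1. Build the ReasonZoo benchmark to test TIR across diverse reasoning tasks. 2. Design the PAC and AUC-PCC metrics to quantify reasoning efficiency. 3. Compare the performance of TIR and non-TIR models.
Result: TIR models outperform non-TIR models on both mathematical and non-mathematical tasks, and their reasoning efficiency improves as shown by better PAC and AUC-PCC.
Insight: TIR not only improves LLM reasoning ability but also reduces overthinking and streamlines reasoning, indicating domain-general benefits.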
Abstract: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model’s reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.
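The digest does not give the definitions of PAC or AUC-PCC. As a rough illustration of what an area-under-a-performance-cost-curve quantity looks like, the sketch below integrates performance over cost with the trapezoidal rule. The choice of trapezoidal integration, the (cost, performance) input format, and the function name are assumptions; the paper's actual metric may be normalized or defined differently.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (cost, performance) pairs.

    Points are sorted by cost first. This is a generic area computation,
    not the paper's AUC-PCC definition.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A model that reaches high accuracy at low token cost covers more area.
frugal = [(0, 0.0), (1, 0.8), (2, 0.8)]
wasteful = [(0, 0.0), (1, 0.2), (2, 0.8)]
print(area_under_curve(frugal), area_under_curve(wasteful))
```

Under this reading, two models with the same final accuracy can differ in efficiency: the one that gets there with less cost scores higher.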
[27] LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Ming Yin,Dinghan Shen,Silei Xu,Jianbing Han,Sixun Dong,Mian Zhang,Yebowen Hu,Shujian Liu,Simin Ma,Song Wang,Sathish Reddy Indurthi,Xun Wang,Yiran Chen,Kaiqiang Song
Main category: cs.CL
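TL;DR: LiveMCP-101 is a benchmark of 101 carefully curated real-world queries for evaluating how well AI agents use multiple MCP tools in multi-step tasks. An evaluation approach based on ground-truth execution plans exposes the tool-orchestration difficulties of current models.
Details
Motivation: There is a gap in benchmarking how well AI agents complete tasks that require integrating multiple tools, especially in dynamic environments. LiveMCP-101 fills this gap and aims to advance autonomous AI systems.
Contribution: The LiveMCP-101 benchmark and an evaluation approach based on ground-truth execution plans, which together reveal tool-orchestration problems and failure modes in current models and point to directions for improvement.
Method: The 101 queries are refined through LLM rewriting and manual review and require the use of multiple MCP tools. Evaluation relies on ground-truth execution plans rather than raw API outputs.
Result: Even frontier LLMs achieve a task success rate below 60%, exposing the challenges of tool orchestration.
Insight: Current models fall clearly short in multi-tool coordination; further work is needed on execution efficiency and error handling.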
Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
cs.CV [Back]
[28] Heatmap Regression without Soft-Argmax for Facial Landmark Detection
Chiao-An Yang,Raymond A. Yeh
Main category: cs.CV
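TL;DR: The paper proposes a facial landmark detection method that does not rely on Soft-argmax, using a structured prediction framework to achieve faster training and competitive performance.
Details
Motivation: Heatmap regression methods for facial landmark detection conventionally rely on Soft-argmax as a differentiable approximation of argmax. The authors question whether it is necessary and explore an alternative that improves training efficiency and performance.
Contribution: A new training objective based on the structured prediction framework that requires no Soft-argmax, reaches state-of-the-art performance on several benchmark datasets, and converges 2.2x faster during training.
Method: The structured prediction framework replaces Soft-argmax, and the heatmap regression objective is optimized directly for end-to-end training.
Result: Competitive or better accuracy on the WFLW, COFW, and 300W datasets, with markedly faster training convergence.
Insight: Effective facial landmark detection is possible without Soft-argmax; the structured prediction framework is a viable alternative.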
Abstract: Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. Our code is available here: https://github.com/ca-joe-yang/regression-without-softarg.
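For readers unfamiliar with the baseline being revisited, the sketch below contrasts hard argmax with Soft-argmax on a small heatmap. It shows only the standard Soft-argmax (the expected coordinate under a softmax over the heatmap), not the paper's structured-prediction objective; the temperature parameter `beta` and the function names are illustrative.

```python
import math


def hard_argmax_2d(heatmap):
    """(row, col) of the largest heatmap value; not differentiable."""
    coords = [(r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0]))]
    return max(coords, key=lambda rc: heatmap[rc[0]][rc[1]])


def soft_argmax_2d(heatmap, beta=1.0):
    """Expected (row, col) under softmax(beta * heatmap); differentiable."""
    peak = max(v for row in heatmap for v in row)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    exp_row = sum(r * w for r, row in enumerate(weights) for w in row) / total
    exp_col = sum(c * w for row in weights for c, w in enumerate(row)) / total
    return exp_row, exp_col


heatmap = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 10.0],
           [0.0, 0.0, 0.0]]
print(hard_argmax_2d(heatmap))
print(soft_argmax_2d(heatmap, beta=10.0))
```

With a sharp peak and a large `beta`, the soft estimate approaches the hard argmax; with a flat heatmap it collapses toward the center of the grid, which is one reason the approximation can bias predictions.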
[29] Fast Graph Neural Network for Image Classification
Mustafa Mohammadi Gharasuie,Luis Rueda
Main category: cs.CV
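TL;DR: The paper proposes a new image classification approach that combines graph convolutional networks (GCNs) with Voronoi diagrams, representing images as graphs and refining the relationships in the data to improve classification efficiency and accuracy.
Details
Motivation: Conventional convolutional neural networks (CNNs) are limited on complex data structures and scenes, whereas GCNs excel at modeling relational data. The paper aims to improve image classification by combining GCNs with Voronoi diagrams.
Contribution: A novel model combining GCNs and Voronoi diagrams that refines the graph representation of images and significantly improves classification efficiency and accuracy.
Method: Images are represented as graphs with pixels or regions as vertices; the graphs are refined with Delaunay triangulation and then classified with a GCN.
Result: The method outperforms existing techniques on several benchmark datasets, particularly in complex scenes and fine-grained classification tasks.
Insight: Combining graph structure with Voronoi diagrams offers a new perspective on image classification and broadens the potential of graph learning in computer vision and unstructured data analysis.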
Abstract: The rapid progress in image classification has been largely driven by the adoption of Graph Convolutional Networks (GCNs), which offer a robust framework for handling complex data structures. This study introduces a novel approach that integrates GCNs with Voronoi diagrams to enhance image classification by leveraging their ability to effectively model relational data. Unlike conventional convolutional neural networks (CNNs), our method represents images as graphs, where pixels or regions function as vertices. These graphs are then refined using corresponding Delaunay triangulations, optimizing their representation. The proposed model achieves significant improvements in both preprocessing efficiency and classification accuracy across various benchmark datasets, surpassing state-of-the-art approaches, particularly in challenging scenarios involving intricate scenes and fine-grained categories. Experimental results, validated through cross-validation, underscore the effectiveness of combining GCNs with Voronoi diagrams for advancing image classification. This research not only presents a novel perspective on image classification but also expands the potential applications of graph-based learning paradigms in computer vision and unstructured data analysis.
[30] GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging
Toqi Tahamid Sarker,Mohamed Embaby,Taminul Islam,Amer AbuGhazaleh,Khaled R Ahmed
Main category: cs.CV
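TL;DR: GasTwinFormer is a hybrid vision transformer for real-time livestock methane emission segmentation and dietary classification, handling both tasks efficiently through a novel Mix Twin encoder and a new dataset.
Details
Motivation: Livestock methane emissions account for 32% of human-caused methane emissions, so real-time monitoring is critical for climate mitigation strategies.
Contribution: 1. GasTwinFormer, which combines global and local attention for methane segmentation and dietary classification; 2. the first beef cattle methane emission dataset captured with optical gas imaging (OGI); 3. a lightweight LR-ASPP decoder for efficient multi-scale feature aggregation.
Method: A Mix Twin encoder alternates global and local attention and is paired with an LR-ASPP decoder, so that a single framework handles both segmentation and classification.
Result: 74.47% mIoU and 83.63% mF1 for methane segmentation and 100% dietary classification accuracy, with only 3.348M parameters and 114.9 FPS inference speed.
Insight: The correlation between diet and methane emissions can be exploited for classification, and the hybrid attention mechanism performs well in the multi-task setting.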
Abstract: Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at gastwinformer.github.io.
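The segmentation scores above are reported as mIoU and mF1. The sketch below computes both from flat label lists using the standard per-class definitions, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN), averaged over classes. Skipping classes absent from both prediction and target is an assumption; the paper's exact averaging protocol is not described in this digest.

```python
def mean_iou_and_f1(pred, target, num_classes):
    """Return (mIoU, mF1) over classes for two equal-length label sequences."""
    ious, f1s = [], []
    for k in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        fp = sum(1 for p, t in zip(pred, target) if p == k and t != k)
        fn = sum(1 for p, t in zip(pred, target) if p != k and t == k)
        if tp + fp + fn == 0:
            continue  # class absent from both prediction and target
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return sum(ious) / len(ious), sum(f1s) / len(f1s)


# Four pixels, two classes (0 = background, 1 = gas plume).
miou, mf1 = mean_iou_and_f1(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(round(miou, 4), round(mf1, 4))
```

Per class, F1 is never lower than IoU, which is consistent with the reported mF1 (83.63%) exceeding the reported mIoU (74.47%).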
[31] XDR-LVLM: An Explainable Vision-Language Large Model for Diabetic Retinopathy Diagnosis
Masato Ito,Kaito Tanaka,Keisuke Matsuda,Aya Nakayama
Main category: cs.CV
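TL;DR: XDR-LVLM is an explainable vision-language large model for diabetic retinopathy diagnosis. By combining vision with natural language generation, it delivers high-precision diagnosis with transparent explanations, bridging the gap between automated diagnosis and clinical needs.
Details
Motivation: Diabetic retinopathy is a major cause of global blindness. Deep learning models show promise in diagnosis, but their black-box nature limits transparency in clinical use. XDR-LVLM aims to improve the trustworthiness and clinical utility of diagnosis through an explainable vision-language model.
Contribution: The XDR-LVLM framework, which combines a medical vision encoder with a language model to produce high-precision diagnoses together with natural-language explanations. It also introduces Multi-task Prompt Engineering and Multi-stage Fine-tuning.
Method: A Medical Vision Encoder extracts features from fundus images and the LVLM Core generates diagnostic reports; the model is optimized with Multi-task Prompt Engineering and Multi-stage Fine-tuning.
Result: On the DDR dataset, XDR-LVLM reaches a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, with strong concept detection results (77.95% BACC, 66.88% F1). Human evaluations confirm the fluency and clinical value of the generated explanations.
Insight: By combining vision and language, XDR-LVLM improves diagnostic performance and gives clinicians transparent, interpretable reports, moving automated diagnosis closer to practical use.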
Abstract: Diabetic Retinopathy (DR) is a major cause of global blindness, necessitating early and accurate diagnosis. While deep learning models have shown promise in DR detection, their black-box nature often hinders clinical adoption due to a lack of transparency and interpretability. To address this, we propose XDR-LVLM (eXplainable Diabetic Retinopathy Diagnosis with LVLM), a novel framework that leverages Vision-Language Large Models (LVLMs) for high-precision DR diagnosis coupled with natural language-based explanations. XDR-LVLM integrates a specialized Medical Vision Encoder, an LVLM Core, and employs Multi-task Prompt Engineering and Multi-stage Fine-tuning to deeply understand pathological features within fundus images and generate comprehensive diagnostic reports. These reports explicitly include DR severity grading, identification of key pathological concepts (e.g., hemorrhages, exudates, microaneurysms), and detailed explanations linking observed features to the diagnosis. Extensive experiments on the Diabetic Retinopathy (DDR) dataset demonstrate that XDR-LVLM achieves state-of-the-art performance, with a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, and superior results for concept detection (77.95% BACC, 66.88% F1). Furthermore, human evaluations confirm the high fluency, accuracy, and clinical utility of the generated explanations, showcasing XDR-LVLM’s ability to bridge the gap between automated diagnosis and clinical needs by providing robust and interpretable insights.
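Balanced Accuracy (BACC) is reported above because severity grades are typically imbalanced. A minimal sketch of the standard definition, the unweighted mean of per-class recall, is shown below; the toy labels are invented for illustration and do not come from the DDR dataset.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for k in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == k)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority grade looks good on plain
# accuracy (4 of 5 correct) but scores only 0.5 on balanced accuracy.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain, balanced_accuracy(y_true, y_pred))
```

This is why a balanced metric is the more informative headline number when rare but clinically important grades must not be missed.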
[32] MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion
Xuyang Chen,Zhijun Zhai,Kaixuan Zhou,Zengmao Wang,Jianan He,Dong Wang,Yanfeng Zhang,Mingwei Sun,Rüdiger Westermann,Konrad Schindler,Liqiu Meng
Main category: cs.CV
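TL;DR: The paper proposes MeSS, which uses city meshes as a geometric prior and an enhanced image diffusion model to generate high-quality, style-consistent outdoor scenes, and reconstructs the scene with 3D Gaussian Splatting.
Details
Motivation: City mesh models lack realistic textures, which limits their use in virtual urban navigation and autonomous driving. Directly using image or video diffusion models to generate 3D scenes runs into consistency and geometric alignment problems.
Contribution: 1. MeSS, which builds on image diffusion models and improves their cross-view consistency; 2. staged generation of geometrically consistent scenes; 3. scene reconstruction with 3D Gaussian Splatting to support diverse rendering styles.
Method: 1. Cascaded Outpainting ControlNets generate geometrically consistent sparse views; 2. the AGInpaint module generates denser intermediate views; 3. the GCAlign module eliminates visual inconsistencies; 4. a 3D Gaussian Splatting scene is reconstructed with Gaussians initialized on the mesh.
Result: MeSS outperforms existing approaches in geometric alignment and generation quality, and the generated scenes can be rendered in diverse styles through relighting and style transfer.
Insight: Combining a geometric prior with an enhanced diffusion model is a workable way to improve the consistency and quality of 3D scene generation.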
Abstract: Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
[33] SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis
Jiahao Xu,Changchang Yin,Odysseas Chatzipanagiotou,Diamantis Tsilimigras,Kevin Clear,Bingsheng Yao,Dakuo Wang,Timothy Pawlik,Ping Zhang
Main category: cs.CV
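TL;DR: This paper presents SurgWound-Bench, the first open-source dataset and benchmark for surgical wound diagnosis, designed to evaluate model performance comprehensively, and proposes WoundQwen, a three-stage learning framework.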
Details
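Motivation: Surgical site infection (SSI) is a common healthcare-associated infection, but the lack of public surgical wound datasets and benchmarks has held back deep learning in this area. Contribution: 1. Presents SurgWound, the first open-source surgical wound dataset, with 697 images and 8 clinical attributes; 2. Establishes the first benchmark for surgical wound diagnosis; 3. Proposes the three-stage learning framework WoundQwen.
Method: Five independent MLLMs predict wound characteristics; their outputs are fed to two MLLMs for diagnosis; a final MLLM integrates the results into a comprehensive report.
Result: The proposed three-stage framework can analyze wound characteristics and provide personalized care recommendations.
Insight: The open-source dataset and benchmark advance surgical wound diagnosis and make personalized care and timely intervention possible.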
Abstract: Surgical site infection (SSI) is one of the most common and costly healthcare-associated infections, and surgical wound care remains a significant clinical challenge in preventing SSIs and improving patient outcomes. While recent studies have explored the use of deep learning for preliminary surgical wound screening, progress has been hindered by concerns over data privacy and the high costs associated with expert annotation. Currently, no publicly available dataset or benchmark encompasses various types of surgical wounds, resulting in the absence of an open-source surgical wound screening tool. To address this gap: (1) we present SurgWound, the first open-source dataset featuring a diverse array of surgical wound types. It contains 697 surgical wound images annotated by 3 professional surgeons with eight fine-grained clinical attributes. (2) Based on SurgWound, we introduce the first benchmark for surgical wound diagnosis, which includes visual question answering (VQA) and report generation tasks to comprehensively evaluate model performance. (3) Furthermore, we propose a three-stage learning framework, WoundQwen, for surgical wound diagnosis. In the first stage, we employ five independent MLLMs to accurately predict specific surgical wound characteristics. In the second stage, these predictions serve as additional knowledge inputs to two MLLMs responsible for diagnosing outcomes, which assess infection risk and guide subsequent interventions. In the third stage, we train an MLLM that integrates the diagnostic results from the previous two stages to produce a comprehensive report. This three-stage framework can analyze detailed surgical wound characteristics and provide subsequent instructions to patients based on surgical images, paving the way for personalized wound care, timely intervention, and improved patient outcomes.
[34] Adversarial Agent Behavior Learning in Autonomous Driving Using Deep Reinforcement Learning
Arjun Srinivasan,Anubhav Paras,Aniket Bera
Main category: cs.CV
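TL;DR: This paper proposes a deep reinforcement learning method for learning adversarial behavior in autonomous driving, used to test the robustness of rule-based agents, and shows how the adversarial agent lowers the rule-based agents' cumulative reward.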
Details
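Motivation: In safety-critical applications such as autonomous driving, modeling the behavior of rule-based agents is crucial. Existing work typically uses rule-based methods but does not simulate adversarial behavior, so agent robustness cannot be tested adequately. Contribution: Proposes a deep-reinforcement-learning-based method for learning adversarial behavior that generates failure scenarios against rule-based agents.
Method: Uses a deep reinforcement learning framework to train an adversarial agent that maximizes the rule-based agents' probability of failure, and evaluates its effect through cumulative reward.
Result: Experiments show that the adversarial agent significantly reduces the rule-based agents' cumulative reward, confirming its effectiveness against them and the potential threat it poses.
Insight: The method offers a new perspective on safety testing for autonomous driving systems and underlines the importance of modeling adversarial behavior.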
Abstract: Existing approaches in reinforcement learning train an agent to learn desired optimal behavior in an environment with rule based surrounding agents. In safety critical applications such as autonomous driving it is crucial that the rule based agents are modelled properly. Several behavior modelling strategies and IDM models are used currently to model the surrounding agents. We present a learning based method to derive the adversarial behavior for the rule based agents to cause failure scenarios. We evaluate our adversarial agent against all the rule based agents and show the decrease in cumulative reward.
[35] DyMorph-B2I: Dynamic and Morphology-Guided Binary-to-Instance Segmentation for Renal Pathology
Leiyue Zhao,Yuechen Yang,Yanfan Zhu,Haichun Yang,Yuankai Huo,Paul D. Simonson,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Main category: cs.CV
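TL;DR: The paper proposes DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation method designed for renal pathology, which integrates watershed, skeletonization, and morphological operations in a unified framework to segment complex structures robustly.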
Details
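Motivation: Existing renal pathology datasets and automated methods usually provide only binary (semantic) segmentation masks, which limits the precision of instance-level segmentation, and classical post-processing techniques (watershed, morphological operations, and so on) perform poorly on the diverse and complex morphology of renal tissue. Contribution: Proposes DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline that significantly improves instance segmentation accuracy through adaptive geometric refinement and customizable hyperparameter optimization.
Method: Integrates classical methods such as watershed, skeletonization, and morphological operations; refines the segmentation for each class of functional unit with an adaptive geometric refinement module; and combines this with systematic parameter tuning.
Result: Experiments show that DyMorph-B2I outperforms individual classical methods and their naive combinations, separates adherent and heterogeneous structures better, and supports more accurate morphometric analysis.
Insight: Combining dynamic parameter optimization with morphological guidance effectively addresses instance segmentation of complex structures and gives renal pathology an efficient, scalable solution.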
Abstract: Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naive combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph-B2I.
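To make the binary-to-instance problem concrete, the sketch below shows why plain connected-component labelling merges adherent structures and how a single morphological erosion can split them into separate seeds. It is a minimal illustration in pure Python, not the DyMorph-B2I pipeline: the function names, the 3x3 cross structuring element, and the toy mask are all assumptions made for this example, and the seed-growing step that would restore the original object extents is omitted.

```python
from collections import deque


def label_components(mask):
    """4-connected component labelling; returns (label grid, component count)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def erode(mask):
    """Binary erosion with a 3x3 cross; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                out[r][c] = int(all(0 <= y < h and 0 <= x < w and mask[y][x] for y, x in neighbours))
    return out


# Toy mask (hypothetical): two 3x3 blobs joined by a one-pixel bridge.
demo_mask = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

_, merged_count = label_components(demo_mask)        # bridge keeps the blobs merged
_, seed_count = label_components(erode(demo_mask))   # erosion removes the bridge
```

On the toy mask, labelling alone finds one object, while labelling the eroded mask finds two seeds. How much to erode is exactly the kind of per-class hyperparameter the abstract says has to be tuned.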
[36] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation
Vipooshan Vipulananthan,Kumudu Mohottala,Kavindu Chinthana,Nimsara Paramulla,Charith D Chitraranjan
Main category: cs.CV
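TL;DR: This paper proposes STAGNet, a framework based on a spatio-temporal graph and an LSTM, for anticipating potential traffic accidents from dash-cam video. The method outperforms prior work on three public datasets.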
Details
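Motivation: Existing ADAS systems mostly rely on multiple sensors such as LiDAR or radar, whereas using dash-cam video alone is more challenging but cheaper and easier to deploy. Contribution: Proposes the STAGNet model, which improves the spatio-temporal features and aggregates them with a recurrent network, significantly improving accident prediction performance.
Method: Extracts features with a spatio-temporal graph structure and models the temporal sequence with an LSTM network to predict accidents efficiently.
Result: Experiments on multiple datasets show that STAGNet outperforms existing methods in average precision and mean time-to-collision, and generalizes across datasets.
Insight: Dash-cam video can support efficient accident prediction, and spatio-temporal modeling is essential for improving prediction accuracy.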
Abstract: Accident prediction and timely warnings play a key role in improving road safety by reducing the risk of injury to road users and minimizing property damage. Advanced Driver Assistance Systems (ADAS) are designed to support human drivers and are especially useful when they can anticipate potential accidents before they happen. While many existing systems depend on a range of sensors such as LiDAR, radar, and GPS, relying solely on dash-cam video input is a more challenging but more cost-effective and easily deployable solution. In this work, we incorporate better spatio-temporal features and aggregate them through a recurrent network to improve upon state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets show that our proposed STAGNet model achieves higher average precision and mean time-to-collision values than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.
[37] Collaborative Multi-Modal Coding for High-Quality 3D Generation
Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu
Main category: cs.CV
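TL;DR: This paper proposes TriMM, a multi-modal generative model that achieves high-quality 3D content generation through collaborative multi-modal coding and auxiliary supervision.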
Details
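Motivation: Existing 3D generative models are mostly limited to a single modality or to 3D structures and overlook the complementary strengths of multi-modal data. TriMM aims to make full use of multi-modal data such as RGB, RGBD, and point clouds to improve 3D generation quality. Contribution: 1) Proposes TriMM, the first feed-forward multi-modal 3D generative model; 2) Designs a collaborative multi-modal coding mechanism that preserves the unique strengths of each modality; 3) Introduces auxiliary 2D and 3D supervision to improve robustness and performance; 4) Uses a triplane latent diffusion model to generate high-quality 3D assets.
Method: TriMM integrates features from different modalities through collaborative multi-modal coding and adds auxiliary 2D and 3D supervision to improve performance. Generation uses a triplane latent diffusion model, which further improves texture and geometric detail.
Result: Experiments show that, with a small amount of training data, TriMM's multi-modal learning achieves performance comparable to models trained on large-scale datasets. The experiments also verify that other multi-modal datasets can be used for 3D generation.
Insight: Multi-modal data offer significant complementary advantages in 3D generation, and collaborative coding and auxiliary supervision are key to generation quality. TriMM offers a new approach to multi-modal 3D generation.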
Abstract: 3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.
[38] AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation
Ruipu Wu,Yige Zhang,Jinyu Chen,Linjiang Huang,Shifeng Zhang,Xu Zhou,Liang Wang,Si Liu
Main category: cs.CV
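TL;DR: This paper introduces a new task, DuAl-VLN, in which a high-altitude and a low-altitude UAV collaborate on vision-and-language navigation, addressing the challenges that long trajectories and complex maneuverability pose for conventional UAV-VLN.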
Details
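Motivation: Existing UAV-VLN is unreliable because of long trajectories and complex maneuverability, and it often requires human intervention or overly detailed instructions. Highly mobile UAVs can provide multi-grained perspectives, but the motion space for learning is hard to keep manageable. Contribution: Introduces the DuAl-VLN task, in which a high-altitude and a low-altitude UAV collaborate and each plays to its strengths (the high-altitude UAV handles environmental reasoning, the low-altitude UAV handles precise navigation), and constructs the HaL-13k dataset to support training and evaluation. It also designs AeroDuo, an efficient dual-UAV collaborative framework.
Method: Designs the dual-UAV collaborative framework AeroDuo: the high-altitude UAV uses a multimodal large language model (Pilot-LLM) for target reasoning, and the low-altitude UAV uses a lightweight multi-stage policy for navigation and target grounding. The two exchange only minimal coordinate information for efficiency.
Result: Constructs the HaL-13k dataset with 13,838 collaborative trajectories and evaluates generalization on validation sets with unseen maps and unseen objects. The AeroDuo framework improves navigation performance through collaboration.
Insight: Dual-UAV collaboration, combining high-altitude environmental reasoning with low-altitude precise navigation, can handle the complexity of UAV-VLN while remaining efficient and generalizable.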
Abstract: Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs’ high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of the DuAl-VLN, we construct the HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model’s generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and only exchange minimal coordinate information to ensure efficiency.
[39] Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent
Yixin Gao,Xin Li,Xiaohan Pan,Runsen Feng,Bingchen Li,Yunpeng Qi,Yiting Lu,Zhengxue Cheng,Zhibo Chen,Jörn Ostermann
Main category: cs.CV
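TL;DR: Comp-X is an intelligently interactive image compression framework built on a large language model (LLM) agent. It combines a multi-functional unified coding framework, an interactive coding agent, and a dedicated benchmark to understand user requests efficiently while maintaining compression performance.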
Details
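Motivation: Conventional image codecs offer limited coding modes and rely on manual mode selection, which makes them unfriendly to non-expert users. Comp-X aims to use the reasoning ability of LLMs to provide intelligently interactive image compression that meets diverse needs. Contribution: 1. A multi-functional unified coding framework; 2. An interactive coding agent (based on in-context learning with expert feedback); 3. The IIC-bench benchmark.
Method: 1. Unify coding modes for different objectives and requirements; 2. Teach the LLM agent with an augmented in-context learning method; 3. Systematically design the benchmark.
Result: Comp-X understands user requests efficiently and maintains comparable compression performance, showing promise as a route toward AGI in image compression.
Insight: Using the reasoning ability of LLMs for interactive coding is a novel direction that offers a new approach to making image compression more intelligent and general.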
Abstract: We present Comp-X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of a large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly to non-professional users. To overcome this, we advance the evolution of the image coding paradigm by introducing three key innovations: (i) a multi-functional coding framework, which unifies different coding modes for various objectives and requirements, including human-machine perception, variable coding, and spatial bit allocation, into one framework; (ii) an interactive coding agent, for which we propose an augmented in-context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, select the mode, and use the coding tools; (iii) IIC-bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp-X can understand the coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.
[40] Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images
Jinsol Song,Jiamu Wang,Anh Tien Nguyen,Keunho Byeon,Sangjeong Ahn,Sung Hak Lee,Jin Tae Kwak
Main category: cs.CV
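TL;DR: Ano-NAViLa is a vision-language model augmented with normal and abnormal pathology knowledge for anomaly detection in pathology images. It performs strongly in accuracy and robustness and provides interpretability through image-text associations.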
Details
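Motivation: Anomaly detection in computational pathology faces challenges such as data scarcity, computational constraints, and tissue diversity, and existing methods fall short in pathology. Contribution: Proposes the Ano-NAViLa model, which incorporates normal and abnormal pathology knowledge to improve detection accuracy and robustness and to provide interpretability.
Method: Builds on a pre-trained vision-language model with a lightweight trainable MLP and incorporates pathology knowledge for anomaly detection.
Result: Achieves state-of-the-art performance on two lymph node datasets, outperforming competing models.
Insight: Combining multimodal knowledge (vision and language) can significantly improve both the performance and the interpretability of anomaly detection in pathology images.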
Abstract: Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves the state-of-the-art performance in anomaly detection and localization, outperforming competing models.
[41] RATopo: Improving Lane Topology Reasoning via Redundancy Assignment
Han Li,Shaofei Huang,Longfei Xu,Yulu Gao,Beipeng Mu,Si Liu
Main category: cs.CV
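TL;DR: RATopo improves lane topology reasoning with a redundancy assignment strategy, strengthening the modeling of topological relationships among lanes and between lanes and traffic elements in autonomous driving.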
Details
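Motivation: Existing methods follow a first-detect-then-reason paradigm, and supervising topology on the basis of one-to-one assignment results limits the range of valid supervision, leading to suboptimal topology reasoning. Contribution: Proposes the RATopo strategy, which strengthens topology reasoning through redundant lane predictions and geometry-diverse supervision and supports one-to-many assignment.
Method: Restructures the Transformer decoder by swapping the cross-attention and self-attention layers so that redundant predictions are retained; instantiates multiple parallel cross-attention blocks with independent parameters to increase the diversity of detected lanes.
Result: Experiments on OpenLane-V2 show that RATopo is model-agnostic, integrates seamlessly into existing frameworks, and consistently improves lane-lane and lane-traffic topology performance.
Insight: Redundancy assignment and diverse supervision are key to better lane topology reasoning and overcome the limitations of conventional one-to-one supervision.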
Abstract: Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.
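The difference between one-to-one and one-to-many assignment can be sketched on a small cost matrix. The code below is an illustration only: the abstract does not specify RATopo's assignment rule, so the "nearest ground truth, then keep the k cheapest" rule, the function names, and the cost values are all assumptions, and the brute-force matcher stands in for the Hungarian matching that detection pipelines normally use.

```python
from itertools import permutations


def one_to_one(cost):
    """Minimum-cost matching by brute force: each ground truth gets exactly one prediction.

    cost[p][g] is the matching cost between prediction p and ground truth g.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best = min(
        permutations(range(n_pred), n_gt),
        key=lambda perm: sum(cost[p][g] for g, p in enumerate(perm)),
    )
    return {g: [p] for g, p in enumerate(best)}


def one_to_many(cost, k):
    """Assumed rule: send each prediction to its cheapest ground truth, keep the k cheapest per ground truth."""
    n_pred, n_gt = len(cost), len(cost[0])
    buckets = {g: [] for g in range(n_gt)}
    for p in range(n_pred):
        g = min(range(n_gt), key=lambda j: cost[p][j])
        buckets[g].append(p)
    return {g: sorted(ps, key=lambda p: cost[p][g])[:k] for g, ps in buckets.items()}


# Hypothetical costs for 4 lane predictions against 2 ground-truth lanes.
demo_cost = [
    [0.10, 0.90],
    [0.20, 0.80],
    [0.70, 0.30],
    [0.90, 0.25],
]

single = one_to_one(demo_cost)       # {0: [0], 1: [3]}
redundant = one_to_many(demo_cost, 2)  # {0: [0, 1], 1: [3, 2]}
```

With one-to-one matching, predictions 1 and 2 receive no positive supervision even though they are close to a ground-truth lane; the one-to-many variant keeps them, which is the kind of quantity-rich supervision the abstract describes.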
[42] DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding
Zhu Wang,Homaira Huda Shomee,Sathya N. Ravi,Sourav Medya
Main category: cs.CV
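TL;DR: The paper proposes DesignCLIP, a multimodal learning framework built on CLIP for design patent understanding, which improves patent classification and retrieval through class-aware classification and contrastive learning.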
Details
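Motivation: Design patent images are usually sketches that lack comprehensive visual context and semantic information, which leads to ambiguity in patent analysis. Advances in vision-language models such as CLIP open new opportunities for patent analysis. Contribution: Proposes the DesignCLIP framework, which introduces class-aware classification and contrastive learning tailored to patent data, uses a large-scale dataset of U.S. design patents, and validates its effectiveness on multiple tasks.
Method: Combines generated detailed captions for patent images with multi-view image learning, and trains CLIP models multimodally for patent classification and retrieval.
Result: DesignCLIP outperforms baseline and SOTA models on all tasks in the patent domain, demonstrating the potential of multimodal approaches for patent analysis.
Insight: Multimodal approaches can compensate for the limited semantics of patent images, improve the reliability and accuracy of patent analysis, and offer more diverse sources of inspiration for design innovation.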
Abstract: In the field of design patent analysis, traditional tasks such as patent classification and patent image retrieval heavily depend on the image data. However, patent images – typically consisting of sketches with abstract and structural elements of an invention – often fall short in conveying comprehensive visual context and semantic information. This inadequacy can lead to ambiguities in evaluation during prior art searches. Recent advancements in vision-language models, such as CLIP, offer promising opportunities for more reliable and accurate AI-driven patent analysis. In this work, we leverage CLIP models to develop a unified framework DesignCLIP for design patent applications with a large-scale dataset of U.S. design patents. To address the unique characteristics of patent data, DesignCLIP incorporates class-aware classification and contrastive learning, utilizing generated detailed captions for patent images and multi-views image learning. We validate the effectiveness of DesignCLIP across various downstream tasks, including patent classification and patent retrieval. Additionally, we explore multimodal patent retrieval, which provides the potential to enhance creativity and innovation in design by offering more diverse sources of inspiration. Our experiments show that DesignCLIP consistently outperforms baseline and SOTA models in the patent domain on all tasks. Our findings underscore the promise of multimodal approaches in advancing patent analysis. The codebase is available here: https://anonymous.4open.science/r/PATENTCLIP-4661/README.md.
[43] TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification
Darya Taratynova,Alya Almsouti,Beknur Kalmakhanbet,Numan Saeed,Mohammad Yaqub
Main category: cs.CV
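TL;DR: TPA combines temporal modeling, prompt-aware contrastive learning, and uncertainty quantification for fetal congenital heart defect (CHD) classification, significantly improving classification performance and calibration.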
Details
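Motivation: Existing methods for CHD detection in ultrasound videos neglect temporal information, support only binary classification, and lack prediction calibration; TPA aims to address these issues. Contribution: 1. Proposes the Temporal Prompt Alignment (TPA) framework; 2. Introduces the CVAESM module to quantify uncertainty; 3. Achieves SOTA results on CHD and systolic dysfunction datasets.
Method: 1. Extract features from video frames with an image encoder; 2. Aggregate temporal information with a trainable temporal extractor; 3. Align the video representation with text prompts via a contrastive loss; 4. Modulate embeddings and quantify uncertainty with the CVAESM module.
Result: TPA reaches a macro F1 of 85.40% for CHD diagnosis with substantially lower calibration error; on the EchoNet-Dynamic three-class task, macro F1 improves by 4.73%.
Insight: Combining temporal information with text-prompt alignment effectively improves medical video classification; uncertainty quantification is essential for clinical reliability.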
Abstract: Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging a foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic’s three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%).
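Since the abstract reports its calibration gains as expected calibration error (ECE), a short sketch of the standard equal-width-bin ECE may help in reading those numbers. This is the textbook definition, not code from the paper: the function name, the default of 10 bins, and the left-closed bin edges are choices made for this example, and the adaptive ECE variant mentioned in the abstract (equal-mass bins) is not shown.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction was right.
    Returns the sample-weighted mean of |accuracy - mean confidence| over non-empty bins.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: 90% confident but only 75% accurate, so ECE is about 0.15.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
# 50% confident and 50% accurate: perfectly calibrated on this toy set.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A lower value means the model's stated confidence tracks its actual accuracy more closely, which is the property the CVAESM module is meant to improve.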
[44] BasketLiDAR: The First LiDAR-Camera Multimodal Dataset for Professional Basketball MOT
Ryunosuke Hayashi,Kohei Torimi,Rokuto Nagata,Kazuma Ikeda,Ozora Sako,Taichi Nakamura,Masaki Tani,Yoshimitsu Aoki,Kentaro Yoshioka
Main category: cs.CV
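TL;DR: BasketLiDAR is the first multi-object tracking (MOT) dataset for basketball players that combines LiDAR point clouds with multi-view camera data; the paper also proposes a new MOT framework that improves tracking accuracy while reducing computational cost.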
Details
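Motivation: Conventional camera-based systems are constrained by two-dimensional data and struggle to achieve real-time 3D tracking in highly dynamic scenes such as basketball. Contribution: Creates BasketLiDAR, the first LiDAR-camera multimodal dataset for basketball, and proposes a new MOT framework that fuses LiDAR and camera data.
Method: Uses LiDAR's high-precision 3D spatial information to build a real-time tracking pipeline and a multimodal fusion tracking pipeline.
Result: Experiments show that the method runs in real time and outperforms conventional camera-only methods even under occlusion.
Insight: LiDAR data can significantly improve 3D tracking in complex scenes, and multimodal fusion is an effective direction for addressing MOT challenges.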
Abstract: Real-time 3D trajectory player tracking in sports plays a crucial role in tactical analysis, performance evaluation, and enhancing spectator experience. Traditional systems rely on multi-camera setups, but are constrained by the inherently two-dimensional nature of video data and the need for complex 3D reconstruction processing, making real-time analysis challenging. Basketball, in particular, represents one of the most difficult scenarios in the MOT field, as ten players move rapidly and complexly within a confined court space, with frequent occlusions caused by intense physical contact. To address these challenges, this paper constructs BasketLiDAR, the first multimodal dataset in the sports MOT field that combines LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and proposes a novel MOT framework that simultaneously achieves improved tracking accuracy and reduced computational cost. The BasketLiDAR dataset contains a total of 4,445 frames and 3,105 player IDs, with fully synchronized IDs between three LiDAR sensors and three multi-view cameras. We recorded 5-on-5 and 3-on-3 game data from actual professional basketball players, providing complete 3D positional information and ID annotations for each player. Based on this dataset, we developed a novel MOT algorithm that leverages LiDAR’s high-precision 3D spatial information. The proposed method consists of a real-time tracking pipeline using LiDAR alone and a multimodal tracking pipeline that fuses LiDAR and camera data. Experimental results demonstrate that our approach achieves real-time operation, which was difficult with conventional camera-only methods, while achieving superior tracking performance even under occlusion conditions. The dataset is available upon request at: https://sites.google.com/keio.jp/keio-csg/projects/basket-lidar
[45] First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection
Wutao Liu,YiDan Wang,Pan Gao
Main category: cs.CV
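TL;DR: Proposes RAG-SEG, a training-free method for camouflaged object detection (COD) that works in two stages (RAG and SEG), performs on par with existing methods, and needs few computational resources.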
Details
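Motivation: Conventional COD methods depend on heavy training and large computational resources. Foundation models such as SAM generalize well but need high-quality prompts and fine-tuning, and producing prompts by hand is costly and inefficient. Contribution: 1. Proposes RAG-SEG, a training-free COD paradigm; 2. Detects efficiently in two stages (RAG generates coarse masks, SAM refines them); 3. Performs on par with existing methods on benchmark datasets while remaining computationally efficient.
Method: 1. RAG stage: build a compact retrieval database with unsupervised clustering and quickly generate coarse masks to use as prompts; 2. SEG stage: refine the segmentation with SAM.
Result: Performs on par with or better than state-of-the-art methods on COD benchmark datasets and runs on a personal laptop.
Insight: 1. Decoupling the task into two stages reduces complexity; 2. Combining unsupervised clustering with retrieval generates prompts efficiently; 3. The method shows that training-free, lightweight COD is feasible.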
Abstract: Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements.
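The retrieval stage can be pictured as a nearest-centroid lookup that turns patch features into a coarse foreground mask. The sketch below is one possible reading under stated assumptions, not the authors' implementation: the database layout (a list of cluster centroids, each tagged foreground or background), cosine similarity as the retrieval score, and all names and values are hypothetical, and the SAM2 refinement stage is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_coarse_mask(patch_features, database):
    """Label each patch with the foreground flag of its most similar centroid.

    patch_features: grid (list of rows) of feature vectors, one per image patch.
    database:       list of (centroid_vector, is_foreground) pairs, assumed to come
                    from clustering features of annotated reference images.
    Returns a grid of 0/1 pseudo-labels that could serve as a coarse mask prompt.
    """
    mask = []
    for row in patch_features:
        mask_row = []
        for feat in row:
            _, is_foreground = max(database, key=lambda entry: cosine(feat, entry[0]))
            mask_row.append(1 if is_foreground else 0)
        mask.append(mask_row)
    return mask


# Hypothetical two-centroid database and a 2x2 grid of 2-D patch features.
demo_database = [([1.0, 0.0], True), ([0.0, 1.0], False)]
demo_patches = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.7, 0.3]],
]
coarse_mask = retrieve_coarse_mask(demo_patches, demo_database)  # [[1, 0], [0, 1]]
```

Because the lookup involves no gradient updates, a step like this is consistent with the training-free, laptop-scale operation the abstract emphasizes.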
[46] VideoEraser: Concept Erasure in Text-to-Video Diffusion Models
Naen Xu,Jinghuai Zhang,Changjiang Li,Zhi Chen,Chunyi Zhou,Qingming Li,Tianyu Du,Shouling Ji
Main category: cs.CV
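TL;DR: VideoEraser is a training-free framework that prevents text-to-video (T2V) diffusion models from generating videos with undesirable concepts, even when explicitly prompted with them. It works in two stages (SPEA and ARNG) and outperforms prior methods across multiple tasks.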
Details
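Motivation: The rapid growth of T2V diffusion models has raised privacy, copyright, and safety concerns because they can generate harmful or misleading content. VideoEraser aims to prevent models from generating such content. Contribution: Proposes VideoEraser, a plug-and-play framework that prevents T2V models from generating undesirable content without training. Its two-stage design, SPEA and ARNG, improves robustness and generalizability.
Method: Uses a two-stage process, Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG), that integrates seamlessly into existing T2V diffusion models.
Result: Across four tasks (such as object erasure and celebrity erasure), VideoEraser reduces undesirable content by 46% on average compared to baselines and outperforms prior methods.
Insight: Undesirable concepts can be erased efficiently without retraining, which offers a new approach to the safe use of T2V models.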
Abstract: The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.
[47] Predicting Road Crossing Behaviour using Pose Detection and Sequence Modelling
Subhasis Dasgupta,Preetam Saha,Agniva Roy,Jaydip Sen
Main category: cs.CV
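TL;DR: This paper predicts pedestrians' road-crossing intent with pose detection and sequence modeling. It compares three sequence models (GRU, LSTM, and 1D CNN) and finds that the 1D CNN is fastest and that the GRU predicts intent more accurately than the LSTM.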
Details
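Motivation: As autonomous vehicles become widespread, accurately predicting whether a pedestrian will cross the road becomes a key problem. The paper addresses it with deep learning. Contribution: The main contribution is an end-to-end deep learning framework that combines pose detection with sequence modeling to predict road-crossing intent, together with a comparison of different sequence models.
Method: A pose detection model extracts pedestrian pose data, and three sequence models (GRU, LSTM, and 1D CNN) then make temporal predictions; their prediction behavior and performance are analyzed.
Result: Experiments show that the GRU predicts intent more accurately than the LSTM, while the 1D CNN is the fastest.
Insight: The study highlights the importance of sequence modeling for predicting pedestrian behavior and offers guidance on the speed-accuracy trade-off in practical applications.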
Abstract: The world is steadily moving towards AI-based systems, and autonomous vehicles are now a reality in different parts of the world. These vehicles require sensors and cameras to detect objects and maneuver accordingly. It is also important for such vehicles to predict from a distance whether a person is about to cross a road. The current study focused on predicting pedestrians' intent to cross the road in an experimental setup. The study used deep learning models for pose prediction and sequence modelling for temporal prediction. It analysed three different sequence models to understand their prediction behaviour and found that the GRU predicted intent better than the LSTM, while the 1D CNN was the best model in terms of speed. The study involved video analysis, and the output of the pose detection model was then fed into the sequence modelling techniques to form an end-to-end deep learning framework for predicting road-crossing intent.
[48] RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features
Olga Matykina,Dmitry Yudin
Main category: cs.CV
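TL;DR: RCDINO is a multimodal transformer-based radar-camera 3D object detection model that improves detection by fusing in semantic features from the pretrained DINOv2 model, achieving SOTA results among radar-camera models on nuScenes.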
Details
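Motivation: Autonomous driving and robotics need effective fusion of radar and camera data, and existing visual backbone features lack semantic information in some scenes, which limits detection performance. Contribution: Proposes RCDINO, which enriches visual backbone features by fusing them with DINOv2 semantic features, improving 3D object detection while preserving compatibility with the baseline architecture.
Method: Uses a transformer-based multimodal fusion approach that combines DINOv2 semantic features with visual backbone features to enrich the feature representation.
Result: Reaches 56.4 NDS and 48.1 mAP on nuScenes, outperforming existing radar-camera models.
Insight: The pretrained DINOv2 model carries rich semantic information that strengthens visual feature representations and thereby improves multimodal object detection.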
Abstract: Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model’s detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
[49] An Empirical Study on How Video-LLMs Answer Video Questions
Chenhui Gou,Ziyu Ma,Zicheng Duan,Haoyu He,Feng Chen,Akide Liu,Bohan Zhuang,Jianfei Cai,Hamid Rezatofighi
Main category: cs.CV
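TL;DR: Through a systematic empirical study, this paper examines how Video Large Language Models (Video-LLMs) internally process and answer video questions. Using attention knockouts, it finds that video information extraction happens mainly in early layers, forming a two-stage process, and that certain intermediate layers play a critical role in video question answering. It also finds that spatial-temporal modeling relies more on language-guided retrieval than on self-attention among video tokens. These findings offer new perspectives for improving the efficiency and interpretability of Video-LLMs.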
Details
Motivation: 现有研究主要集中在提升Video-LLMs的性能,而对其内部机制的理解有限。论文旨在填补这一空白,通过系统性研究揭示模型如何处理和理解视频内容。Contribution: 1. 首次系统性揭示了Video-LLMs的内部工作机制。2. 设计三种注意力剔除变体(时间剔除、空间剔除和语言-视频剔除)作为分析工具。3. 提出全局和细粒度两种设置,发现早期层和某些关键中间层的作用。4. 揭示了时空建模依赖语言引导检索而非自注意力。5. 展示了如何利用这些发现减少计算开销。
Method: 采用注意力剔除技术,设计了三种变体(Video Temporal Knockout, Video Spatial Knockout, Language-to-Video Knockout),并通过调节剔除的层数范围,在全局和细粒度设置下分析模型行为。
Result: 1. 早期层主要负责视频信息提取,形成两阶段处理。2. 某些中间层对问答任务具有不成比例的影响。3. 时空建模更依赖语言引导检索而非视频令牌的自注意力。
Insight: 视频-LLM的高效设计可以针对性地优化早期层和关键中间层,同时减少对自注意力的依赖,从而降低计算成本并提高性能。
Abstract: Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing VideoLLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process – lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter’s high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
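As a rough illustration of what an attention knockout does, the sketch below builds a boolean attention mask and removes selected query-key edges. It is not the authors' implementation: the causal baseline, the reading of "temporal" as inter-frame and "spatial" as intra-frame attention among video tokens, and the names `knockout_mask`, `token_types`, and `frame_ids` are all assumptions made for this example.

```python
def knockout_mask(token_types, frame_ids, kind):
    """Return mask[q][k] == True where query token q may attend to key token k.

    token_types: list of "video" or "text" (hypothetical labels).
    frame_ids:   frame index per video token, None for text tokens.
    kind:        "temporal", "spatial", or "language_to_video".
    """
    n = len(token_types)
    # Assumed baseline: causal attention (each token sees itself and earlier tokens).
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        for k in range(q):  # self-attention (k == q) is never knocked out
            q_vid = token_types[q] == "video"
            k_vid = token_types[k] == "video"
            if kind == "language_to_video" and not q_vid and k_vid:
                mask[q][k] = False  # text queries can no longer read video keys
            elif kind == "temporal" and q_vid and k_vid and frame_ids[q] != frame_ids[k]:
                mask[q][k] = False  # cut attention across different frames
            elif kind == "spatial" and q_vid and k_vid and frame_ids[q] == frame_ids[k]:
                mask[q][k] = False  # cut attention within the same frame
    return mask


# Two frames of two video tokens each, followed by two text tokens.
types = ["video", "video", "video", "video", "text", "text"]
frames = [0, 0, 1, 1, None, None]
l2v = knockout_mask(types, frames, "language_to_video")
temporal = knockout_mask(types, frames, "temporal")
spatial = knockout_mask(types, frames, "spatial")
```

In a study of this kind, such a mask would be applied only inside a chosen window of layers, and the resulting drop in answer accuracy would indicate how much those layers depend on the removed edges.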
[50] Transfer learning optimization based on evolutionary selective fine tuning
Jacinto Colan,Ana Davila,Yasuhisa Hasegawa
Main category: cs.CV
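TL;DR: BioTune uses an evolutionary algorithm to selectively fine-tune layers of a pretrained model, improving transfer learning efficiency and reducing computational cost.
Details
Motivation: Traditional fine-tuning usually updates all model parameters, which can lead to overfitting and high computational cost, so more efficient transfer learning methods are needed. Contribution: Proposes BioTune, an evolutionary selective fine-tuning technique that improves the efficiency and performance of transfer learning.
Method: Uses an evolutionary algorithm to identify a focused set of layers to fine-tune, reducing the number of trainable parameters.
Result: Across nine image classification datasets, achieves competitive or improved accuracy and efficiency compared with existing methods such as AutoRGN and LoRA.
Insight: Selective fine-tuning can lower computational cost while maintaining or improving model performance, and suits tasks with diverse data characteristics and distributions.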
Abstract: Deep learning has shown substantial progress in image analysis. However, the computational demands of large, fully trained models remain a consideration. Transfer learning offers a strategy for adapting pre-trained models to new tasks. Traditional fine-tuning often involves updating all model parameters, which can potentially lead to overfitting and higher computational costs. This paper introduces BioTune, an evolutionary adaptive fine-tuning technique that selectively fine-tunes layers to enhance transfer learning efficiency. BioTune employs an evolutionary algorithm to identify a focused set of layers for fine-tuning, aiming to optimize model performance on a given target task. Evaluation across nine image classification datasets from various domains indicates that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. By concentrating the fine-tuning process on a subset of relevant layers, BioTune reduces the number of trainable parameters, potentially leading to decreased computational cost and facilitating more efficient transfer learning across diverse data characteristics and distributions.
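The idea of searching over which layers to fine-tune can be sketched with a small genetic algorithm over binary layer masks. This is only a toy under stated assumptions, not BioTune itself: the abstract does not describe the encoding, operators, or fitness function, so the truncation selection, one-point crossover, single-bit mutation, and the synthetic `toy_fitness` below are all illustrative choices.

```python
import random


def evolve_layer_mask(n_layers, fitness, generations=30, pop_size=12, seed=0):
    """Search for a boolean mask (True = fine-tune this layer) that maximizes fitness."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_layers)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with mutated offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_layers)   # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_layers)        # flip one random bit
            child[i] = not child[i]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(mask):
    """Synthetic stand-in for validation accuracy minus a cost per trainable layer."""
    useful_layers = {5, 6, 7}  # hypothetical: only the last three layers help
    gain = sum(1.0 for i, on in enumerate(mask) if on and i in useful_layers)
    cost = 0.2 * sum(mask)
    return gain - cost


best_mask = evolve_layer_mask(8, toy_fitness)
```

In a real setting the fitness call would involve a short fine-tuning run with the masked layers unfrozen, which is why keeping the population and generation count small matters.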
[51] Image-Conditioned 3D Gaussian Splat Quantization
Xinshuang Liu,Runfa Blark Li,Keito Suzuki,Truong Nguyen
Main category: cs.CV
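TL;DR: The paper proposes ICGS-Quantizer, an image-conditioned quantization method for 3D Gaussian Splatting that substantially improves compression efficiency and adapts to scene changes after archiving.
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) compression methods still need too much storage (they compress only to the megabyte range) and cannot adapt to scene changes after archiving, which limits large-scale use. Contribution: Proposes ICGS-Quantizer, which jointly exploits inter-Gaussian and inter-attribute correlations and uses codebooks shared across training scenes, reducing 3DGS storage to the kilobyte range and supporting image-based post-archival scene updates.
Method: Trains encoding, quantization, and decoding jointly, using shared codebooks and an image-conditioned decoding mechanism to achieve efficient compression and scene adaptability.
Result: Experiments show that ICGS-Quantizer outperforms existing methods in both compression efficiency and adaptability to scene changes, with much lower storage requirements.
Insight: Combining shared codebooks with image-conditioned decoding offers an efficient and flexible approach to large-scale 3D scene compression and long-term archiving.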
Abstract: 3D Gaussian Splatting (3DGS) has attracted considerable attention for enabling high-quality real-time rendering. Although 3DGS compression methods have been proposed for deployment on storage-constrained devices, two limitations hinder archival use: (1) they compress medium-scale scenes only to the megabyte range, which remains impractical for large-scale scenes or extensive scene collections; and (2) they lack mechanisms to accommodate scene changes after long-term archival. To address these limitations, we propose an Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer) that substantially enhances compression efficiency and provides adaptability to scene changes after archiving. ICGS-Quantizer improves quantization efficiency by jointly exploiting inter-Gaussian and inter-attribute correlations and by using shared codebooks across all training scenes, which are then fixed and applied to previously unseen test scenes, eliminating the overhead of per-scene codebooks. This approach effectively reduces the storage requirements for 3DGS to the kilobyte range while preserving visual fidelity. To enable adaptability to post-archival scene changes, ICGS-Quantizer conditions scene decoding on images captured at decoding time. The encoding, quantization, and decoding processes are trained jointly, ensuring that the codes, which are quantized representations of the scene, are effective for conditional decoding. We evaluate ICGS-Quantizer on 3D scene compression and 3D scene updating. Experimental results show that ICGS-Quantizer consistently outperforms state-of-the-art methods in compression efficiency and adaptability to scene changes. Our code, model, and data will be publicly available on GitHub.
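The storage benefit of a shared codebook can be seen in a plain vector-quantization sketch: each scene stores only integer indices, while the codebook is stored once and reused. This is generic nearest-neighbor quantization, not the ICGS-Quantizer architecture; the two-dimensional attribute vectors and the four-entry `shared_codebook` are made-up values for illustration.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2 distance)."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda j: dist2(v, codebook[j])) for v in vectors]


def dequantize(codes, codebook):
    """Recover approximate vectors by looking the indices up in the codebook."""
    return [codebook[j] for j in codes]


# One codebook, fixed after training and shared by every scene (hypothetical values).
shared_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Per-scene data: only the integer codes need to be archived.
scene_a = [(0.1, -0.1), (0.9, 0.2)]
scene_b = [(0.2, 0.8), (1.1, 0.9)]
codes_a = quantize(scene_a, shared_codebook)
codes_b = quantize(scene_b, shared_codebook)
```

Because the codebook cost is paid once rather than per scene, the per-scene footprint shrinks to the codes themselves, which is the overhead the abstract says per-scene codebooks would otherwise add.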
[52] DriveSplat: Decoupled Driving Scene Reconstruction with Geometry-enhanced Partitioned Neural Gaussians
Cong Wang,Xianda Guo,Wenbo Xu,Wei Tian,Ruiqi Song,Chenming Zhang,Lingxi Li,Long Chen
Main category: cs.CV
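TL;DR: DriveSplat is a high-quality driving scene reconstruction method based on neural Gaussian representations with dynamic-static decoupling. It combines region-wise voxel initialization, deformable neural Gaussians, and geometric prior supervision to improve novel-view synthesis.
Details
Motivation: Fast-moving vehicles, pedestrians, and large-scale static backgrounds make 3D reconstruction of driving scenes difficult. Existing 3D Gaussian Splatting methods decouple dynamic and static components but neglect geometric optimization of the background, so novel-view synthesis is not robust. Contribution: DriveSplat's main contributions are: 1) a region-wise voxel initialization scheme that partitions the scene into near, middle, and far regions to enhance detail representation; 2) deformable neural Gaussians for modeling non-rigid dynamic actors; 3) supervision from depth and normal priors to improve geometric accuracy.
Method: Partitions the scene into three voxel regions, models dynamic actors with deformable neural Gaussians, and supervises training with depth and normal priors to optimize the decoupled reconstruction of dynamic and static content.
Result: Experiments on the Waymo and KITTI datasets show state-of-the-art performance on novel-view synthesis.
Insight: Geometry-enhanced region partitioning combined with dynamic modeling gives a more robust and accurate reconstruction of complex driving scenes.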
Abstract: In the realm of driving scenarios, the presence of rapidly moving vehicles, pedestrians in motion, and large-scale static backgrounds poses significant challenges for 3D scene reconstruction. Recent methods based on 3D Gaussian Splatting address the motion blur problem by decoupling dynamic and static components within the scene. However, these decoupling strategies overlook background optimization with adequate geometry relationships and rely solely on fitting each training view by adding Gaussians. Therefore, these models exhibit limited robustness in rendering novel views and lack an accurate geometric representation. To address the above issues, we introduce DriveSplat, a high-quality reconstruction method for driving scenarios based on neural Gaussian representations with dynamic-static decoupling. To better accommodate the predominantly linear motion patterns of driving viewpoints, a region-wise voxel initialization scheme is employed, which partitions the scene into near, middle, and far regions to enhance close-range detail representation. Deformable neural Gaussians are introduced to model non-rigid dynamic actors, whose parameters are temporally adjusted by a learnable deformation network. The entire framework is further supervised by depth and normal priors from pre-trained models, improving the accuracy of geometric structures. Our method has been rigorously evaluated on the Waymo and KITTI datasets, demonstrating state-of-the-art performance in novel-view synthesis for driving scenarios.
[53] DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability
Ruizhuo Song,Beiming Yuan
Main category: cs.CV
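TL;DR: The paper uses causal chain modeling to design the DIO model for RPM problems, identifies the limitations of its mutual information objective, and proposes three improvement methods to strengthen machine abstract reasoning.
Details
Motivation: Current deep learning models have a bottleneck in abstract reasoning, and Raven's Progressive Matrices (RPM) problems are an authoritative benchmark for evaluating it. The paper aims to solve RPM problems and improve machine abstract reasoning. Contribution: 1. Designs the DIO network architecture based on causal chain modeling; 2. identifies the limitations of mutual information optimization and proposes three improvement methods; 3. improves model performance on RPM tasks.
Method: 1. Analyzes RPM tasks through causal chain modeling; 2. designs the DIO model architecture; 3. proposes three improvement methods to address the limitations of mutual information optimization.
Result: Experiments validate the improvement methods and show better model performance on RPM tasks.
Insight: Mutual information, as a statistical measure, cannot capture causal relationships; the paper compensates for this through causal chain modeling.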
Abstract: Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven’s Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a ``causal chain modeling’’ perspective to systematically analyze the complete causal chain in RPM tasks: image $\rightarrow$ abstract attributes $\rightarrow$ progressive attribute patterns $\rightarrow$ pattern consistency $\rightarrow$ correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:
[54] Spiking Variational Graph Representation Inference for Video Summarization
Wenrui Li,Wei Han,Liang-Jian Deng,Ruiqin Xiong,Xiaopeng Fan
Main category: cs.CV
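TL;DR: The paper proposes SpiVG, a video summarization method based on Spiking Neural Networks (SNN) and dynamic graph reasoning. It uses variational inference to handle noise in multi-channel feature fusion and performs well on multiple datasets.
Details
Motivation: Existing video summarization methods struggle to capture global temporal dependencies and semantic coherence, and multi-channel feature fusion is sensitive to noise, so a new method is needed that raises information density and lowers computational complexity. Contribution: 1. A keyframe extractor based on SNNs; 2. a Dynamic Aggregation Graph Reasoner; 3. a variational inference module that addresses uncertainty and noise in feature fusion.
Method: 1. Extracts keyframe features with an SNN; 2. uses the Dynamic Aggregation Graph Reasoner to decouple object consistency from semantic coherence; 3. optimizes the ELBO in the variational inference module and regularizes the posterior distribution.
Result: Outperforms existing methods on datasets including SumMe, TVSum, VideoXum, and QFVS.
Insight: The event-driven mechanism of SNNs, combined with graph reasoning and variational inference, can improve the performance and robustness of video summarization.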
Abstract: With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.
[55] From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
Anthony Bisulco,Rahul Ramesh,Randall Balestriero,Pratik Chaudhari
Main category: cs.CV
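TL;DR: This work studies how Masked Autoencoders (MAEs) learn spatial correlations in input images. It analyzes the features learned by a linear MAE and shows how masking ratio and patch size affect the capture of short- and long-range spatial correlations. The authors extend the analysis to non-linear MAEs and give practical advice on choosing MAE hyperparameters.
Details
Motivation: Although MAEs perform well as a pretraining technique for vision foundation models, they need extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers, and so on) on new datasets. Prior theoretical work has focused on attention patterns and hierarchical latent variable models, leaving the link between hyperparameters and downstream performance largely unexplored. This work aims to fill that gap. Contribution: 1) Analytically derives the features learned by a linear MAE; 2) shows how masking ratio and patch size affect the capture of spatial correlations; 3) extends the analysis to non-linear MAEs, showing that they adapt to spatial correlations beyond second-order statistics; 4) offers practical advice on choosing MAE hyperparameters.
Method: The authors first analyze a linear MAE theoretically and derive the features it learns, then show experimentally how masking ratio and patch size affect the capture of short- and long-range spatial correlations. They then extend the analysis to non-linear MAEs and verify experimentally that these capture more complex spatial correlations.
Result: Linear MAEs learn features tied to spatial correlations, and masking ratio and patch size can be used to select which range of spatial correlation is captured. Non-linear MAEs adapt to complex spatial correlations in the data, beyond second-order statistics.
Insight: 1) MAE hyperparameters can be chosen to target spatial correlations of different ranges; 2) non-linear MAEs capture more complex spatial structure; 3) in practice, adjusting masking ratio and patch size can improve MAE performance on downstream tasks.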
Abstract: Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.
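The two hyperparameters the analysis centers on can be made concrete with a small sketch of random patch masking. It assumes a square image, non-overlapping square patches, and uniform random masking; the function name `random_patch_mask` and the example sizes are illustrative and not taken from the paper.

```python
import random


def random_patch_mask(image_size, patch_size, mask_ratio, seed=0):
    """Return a per_side x per_side grid of booleans, True where a patch is masked."""
    per_side = image_size // patch_size
    n_patches = per_side * per_side
    n_masked = int(round(mask_ratio * n_patches))
    rng = random.Random(seed)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * per_side + c) in masked for c in range(per_side)] for r in range(per_side)]


# Same masking ratio, two patch sizes: the number and size of hidden regions change.
fine_grid = random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75)
coarse_grid = random_patch_mask(image_size=224, patch_size=32, mask_ratio=0.75)
n_fine_masked = sum(sum(row) for row in fine_grid)      # 147 of 196 patches
n_coarse_masked = sum(sum(row) for row in coarse_grid)  # 37 of 49 patches
```

With larger patches or a higher masking ratio, visible pixels sit farther on average from the pixels to be reconstructed, which gives some intuition for why these settings would shift the model toward longer-range correlations.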
[56] Bidirectional Temporal Information Propagation for Moving Infrared Small Target Detection
Dengyan Luo,Yanping Xiang,Hu Wang,Luping Ji,Shuai Li,Mao Ye
Main category: cs.CV
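TL;DR: The paper proposes BIRD, a bidirectional temporal information propagation method for moving infrared small target detection. It recursively fuses local and global temporal information and jointly optimizes detection over the entire video clip.
Details
Motivation: Existing learning-based multi-frame methods aggregate information from adjacent frames in a sliding window and ignore global temporal information, which leads to redundant computation and sub-optimal performance. Contribution: Proposes the bidirectional temporal propagation framework BIRD, which combines a Local Temporal Motion Fusion (LTMF) module with a Global Temporal Motion Fusion (GTMF) module and adds a Spatio-Temporal Fusion (STF) loss for joint optimization.
Method: BIRD recursively fuses local and global temporal information in forward and backward propagation branches, then fuses the bidirectional features and feeds them to the detection head.
Result: Experiments show that BIRD achieves state-of-the-art performance with fast inference speed.
Insight: Bidirectional temporal propagation makes effective use of both local and global information, improving the accuracy and efficiency of infrared small target detection.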
Abstract: Moving infrared small target detection is broadly adopted in infrared search and track systems, and has attracted considerable research focus in recent years. The existing learning-based multi-frame methods mainly aggregate the information of adjacent frames in a sliding window fashion to assist the detection of the current frame. However, the sliding-window-based methods do not consider joint optimization of the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously utilizes local temporal information of adjacent frames and global temporal information of past and future frames in a recursive fashion. Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model local spatio-temporal dependency between a target frame and its two adjacent frames. Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the global propagation feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and input into the detection head for detection. In addition, the entire video clip is jointly optimized by the traditional detection loss and the additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves the state-of-the-art performance but also shows a fast inference speed.
[57] A Curated Dataset and Deep Learning Approach for Minor Dent Detection in Vehicles
Danish Zia Baig,Mohsin Kamal
Main category: cs.CV
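TL;DR: The paper presents a YOLOv8-based deep learning method for automatically detecting minor dents on vehicle surfaces, using a custom dataset and model tuning to achieve accurate, low-latency real-time detection.
Details
Motivation: Traditional vehicle damage inspection is time-consuming and tends to miss small defects, so an automated, high-accuracy solution is needed. Contribution: 1. Creates a custom vehicle dent dataset covering varied lighting conditions and angles; 2. proposes customized YOLOv8-based models (YOLOv8m-t4 and YOLOv8m-t42) and validates their efficiency and robustness.
Method: Trains YOLOv8m and its variants (YOLOv8m-t4 and YOLOv8m-t42) with real-time data augmentation to detect minor dents.
Result: YOLOv8m-t42 performs best, with precision 0.86, recall 0.84, F1-score 0.85, and mAP@0.5 of 0.60, making it suitable for real-time applications.
Insight: Although YOLOv8m-t42 converges more slowly, its higher accuracy and consistency make it the better choice for practical vehicle damage detection.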
Abstract: Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections like microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. The paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting circumstances, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained employing real-time data augmentation approaches. Experimental results show that the technique has excellent detection accuracy and low inference latency, making it suited for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation parameters such as mean Average Precision (mAP), precision, recall, and F1-score verified the model’s efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects. The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a lower mAP@0.5:0.95 of 0.20. Furthermore, YOLOv8m-t42’s PR curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 has greater accuracy and is more appropriate for practical dent detection applications, even though its convergence is slower.
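The reported F1-scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name `f1_score` is just a local helper, not a call into any library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract.
f1_t42 = round(f1_score(0.86, 0.84), 2)  # YOLOv8m-t42
f1_t4 = round(f1_score(0.81, 0.79), 2)   # YOLOv8m-t4
print(f1_t42, f1_t4)  # prints "0.85 0.8"
```

Both values agree with the abstract's 0.85 and 0.80 after rounding to two decimals.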
[58] Aligning Moments in Time using Video Queries
Yogesh Kumar,Uday Agarwal,Manish Gupta,Anand Mishra
Main category: cs.CV
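TL;DR: The paper proposes MATR (Moment Alignment TRansformer) for the video-to-video moment retrieval (Vid2VidMR) task. MATR uses dual-stage sequence alignment and self-supervised pre-training to achieve notable performance gains.
Details
Motivation: Video-to-video moment retrieval requires semantic frame-level alignment and modeling of complex dependencies between query and target videos, which existing methods handle poorly. Contribution: 1. Proposes MATR, a transformer-based model that uses dual-stage sequence alignment to capture semantic and temporal information; 2. introduces a self-supervised pre-training technique for better model initialization; 3. clearly outperforms existing methods on the ActivityNet-VRL and SportsMoments datasets.
Method: 1. Encodes the correlations between query and target videos with dual-stage sequence alignment; 2. localizes moments with foreground/background classification and boundary prediction heads; 3. initializes the model with self-supervised pre-training that localizes random clips within videos.
Result: On ActivityNet-VRL, R@1 improves by 13.1% and mIoU by 8.1% (absolute); on the newly proposed SportsMoments dataset, R@1 improves by 14.7% and mIoU by 14.4% (absolute).
Insight: 1. Transformer architectures show promise for video moment retrieval; 2. self-supervised pre-training gives a strong task-specific initialization; 3. the dual-stage alignment strategy captures complex dependencies effectively.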
Abstract: Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
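For readers unfamiliar with the two metrics quoted above, the following sketch computes temporal IoU, mIoU, and R@1 for predicted moments given as (start, end) pairs in seconds. The IoU threshold of 0.5 for R@1 is an assumption chosen for illustration; the abstract does not state which threshold the authors use, and the example segments are made up.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU of the top-1 prediction over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def recall_at_1(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical top-1 predictions and ground-truth moments for two queries.
preds = [(2.0, 8.0), (0.0, 3.0)]
gts = [(4.0, 10.0), (5.0, 9.0)]
miou = mean_iou(preds, gts)   # (0.5 + 0.0) / 2 = 0.25
r1 = recall_at_1(preds, gts)  # one of two predictions reaches IoU 0.5
```

An absolute gain of 13.1% in R@1 therefore means 13.1 more queries per hundred have a top-ranked moment that overlaps the ground truth sufficiently.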
[59] Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework
Zongqi He,Hanmin Li,Kin-Chung Chan,Yushen Zuo,Hao Xie,Zhe Xiao,Jun Xiao,Kin-Man Lam
Main category: cs.CV
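TL;DR: This paper proposes an SfM-free 3D Gaussian Splatting (3DGS) method that uses a dense stereo module and a coherent view interpolation module to jointly optimize camera poses and 3D reconstruction from extremely sparse views, improving novel view synthesis quality.
Details
Motivation: Existing 3DGS methods depend on dense multi-view inputs and accurate camera poses. With extremely sparse views, Structure-from-Motion (SfM) initialization performs poorly and rendering quality degrades. This paper aims to remove that limitation. Contribution: 1. An SfM-free 3DGS framework; 2. a dense stereo module and a coherent view interpolation module; 3. multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization.
Method: The dense stereo module progressively estimates camera poses and reconstructs a global dense point cloud; the coherent view interpolation module generates additional supervision signals; multi-scale regularization refines geometric structure and rendered content.
Result: With extremely sparse views (only 2 training views), PSNR improves by 2.75 dB and visual quality surpasses existing methods.
Insight: Jointly optimizing camera poses and 3D reconstruction, together with view interpolation and regularization, can address novel view synthesis under sparse views.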
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated remarkable real-time performance in novel view synthesis, yet its effectiveness relies heavily on dense multi-view inputs with precisely known camera poses, which are rarely available in real-world scenarios. When input views become extremely sparse, the Structure-from-Motion (SfM) method that 3DGS depends on for initialization fails to accurately reconstruct the 3D geometric structures of scenes, resulting in degraded rendering quality. In this paper, we propose a novel SfM-free 3DGS-based method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs. Specifically, instead of SfM, we propose a dense stereo module that progressively estimates camera pose information and reconstructs a global dense point cloud for initialization. To address the inherent problem of information scarcity in extremely sparse-view settings, we propose a coherent view interpolation module that interpolates camera poses based on training view pairs and generates viewpoint-consistent content as additional supervision signals for training. Furthermore, we introduce multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization to enhance the quality of geometrical structures and rendered content. Experiments show that our method significantly outperforms other state-of-the-art 3DGS-based approaches, achieving a remarkable 2.75 dB improvement in PSNR under extremely sparse-view conditions (using only 2 training views). The images synthesized by our method exhibit minimal distortion while preserving rich high-frequency details, resulting in superior visual quality compared to existing techniques.
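To put the 2.75 dB figure in perspective, the sketch below relates PSNR to mean squared error (MSE) using the standard definition PSNR = 10 log10(MAX^2 / MSE). The baseline MSE of 0.01 is a made-up value for illustration; only the 2.75 dB gain comes from the abstract.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db decibels."""
    return 10 ** (gain_db / 10)


baseline_mse = 0.01                               # hypothetical baseline error
ratio = mse_ratio_for_gain(2.75)                  # roughly 1.88
improved_psnr = psnr(baseline_mse / ratio)
gain = improved_psnr - psnr(baseline_mse)         # 2.75 dB, up to float rounding
```

A 2.75 dB gain thus corresponds to cutting the mean squared error by a factor of about 1.88, regardless of the baseline value.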
[60] LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
Chengqi Dong,Fenghe Tang,Rongge Mao,Xinpei Gao,S. Kevin Zhou
Main category: cs.CV
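TL;DR: LGMSNet is a lightweight medical image segmentation model that uses dual-level multiscale fusion to ease the trade-off between performance and efficiency in existing models, with heterogeneous convolution kernels and a sparse transformer mechanism to improve feature extraction.
Details
Motivation: Resource-constrained clinical settings need lightweight medical image segmentation models that generalize well. Existing models often sacrifice performance for efficiency and lack effective global context perception. Channel redundancy further limits feature extraction. Contribution: Proposes LGMSNet, a lightweight framework based on local and global dual-level multiscale fusion that reduces channel redundancy with heterogeneous convolution kernels and captures global information with a sparse transformer mechanism, balancing performance and efficiency.
Method: 1. Extracts local high-frequency information with heterogeneous intra-layer kernels; 2. captures low-frequency global information with sparse transformer-convolutional hybrid branches; 3. uses a lightweight dual-level multiscale architecture.
Result: Experiments on six public datasets show that LGMSNet outperforms existing methods, and it shows strong zero-shot generalization on four unseen datasets.
Insight: Combining heterogeneous kernels with a sparse transformer can improve the global perception of lightweight models while reducing redundancy, which makes the framework practical for resource-limited medical scenarios.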
Abstract: Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet’s superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The whole project code is in https://github.com/cq-dong/LGMSNet.
[61] MExECON: Multi-view Extended Explicit Clothed humans Optimized via Normal integration
Fulden Ece Uğur,Rafael Redondo,Albert Barreiro,Stefan Hristov,Roger Marí
Main category: cs.CV
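TL;DR: MExECON is a new method for reconstructing clothed human avatars from sparse multi-view RGB images. It extends the single-view method ECON by optimizing geometry and pose estimation across multiple views.
Details
Motivation: Existing single-view methods such as ECON cannot fully exploit multi-view information, which limits reconstruction accuracy. MExECON aims to improve accuracy through multi-view consistent optimization. Contribution: 1. Proposes the JMBO algorithm, which jointly optimizes an SMPL-X body model across views; 2. captures details such as clothing folds and hairstyles through normal map integration; 3. achieves the multi-view gains without retraining any network.
Method: 1. Fits a single SMPL-X model jointly across all views with JMBO, enforcing consistency; 2. uses the optimized model as a low-frequency prior and adds high-frequency detail through normal map integration; 3. combines normal maps from front and back views to obtain a fine-grained surface.
Result: Experiments show that MExECON reconstructs with higher fidelity than the single-view baseline (ECON) and is competitive with modern few-shot 3D reconstruction methods.
Insight: Joint multi-view optimization improves reconstruction quality, and normal map integration is an effective way to capture high-frequency detail without extra training cost.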
Abstract: This work presents MExECON, a novel pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images. Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving geometry and body pose estimation. At the core of the pipeline is the proposed Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the subsequent surface reconstruction, where geometric details are added via normal map integration. MExECON integrates normal maps from both front and back views to accurately capture fine-grained surface details such as clothing folds and hairstyles. All multi-view gains are achieved without requiring any network re-training. Experimental results show that MExECON consistently improves fidelity over the single-view baseline and achieves competitive performance compared to modern few-shot 3D reconstruction methods.
[62] Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion
Mengyu Wang,Zhenyu Liu,Kun Li,Yu Wang,Yuwei Wang,Yanyan Wei,Fei Wang
Main category: cs.CV
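TL;DR: The paper proposes AdaSFFuse, a task-generalized multimodal image fusion framework that uses adaptive cross-domain co-fusion learning to address challenges such as modality misalignment and loss of high-frequency detail.
Details
Motivation: Current multimodal image fusion (MMIF) methods face modality misalignment, destruction of high-frequency detail, and task-specific limitations, so a general and efficient solution is needed. Contribution: Proposes the AdaSFFuse framework, which comprises the Adaptive Approximate Wavelet Transform (AdaWAT) and Spatial-Frequency Mamba Blocks for efficient multimodal image fusion.
Method: Uses AdaWAT for frequency decoupling, extracting and aligning the high- and low-frequency features of multimodal images; uses Spatial-Frequency Mamba Blocks for cross-modal fusion in the spatial and frequency domains.
Result: Experiments on four MMIF tasks show that AdaSFFuse performs strongly in both fusion quality and computational efficiency.
Insight: Adaptive frequency decoupling combined with cross-domain fusion lets AdaSFFuse balance performance and efficiency across a range of multimodal image fusion tasks.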
Abstract: Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks – Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) – demonstrate AdaSFFuse’s superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.
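Frequency decoupling of the kind AdaWAT performs can be pictured with the simplest fixed wavelet, the Haar transform, which splits a signal into a low-frequency average band and a high-frequency detail band. The sketch below is a fixed, one-level, one-dimensional Haar step and its inverse; it is a stand-in for intuition only and says nothing about how AdaWAT learns or adapts its filters.

```python
def haar_step(signal):
    """Split an even-length signal into low-frequency averages and high-frequency details."""
    pairs = range(0, len(signal) - 1, 2)
    low = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    high = [(signal[i] - signal[i + 1]) / 2 for i in pairs]
    return low, high


def haar_inverse(low, high):
    """Reconstruct the original signal exactly from the two bands."""
    out = []
    for avg, diff in zip(low, high):
        out += [avg + diff, avg - diff]
    return out


row = [4.0, 6.0, 10.0, 10.0]           # one row of pixel intensities (made-up values)
low_band, high_band = haar_step(row)    # ([5.0, 10.0], [-1.0, 0.0])
restored = haar_inverse(low_band, high_band)
```

A fusion method can then treat the two bands differently, for example merging low bands for overall brightness and keeping the stronger high-band response for edges, before inverting the transform.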
[63] ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors
Kaiyuan Tan,Yingying Shen,Haohui Zhu,Zhiwei Zhan,Shan Zhao,Mingfei Tu,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye
Main category: cs.CV
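TL;DR: ExtraGS is a trajectory extrapolation framework that combines geometric and generative priors, using Road Surface Gaussian and Far Field Gaussians representations to improve geometric consistency and realism.
Details
Motivation: Current methods based on generative priors show poor geometric consistency and rendering quality, so a solution is needed that keeps high fidelity while making extrapolated views more realistic. Contribution: Proposes ExtraGS, a framework that integrates geometric and generative priors, introduces the Road Surface Gaussian and Far Field Gaussians representations, and develops a self-supervised uncertainty estimation method based on spherical harmonics.
Method: Designs the Road Surface Gaussian on a hybrid Gaussian-Signed Distance Function (SDF), handles distant objects with learnable scaling factors, and uses spherical-harmonics-based uncertainty estimates to apply generative priors selectively.
Result: Experiments across multiple datasets and generative priors show that ExtraGS markedly improves the realism and geometric consistency of extrapolated views.
Insight: Combining geometric and generative priors is key, and self-supervised uncertainty estimation avoids the problems caused by applying generative priors globally.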
Abstract: Synthesizing extrapolated views from recorded driving logs is critical for simulating driving scenes for autonomous driving vehicles, yet it remains a challenging task. Recent methods leverage generative priors as pseudo ground truth, but often lead to poor geometric consistency and over-smoothed renderings. To address these limitations, we propose ExtraGS, a holistic framework for trajectory extrapolation that integrates both geometric and generative priors. At the core of ExtraGS is a novel Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-Signed Distance Function (SDF) design, and Far Field Gaussians (FFG) that use learnable scaling factors to efficiently handle distant objects. Furthermore, we develop a self-supervised uncertainty estimation framework based on spherical harmonics that enables selective integration of generative priors only where extrapolation artifacts occur. Extensive experiments on multiple datasets, diverse multi-camera setups, and various generative priors demonstrate that ExtraGS significantly enhances the realism and geometric consistency of extrapolated views, while preserving high fidelity along the original trajectory.
[64] Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors
Guotao Liang,Juncheng Hu,Ximing Xing,Jing Zhang,Qian Yu
Main category: cs.CV
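TL;DR: GroupSketch is a novel vector sketch animation method that handles multi-object interactions and complex motions through a two-stage pipeline, significantly improving animation quality and temporal consistency.
Details
Motivation: Existing methods struggle with multi-object interactions and complex motions: they are either limited to single objects or suffer from temporal inconsistency and poor generalization. GroupSketch aims to address these challenges. Contribution: 1. A two-stage pipeline (Motion Initialization and Motion Refinement); 2. A Group-based Displacement Network (GDN) that refines the animation; 3. A Context-conditioned Feature Enhancement (CCFE) module that improves temporal consistency.
Method: 1. Interactive semantic grouping and key-frame definition produce a coarse animation; 2. GDN refines the animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model; 3. The CCFE module enhances features.
Result: Experiments show that GroupSketch significantly outperforms existing methods in multi-object and complex-motion scenarios, producing high-quality, temporally consistent animations.
Insight: Grouping and motion trajectory priors, combined with a text-to-video model, effectively address the complexity and temporal consistency challenges of multi-object animation.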
Abstract: We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.
[65] D3FNet: A Differential Attention Fusion Network for Fine-Grained Road Structure Extraction in Remote Perception Systems
Chang Liu,Yang Xu,Tamas Sziranyi
Main category: cs.CV
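TL;DR: D3FNet is a differential attention fusion network for fine-grained road structure extraction from high-resolution remote sensing imagery. Three new components address the challenges of narrow-road extraction, and the network outperforms existing methods on multiple benchmarks.
Details
Motivation: Narrow roads in high-resolution remote sensing imagery have limited width, fragmented topology, and frequent occlusions, which conventional methods handle poorly. A new network design is needed to strengthen road feature extraction. Contribution: 1. A Differential Attention Dilation Extraction (DADE) module; 2. A Dual-stream Decoding Fusion Mechanism (DDFM); 3. A multi-scale dilation strategy (rates 1, 3, 5, 9).
Method: Built on the encoder-decoder structure of D-LinkNet, D3FNet adds a differential attention module and a dual-stream decoding fusion mechanism, combined with a multi-scale dilation strategy, to improve the continuity and accuracy of road extraction.
Result: On the DeepGlobe and CHN6-CUG benchmarks, D3FNet achieves superior IoU and recall on challenging road regions, confirming its effectiveness.
Insight: Combining attention mechanisms with multi-scale features can markedly improve fine-grained road extraction, especially in complex scenes.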
Abstract: Extracting narrow roads from high-resolution remote sensing imagery remains a significant challenge due to their limited width, fragmented topology, and frequent occlusions. To address these issues, we propose D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network designed for fine-grained road structure segmentation in remote perception systems. Built upon the encoder-decoder backbone of D-LinkNet, D3FNet introduces three key innovations: (1) a Differential Attention Dilation Extraction (DADE) module that enhances subtle road features while suppressing background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that integrates original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (rates 1, 3, 5, 9) that mitigates gridding artifacts and improves continuity in narrow road prediction. Unlike conventional models that overfit to generic road widths, D3FNet specifically targets fine-grained, occluded, and low-contrast road segments. Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Ablation studies further verify the complementary synergy of attention-guided encoding and dual-path decoding. These results confirm D3FNet as a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios.
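To make the multi-scale dilation strategy more concrete, the sketch below computes the receptive field of convolutions at the stated rates (1, 3, 5, 9). The abstract does not give the kernel size, nor does it say whether the dilated layers are stacked in series, run as parallel branches, or both, so the 3×3 kernel, the stride of 1, and both arrangements shown here are assumptions. `receptive_field` is a hypothetical helper for illustration, not part of D3FNet.

```python
def receptive_field(rates, kernel=3):
    """Receptive field along one axis of stacked stride-1 dilated convolutions.

    Each layer with dilation r and kernel size k widens the field by (k - 1) * r.
    """
    rf = 1
    for r in rates:
        rf += (kernel - 1) * r
    return rf


rates = [1, 3, 5, 9]  # dilation rates quoted in the abstract

# Assumption A: the four layers are applied one after another.
print(receptive_field(rates))                  # 37

# Assumption B: each rate is a separate parallel branch.
print([receptive_field([r]) for r in rates])   # [3, 7, 11, 19]
```

Under either assumption, the rates cover a range of context sizes, from a few pixels (thin road edges) to several dozen pixels (road continuity across occlusions).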
[66] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Youjia Zhang,Youngeun Kim,Young-Geun Choi,Hongyeob Kim,Huiling Liu,Sungeun Hong
Main category: cs.CV
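TL;DR: The paper proposes ADAPT, a backpropagation-free test-time adaptation method that achieves efficient distribution adaptation through probabilistic Gaussian alignment.
Details
Motivation: Existing test-time adaptation methods rely on backpropagation and iterative optimization, which limits real-time deployment and scalability, and they lack explicit modeling of class-conditional feature distributions. ADAPT aims to address both problems. Contribution: Proposes ADAPT, which models class-conditional likelihoods through Gaussian probabilistic inference to enable closed-form, training-free inference; introduces lightweight regularization guided by CLIP priors and a historical knowledge bank to correct likelihood bias.
Method: Reframes test-time adaptation as a Gaussian probabilistic inference problem, modeling class-conditional likelihoods with gradually updated class means and a shared covariance matrix; regularizes with CLIP priors and a historical knowledge bank.
Result: Across diverse distribution-shift benchmarks, ADAPT achieves state-of-the-art performance with superior scalability and robustness.
Insight: Probabilistic modeling with lightweight regularization can markedly improve the efficiency and performance of test-time adaptation without relying on backpropagation or source data.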
Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
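The idea of closed-form, gradient-free inference from class means and a shared covariance can be illustrated with a toy Gaussian classifier. This is only a sketch under stated assumptions, not the ADAPT algorithm: the shared covariance is simplified to a diagonal, the means are updated from hard pseudo-labels with a running average, and the CLIP-prior regularization and knowledge bank are omitted. The class name `GaussianTTA`, its methods, and the example numbers are all hypothetical.

```python
import math


class GaussianTTA:
    """Toy classifier: per-class means plus one shared diagonal variance."""

    def __init__(self, init_means, var, priors=None):
        self.means = [list(m) for m in init_means]
        self.counts = [1] * len(init_means)   # each initial mean counts as one sample
        self.var = list(var)                  # shared across all classes (assumed diagonal)
        k = len(init_means)
        self.log_prior = [math.log(p) for p in (priors or [1.0 / k] * k)]

    def posterior(self, x):
        """Closed-form class posterior; no gradients or iterative optimization."""
        logits = []
        for mean, lp in zip(self.means, self.log_prior):
            d2 = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, self.var))
            logits.append(lp - 0.5 * d2)
        top = max(logits)                     # subtract max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def update(self, x):
        """Predict, then move the predicted class mean toward x (running average)."""
        probs = self.posterior(x)
        k = max(range(len(probs)), key=probs.__getitem__)
        self.counts[k] += 1
        n = self.counts[k]
        self.means[k] = [m + (xi - m) / n for m, xi in zip(self.means[k], x)]
        return k, probs


model = GaussianTTA(init_means=[[0.0, 0.0], [4.0, 4.0]], var=[1.0, 1.0])
label, probs = model.update([3.0, 3.5])
print(label, model.means[1])  # 1 [3.5, 3.75]
```

Because the covariance is shared, the log-posterior is linear in the feature vector, which is what keeps each test-time step to a handful of arithmetic operations.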
[67] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
Pengcheng Fang,Yuxia Chen,Rui Guo
Main category: cs.CV
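TL;DR: The paper proposes Grounded VideoDiT, a Video LLM that significantly improves temporal localization and entity interaction in video understanding through a Diffusion Temporal Latent encoder, entity-grounded representations, and a mixed token scheme.
Details
Motivation: Existing Video LLMs are coarse in temporal perception and cannot precisely capture when events occur or how entities interact, which motivates a model that removes these limitations. Contribution: 1. A Diffusion Temporal Latent (DTL) encoder that enhances boundary sensitivity and temporal consistency; 2. Entity-grounded representations that explicitly bind query entities to visual evidence; 3. A mixed token scheme that provides explicit timestamp modeling.
Method: Combines the DTL encoder, entity-aware representations, and the mixed token scheme in the Grounded VideoDiT model to improve temporal localization and entity interaction.
Result: Achieves state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.
Insight: Explicit temporal modeling and entity binding are key to improving temporal localization and interaction reasoning in video understanding.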
Abstract: Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak in capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.
[68] Weakly-Supervised Learning for Tree Instances Segmentation in Airborne Lidar Point Clouds
Swann Emilien Céleste Destouches,Jesse Lahaye,Laurent Valentin Jospin,Jan Skaloud
Main category: cs.CV
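TL;DR: Proposes a weakly supervised method for tree instance segmentation in airborne lidar point clouds: quality ratings from a human operator train a rating model, which is then used to improve the segmentation model.
Details
Motivation: Tree instance segmentation of airborne laser scanning (ALS) data is essential for forest monitoring, but precisely labeled data is expensive to obtain, and variations in the data caused by many factors make segmentation difficult. Contribution: A weakly supervised framework in which human quality ratings train a rating model that then refines the original segmentation model, markedly improving the accuracy of tree instance identification.
Method: An initial segmentation is produced by a non-finetuned model or a closed-form algorithm; a human operator rates it; the ratings train a rating model; and feedback from the rating model is used to finetune the segmentation model.
Result: Correctly identified tree instances improve by 34%, and the number of non-tree instances predicted drops considerably.
Insight: Performance degrades in sparsely forested regions with small trees and in complex surroundings (for example, with shrubs), so further work is still needed.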
Abstract: Tree instance segmentation of airborne laser scanning (ALS) data is of utmost importance for forest monitoring, but remains challenging due to variations in the data caused by factors such as sensor resolution, vegetation state at acquisition time, terrain characteristics, etc. Moreover, obtaining a sufficient amount of precisely labeled data to train fully supervised instance segmentation methods is expensive. To address these challenges, we propose a weakly supervised approach where labels of an initial segmentation result obtained either by a non-finetuned model or a closed form algorithm are provided as a quality rating by a human operator. The labels produced during the quality assessment are then used to train a rating model, whose task is to classify a segmentation output into the same classes as specified by the human operator. Finally, the segmentation model is finetuned using feedback from the rating model. This in turn improves the original segmentation model by 34% in terms of correctly identified tree instances while considerably reducing the number of non-tree instances predicted. Challenges remain in sparsely forested regions characterized by small trees (less than two meters in height) and in complex surroundings containing shrubs, boulders, and other objects that can be confused with trees; in both cases the performance of the proposed method is reduced.
[69] MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction
Ziyang Yan,Ruikai Li,Zhiyong Cui,Bohan Li,Han Jiang,Yilong Ren,Aoyong Li,Zhenning Li,Sijia Wen,Haiyang Yu
Main category: cs.CV
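TL;DR: MapKD is a novel multi-level cross-modal knowledge distillation framework that uses a Teacher-Coach-Student (TCS) paradigm to transfer prior knowledge from multimodal models to an efficient, low-cost, vision-centric student model, significantly improving performance and inference speed.
Details
Motivation: Existing online HD map construction methods rely on stale offline maps and multi-modal sensors, which causes avoidable computational overhead at inference. This work uses knowledge distillation to transfer the prior knowledge into a low-cost student model. Contribution: 1. The MapKD framework, which uses the TCS paradigm for cross-modal knowledge transfer; 2. Two distillation strategies (TGPD and MSRD) for feature alignment and semantic learning; 3. Validation on the nuScenes dataset.
Method: 1. A camera-LiDAR fusion model (teacher) and a vision-centric coach model (with simulated LiDAR) bridge the cross-modal gap; 2. Knowledge is distilled through TGPD (2D patch distillation for bird's-eye-view feature alignment) and MSRD (semantic learning guidance); 3. A lightweight student model is trained.
Result: On nuScenes, the student model improves by +6.68 mIoU and +10.94 mAP while inference also becomes faster.
Insight: Cross-modal knowledge distillation can substantially improve models with fewer modalities (such as vision-only models) while keeping them efficient, offering a low-cost solution for HD map construction in autonomous driving.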
Abstract: Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird’s eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.
[70] CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps
Franz Hanke,Antonia Bieringer,Olaf Wysocki,Boris Jutzi
Main category: cs.CV
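TL;DR: CM2LoD3 reconstructs detailed LoD3 building models using semantic Conflict Maps (CMs) obtained from ray-to-model-prior analysis, enabling automated LoD3 model reconstruction.
Details
Motivation: LoD1 and LoD2 building models lack detailed facade elements. LoD3 models provide them, but their generation has traditionally relied on manual modeling, which makes large-scale adoption difficult. Contribution: Proposes the CM2LoD3 method, which combines semantic Conflict Maps (CMs) with segmentation of textured models to improve the accuracy of segmenting and reconstructing building openings.
Method: A Semantic Conflict Map Generator (SCMG) produces synthetic CMs, which are used to semantically segment real-world CMs; segmentation results from textured models are fused in using confidence scores.
Result: Experiments show that CM2LoD3 is effective at segmenting and reconstructing building openings, reporting 61% performance with uncertainty-aware fusion of segmented building textures.
Insight: The method offers a new approach to automated LoD3 model reconstruction and paves the way for scalable, efficient 3D city modeling.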
Abstract: Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail (LoD)1 and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available at https://github.com/InFraHank/CM2LoD3
[71] LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions
Yongju Jia,Jiarui Ma,Xiangxian Li,Baiqiao Zhang,Xianhui Cao,Juan Liu,Yulong Bian
Main category: cs.CV
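TL;DR: The paper proposes MDPR, a dynamic prompt routing framework that combines LLMs and VLMs to address class bias when fine-tuning vision-language models under long-tailed distributions, achieving results comparable to current SOTA methods.
Details
Motivation: Pre-trained vision-language models (VLMs) are prone to bias when fine-tuned in class-imbalanced, long-tailed settings. Existing methods introduce LLMs to supplement semantic information but overlook the class imbalance inherent in pre-training, which leads to bias accumulation. Contribution: Proposes the Multi-dimensional Dynamic Prompt Routing (MDPR) framework. It builds a class knowledge base spanning five visual-semantic dimensions; its dynamic routing mechanism performs global class alignment, optimal prompt retrieval, and fine-grained semantic balancing; and logits fusion improves prediction stability.
Method: MDPR combines an LLM with a VLM to build a multi-dimensional visual-semantic knowledge base. During fine-tuning, the dynamic routing mechanism retrieves prompts and balances semantics, and logits are fused to stabilize predictions.
Result: On long-tailed benchmarks including CIFAR-LT, ImageNet-LT, and Places-LT, MDPR achieves results comparable to current SOTA methods, and dynamic routing adds only minimal computational overhead.
Insight: With a multi-dimensional semantic knowledge base and dynamic routing, MDPR mitigates class bias under long-tailed distributions and offers a flexible, efficient enhancement for VLM fine-tuning.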
Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scenarios. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning with supplementary semantic information. However, they often overlook inherent class imbalance in VLMs’ pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves results comparable to current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.
[72] StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
Yanlai Yang,Zhuokai Zhao,Satya Narayan Shukla,Aashu Singh,Shlok Kumar Mishra,Lizhu Zhang,Mengye Ren
Main category: cs.CV
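TL;DR: The paper proposes StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding that compresses the KV cache to reduce memory and compute overhead, making it suitable for long videos and multi-turn conversations.
Details
Motivation: Existing video understanding methods incur high memory and compute overhead on long videos, and they usually need to encode the entire video beforehand or know the question in advance, which makes them impractical. StreamMem aims to solve these problems. Contribution: Proposes StreamMem, a query-agnostic KV cache memory mechanism that supports streaming video understanding and efficient KV cache compression in memory-constrained, long-video scenarios.
Method: Encodes new video frames in a streaming manner, compresses the KV cache using attention scores between visual tokens and generic query tokens, and maintains a fixed-size KV memory for efficient question answering.
Result: On three long video understanding benchmarks and two streaming video question answering benchmarks, StreamMem achieves SOTA performance in query-agnostic KV cache compression and is competitive with query-aware compression methods.
Insight: Query-agnostic KV cache compression lets StreamMem cut memory and compute overhead while retaining strong long-video understanding, which makes it practical for real-world use.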
Abstract: Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. Evaluation on three long video understanding and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
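The compression step described above can be sketched as score-based eviction: rank cached key/value pairs by how much attention a few proxy queries give them, and keep only a fixed budget. This is a minimal illustration under stated assumptions, not StreamMem's actual procedure. The abstract does not specify how scores are aggregated or whether tokens are merged rather than dropped, so mean softmax attention and plain top-k selection are assumptions, and `compress_kv` is a hypothetical name.

```python
import math


def compress_kv(keys, values, query_tokens, budget):
    """Keep the `budget` KV pairs with the highest mean attention from proxy queries."""
    if len(keys) <= budget:
        return keys, values, list(range(len(keys)))
    d = len(keys[0])
    scores = [0.0] * len(keys)
    for q in query_tokens:
        # Scaled dot-product attention of one proxy query over all cached keys.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        z = sum(weights)
        for i, w in enumerate(weights):
            scores[i] += w / z / len(query_tokens)
    ranked = sorted(range(len(keys)), key=lambda i: -scores[i])
    keep = sorted(ranked[:budget])  # restore temporal order of the survivors
    return [keys[i] for i in keep], [values[i] for i in keep], keep


keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.1]]
values = ["a", "b", "c", "d"]
generic_queries = [[1.0, 0.0]]  # stand-in for the generic query tokens
_, kept_values, kept_idx = compress_kv(keys, values, generic_queries, budget=2)
print(kept_idx, kept_values)  # [0, 2] ['a', 'c']
```

In a streaming loop, the same function would be applied to the concatenation of the current memory and each new frame's keys and values, so the memory never grows beyond `budget` entries regardless of video length.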
[73] WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
Zhiheng Liu,Xueqing Deng,Shoufa Chen,Angtian Wang,Qiushan Guo,Mingfei Han,Zeyue Xue,Mengzhao Chen,Ping Luo,Linjie Yang
Main category: cs.CV
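TL;DR: WorldWeaver is a framework for long video generation that jointly models RGB frames and perceptual conditions to address structural and temporal consistency over long sequences.
Details
Motivation: Existing methods rely mainly on RGB signals, which leads to accumulated errors in structure and motion during long video generation. WorldWeaver addresses this by introducing perceptual conditions. Contribution: 1. Joint prediction of perceptual conditions and color information; 2. A memory bank built from depth cues; 3. Segmented noise scheduling to reduce computational cost.
Method: Models RGB and perceptual conditions in a unified scheme, builds a memory bank from depth cues, and applies a segmented noise scheduling strategy.
Result: Experiments show that WorldWeaver effectively reduces temporal drift and improves the fidelity of generated videos.
Insight: Depth information is more resistant to drift than RGB, which helps long video generation.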
Abstract: Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.
[74] Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model
Xueyuan Li,Can Cui,Ruining Deng,Yucheng Tang,Quan Liu,Tianyuan Yao,Shunxing Bao,Naweed Chowdhury,Haichun Yang,Yuankai Huo
Main category: cs.CV
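TL;DR: The paper proposes the molecular-empowered All-in-SAM model, which combines annotation, learning, and refinement steps to improve fine-grained nuclei segmentation, reduce annotation requirements, and support resource-limited settings.
Details
Motivation: Existing vision foundation models such as SAM struggle with fine-grained semantic segmentation, for example identifying specific nuclei subtypes. The paper aims to overcome this weakness with molecular-empowered learning and thereby improve computational pathology analysis. Contribution: 1. The molecular-empowered All-in-SAM model, which integrates annotation, learning, and refinement stages; 2. Improved segmentation accuracy through a SAM adapter and Molecular-Oriented Corrective Learning (MOCL); 3. Reduced annotation workload, suitable for resource-limited environments.
Method: 1. Annotation: molecular-empowered learning reduces the need for pixel-level annotations; 2. Learning: a SAM adapter adapts the model to emphasize specific semantics; 3. Refinement: MOCL improves segmentation accuracy.
Result: On both in-house and public datasets, the model significantly improves cell classification performance and is robust to varying annotation quality.
Insight: A molecular-empowered approach can exploit the generality of foundation models while optimizing for fine-grained tasks, offering a new direction for medical image analysis.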
Abstract: Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular-empowered All-in-SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. This model incorporates a full-stack approach, focusing on: (1) annotation: engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations, (2) learning: adapting the SAM model to emphasize specific semantics, which utilizes its strong generalizability with SAM adapter, and (3) refinement: enhancing segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results from both in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even when faced with varying annotation quality. Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and automating pathology image analysis.
[75] Waver: Wave Your Way to Lifelike Video Generation
Yifu Zhang,Hao Yang,Yuqi Zhang,Yifei Hu,Fengda Zhu,Chuang Lin,Xiaofeng Mei,Yi Jiang,Zehuan Yuan,Bingyue Peng
Main category: cs.CV
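TL;DR: Waver is a high-performance foundation model for image and video generation that unifies text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation. With a Hybrid Stream DiT architecture and a data curation pipeline, Waver excels at capturing motion and maintaining consistency, ranking in the Top 3 on the T2V and I2V leaderboards.
Details
Motivation: Video generation faces challenges such as motion complexity, temporal consistency, and multimodal alignment. Waver aims to provide a unified framework that efficiently generates high-quality video and advances the field. Contribution: 1. A Hybrid Stream DiT architecture that strengthens modality alignment and accelerates training convergence; 2. A data curation pipeline that ensures training data quality; 3. Detailed training and inference recipes that support high-quality video generation.
Method: Uses the Hybrid Stream DiT architecture to improve modality alignment, filters low-quality data with an MLLM-based video quality model, and optimizes generation through the training and inference recipes.
Result: Waver performs strongly on T2V and I2V, ranking in the Top 3 on the leaderboards, outperforming open-source models and matching or surpassing commercial solutions.
Insight: Modality alignment and data quality are critical for video generation. Architectural innovation and data curation together can markedly improve motion modeling and synthesis consistency.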
Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
[76] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Jinhyung Park,Javier Romero,Shunsuke Saito,Fabian Prada,Takaaki Shiratori,Yichen Xu,Federica Bogo,Shoou-I Yu,Kris Kitani,Rawal Khirodkar
Main category: cs.CV
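TL;DR: ATLAS is a new parametric human body model that decouples skeletal and shape parameters, improving expressivity and enabling fine-grained control over body attributes.
Details
Motivation: Existing human mesh modeling approaches struggle to capture detailed variation across diverse poses and shapes because of limited training data diversity and restrictive modeling assumptions, and they introduce dependencies between the skeleton and the outer soft tissue. Contribution: By decoupling shape and skeleton parameters, ATLAS provides a high-fidelity body model that supports direct control over attributes such as body height and bone lengths and improves keypoint fitting.
Method: ATLAS is learned from 600k high-resolution scans; it grounds the mesh representation in the human skeleton, explicitly decoupling the shape and skeleton bases.
Result: Experiments show that ATLAS fits unseen subjects in diverse poses more accurately, and that its non-linear pose correctives are more effective than linear models.
Insight: Decoupling skeletal and shape parameters markedly improves the expressivity and controllability of human body models, particularly for complex poses.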
Abstract: Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.
[77] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Yanxu Meng,Haoning Wu,Ya Zhang,Weidi Xie
Main category: cs.CV
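TL;DR: SceneGen is a novel single-image 3D scene generation framework that produces multiple 3D assets with geometry and texture in one feedforward pass, with no optimization or asset retrieval, and it extends to multi-image input.
Details
Motivation: 3D content generation has broad applications in VR/AR and embodied AI, but existing methods require complex optimization or asset retrieval, which limits efficiency and practicality. SceneGen aims to provide an efficient 3D scene generation solution. Contribution: 1) The SceneGen framework, which generates multiple 3D assets from a single image; 2) A feature aggregation module that combines local and global information; 3) Extensibility to multi-image input; 4) Experiments validating efficiency and robustness.
Method: SceneGen extracts features with visual and geometric encoders, integrates local and global information through the feature aggregation module, and uses a position head to generate 3D assets together with their spatial positions. The architecture supports multi-image input.
Result: Experiments show that SceneGen efficiently generates high-quality 3D assets in a single feedforward pass and performs better with multi-image input. Quantitative and qualitative evaluations confirm that it outperforms existing methods.
Insight: SceneGen demonstrates the potential of generating complex 3D scenes in a single feedforward pass and offers a practical solution for downstream tasks; it may be extended to dynamic scene generation in the future.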
Abstract: 3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen’s direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
[78] Visual Autoregressive Modeling for Instruction-Guided Image Editing
Qingyang Mao,Qi Cai,Yehao Li,Yingwei Pan,Mingyue Cheng,Ting Yao,Qi Liu,Tao Mei
Main category: cs.CV
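TL;DR: VAREdit is an image editing method built on the visual autoregressive (VAR) framework. By predicting multi-scale features sequentially, it avoids the global denoising problem that diffusion models face in instruction-guided editing and significantly improves editing precision and efficiency.
Details
Motivation: Diffusion models perform well at instruction-guided image editing, but their global denoising process entangles the edited region with the entire image context, causing unintended spurious modifications and weaker adherence to instructions. Autoregressive models avoid this through sequential generation. Contribution: Proposes the VAREdit framework, which reframes image editing as a multi-scale feature prediction problem, and designs a Scale-Aligned Reference (SAR) module that resolves the scale mismatch between source image features and target features.
Method: VAREdit generates multi-scale target features autoregressively and uses the SAR module to inject scale-matched conditioning information, improving the guidance provided by source image features.
Result: On standard benchmarks, VAREdit's GPT-Balance score is more than 30% higher than that of leading diffusion-based methods, and a 512×512 edit takes only 1.2 seconds, 2.2× faster than the similarly sized UltraEdit.
Insight: The sequential, causal mechanism of autoregressive models is better suited to localized edits and instruction adherence in image editing, and scale-matched conditioning across multi-scale features is the key technical advance.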
Abstract: Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a GPT-Balance score more than 30% higher. Moreover, it completes a 512×512 edit in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
[79] Scaling Group Inference for Diverse and High-Quality Generation
Gaurav Parmar,Or Patashnik,Daniil Ostashev,Kuan-Chieh Wang,Kfir Aberman,Srinivasa Narasimhan,Jun-Yan Zhu
Main category: cs.CV
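TL;DR: The paper proposes a scalable group inference method that models candidate outputs as graph nodes and selects a subset to optimize both quality and diversity, significantly improving the diversity and quality of generated samples.
Details
Motivation: Generative models usually sample outputs independently, so when users are shown several images the results are often redundant, which limits choice and exploration. This paper aims to solve that problem. Contribution: Proposes a new scalable group inference method that formulates group inference as a quadratic integer assignment problem and introduces progressive pruning to improve efficiency.
Method: Candidate outputs are modeled as graph nodes; a subset is selected by optimizing quality (unary term) and diversity (binary term); and progressive pruning scales the method to large candidate sets.
Result: Experiments show that the method significantly outperforms independent sampling baselines and recent inference algorithms in diversity and quality, and that it applies to a wide range of tasks.
Insight: Generative models should treat multiple outputs as a cohesive group rather than as independent samples, which improves the user experience and broadens task applicability.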
Abstract: Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
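The selection objective described above, a unary quality term plus a pairwise diversity term over a chosen subset, can be written down directly for a small candidate pool. The sketch below is an assumption-laden toy: it solves the objective by exhaustive search rather than with the paper's solver or its progressive pruning, and the quality scores, distance matrix, trade-off weight `lam`, and function names are all hypothetical.

```python
from itertools import combinations


def group_score(subset, quality, dist, lam):
    """Sum of per-sample quality (unary) plus lam * pairwise distance (binary)."""
    unary = sum(quality[i] for i in subset)
    binary = sum(dist[i][j] for i, j in combinations(subset, 2))
    return unary + lam * binary


def select_group(quality, dist, k, lam=1.0):
    """Exhaustively pick the size-k subset with the best combined score."""
    candidates = combinations(range(len(quality)), k)
    return max(candidates, key=lambda s: group_score(s, quality, dist, lam))


# Four candidates; 0 and 1 are near-duplicates (distance 0.05).
quality = [0.9, 0.88, 0.5, 0.7]
dist = [
    [0.0, 0.05, 0.9, 0.6],
    [0.05, 0.0, 0.9, 0.6],
    [0.9, 0.9, 0.0, 0.5],
    [0.6, 0.6, 0.5, 0.0],
]

print(select_group(quality, dist, k=2, lam=0.0))  # (0, 1): quality only, redundant pair
print(select_group(quality, dist, k=2, lam=1.0))  # (0, 2): diversity term breaks the tie
```

Exhaustive search grows combinatorially with the pool size, which is why a method aimed at large candidate sets needs something like the pruning step the abstract mentions.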
[80] CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Haonan Qiu,Ning Yu,Ziqi Huang,Paul Debevec,Ziwei Liu
Main category: cs.CV
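TL;DR: CineScale is a new inference paradigm for higher-resolution visual generation. It addresses the issues of two types of video generation architectures with dedicated variants and extends high-resolution generation to I2V and V2V synthesis, producing 8K images without fine-tuning and 4K video with only minimal LoRA fine-tuning.
Details
Motivation: Because high-resolution data is scarce and compute is limited, existing visual diffusion models are typically trained at limited resolutions, and repetitive patterns appear when they generate high-resolution content. CineScale aims to solve this and unlock the high-resolution potential of pre-trained models. Contribution: Proposes the CineScale inference paradigm, with dedicated variants for different video generation architectures, and extends high-resolution generation to I2V and V2V synthesis.
Method: Designs custom variants for different video generation architectures; supports 8K image generation without fine-tuning and 4K video generation with only minimal LoRA fine-tuning.
Result: Experiments validate CineScale's superiority in extending higher-resolution visual generation; it can generate 8K images and 4K video.
Insight: By optimizing the inference process rather than the model itself, CineScale offers a "free lunch" for high-resolution generation, which is especially suitable for resource-constrained settings.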
Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.
cs.RO [Back]
[81] A Vision-Based Shared-Control Teleoperation Scheme for Controlling the Robotic Arm of a Four-Legged Robot
Murilo Vinicius da Silva,Matheus Hipolito Carvalho,Juliano Negri,Thiago Segreto,Gustavo J. G. Lahr,Ricardo V. Godoy,Marcelo Becker
Main category: cs.RO
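TL;DR: This paper proposes a vision-based shared-control teleoperation scheme for controlling the robotic arm mounted on a quadruped robot, improving safety and operating efficiency through intuitive motion mapping and trajectory planning.
Details
Motivation: In hazardous and remote environments, quadruped robots and their manipulator arms need safer and more efficient control. Current teleoperation methods (such as joysticks) are complex and unintuitive, increasing the operator's cognitive load and the risk of collisions. Contribution: Proposes an intuitive vision-based remote control system that uses an external camera and a machine learning model to detect the operator's wrist position, maps it to arm control commands in real time, and combines this with trajectory planning to ensure safety.
Method: 1. Detect the operator's wrist position with an external camera and a machine learning-based model; 2. Map wrist movements to robotic arm control commands; 3. Use a trajectory planner to detect and avoid collisions in real time.
Result: The system's real-time control and robust performance were validated on a real robot, demonstrating safety, precision, and ease of use for industrial applications.
Insight: Directly mapping human arm motion to the manipulator, combined with collision detection, can substantially simplify operation and reduce cognitive load, making the approach suitable for industrial applications in high-risk environments.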
Abstract: In hazardous and remote environments, robotic systems perform critical tasks demanding improved safety and efficiency. Among these, quadruped robots with manipulator arms offer mobility and versatility for complex operations. However, teleoperating quadruped robots is challenging due to the lack of integrated obstacle detection and intuitive control methods for the robotic arm, increasing collision risks in confined or dynamically changing workspaces. Teleoperation via joysticks or pads can be non-intuitive and demands a high level of expertise due to its complexity, culminating in a high cognitive load on the operator. To address this challenge, a teleoperation approach that directly maps human arm movements to the robotic manipulator offers a simpler and more accessible solution. This work proposes an intuitive remote control by leveraging a vision-based pose estimation pipeline that utilizes an external camera with a machine learning-based model to detect the operator’s wrist position. The system maps these wrist movements into robotic arm commands to control the robot’s arm in real-time. A trajectory planner ensures safe teleoperation by detecting and preventing collisions with both obstacles and the robotic arm itself. The system was validated on the real robot, demonstrating robust performance in real-time control. This teleoperation approach provides a cost-effective solution for industrial applications where safety, precision, and ease of use are paramount, ensuring reliable and intuitive robotic control in high-risk environments.
[82] Decentralized Vision-Based Autonomous Aerial Wildlife Monitoring
Makram Chahine,William Yang,Alaa Maalouf,Justin Siriska,Ninad Jadhav,Daniel Vogt,Stephanie Gil,Robert Wood,Daniela Rus
Main category: cs.RO
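TL;DR: This paper proposes a vision-based decentralized multi-quadrotor system for wildlife monitoring that is scalable, low-bandwidth, and sensor-minimal.
Details
Motivation: Existing wildlife monitoring methods either approach the problem from the herd perspective or rely on manual operation and are limited in scale, so they cannot meet the need for efficient, large-scale parallel deployment. Contribution: Proposes a decentralized vision system based on a single onboard RGB camera and develops coordination and tracking algorithms for dynamic, unstructured environments that require no centralized communication or control.
Method: Uses a decentralized multi-quadrotor system with novel vision-based coordination and tracking algorithms; each vehicle operates independently and performs robustly in natural environments.
Result: Real-world experiments validate the system's reliability across diverse field conditions, where it identifies and tracks large species.
Insight: Decentralized vision systems show promise in complex environments, enabling efficient, low-bandwidth monitoring without centralized control.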
Abstract: Wildlife field operations demand efficient parallel deployment methods to identify and interact with specific individuals, enabling simultaneous collective behavioral analysis, and health and safety interventions. Previous robotics solutions approach the problem from the herd perspective, or are manually operated and limited in scale. We propose a decentralized vision-based multi-quadrotor system for wildlife monitoring that is scalable, low-bandwidth, and sensor-minimal (single onboard RGB camera). Our approach enables robust identification and tracking of large species in their natural habitat. We develop novel vision-based coordination and tracking algorithms designed for dynamic, unstructured environments without reliance on centralized communication or control. We validate our system through real-world experiments, demonstrating reliable deployment in diverse field conditions.
[83] Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation
Huy Hoang Nguyen,Johannes Huemer,Markus Murschitz,Tobias Glueck,Minh Nhat Vu,Andreas Kugi
Main category: cs.RO
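TL;DR: Lang2Lift is a framework that combines natural-language-guided pallet detection with pose estimation for autonomous outdoor forklift operation, addressing the challenges of automated pallet handling in the logistics and construction industries.
Details
Motivation: Automating outdoor pallet handling in logistics and construction faces many challenges, such as variable payloads, inconsistent pallet quality and dimensions, and unstructured surroundings. Labor shortages and safety concerns further motivate automated solutions. Contribution: The main contribution is the Lang2Lift framework, which performs pallet detection and 6D pose estimation from natural language instructions and is integrated into autonomous forklift operation.
Method: Combines Florence-2 and SAM-2 for language-grounded segmentation, uses FoundationPose for robust pose estimation in cluttered outdoor scenes, and feeds the resulting poses to a motion planning module for fully autonomous operation.
Result: Validated on the ADAPT autonomous forklift platform, reaching 0.76 mIoU for pallet segmentation and demonstrating the system's feasibility in real logistics and construction environments.
Insight: Language-guided perception can markedly improve the adaptability and flexibility of autonomous forklifts in complex outdoor environments, offering a new direction for logistics automation.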
Abstract: The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as “pick up the steel beam pallet near the crane.” The perception pipeline integrates Florence-2 and SAM-2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system’s robustness and confirm its feasibility for deployment in operational logistics and construction environments. Video demonstrations are available at https://eric-nguyen1402.github.io/lang2lift.github.io/
q-bio.QM [Back]
[84] Fusing Structural Phenotypes with Functional Data for Early Prediction of Primary Angle Closure Glaucoma Progression
Swati Sharma,Thanadet Chuangsuwanich,Royston K. Y. Tan,Shimna C. Prasad,Tin A. Tun,Shamira A. Perera,Martin L. Buist,Tin Aung,Monisha E. Nongpiur,Michaël J. A. Girard
Main category: q-bio.QM
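TL;DR: By combining optic nerve head structural features with visual field functional parameters, machine learning models (such as Random Forest) classify primary angle closure glaucoma eyes as fast or slow progressors; the combined features significantly improve classification performance.
Details
Motivation: The rate of progression in primary angle closure glaucoma (PACG) is critical for clinical management, but effective prediction methods are lacking. The paper aims to improve early classification of progression risk by integrating structural (OCT) and functional (visual field) data. Contribution: The main contribution is a machine learning approach that combines optic nerve head (ONH) structural parameters with visual field (VF) functional data, significantly improving PACG progression classification (AUC = 0.87). In addition, SHAP analysis identifies six key predictors.
Method: 1) Include PACG patient data and define criteria for fast and slow progression; 2) use AI segmentation of OCT volumes to extract 31 ONH parameters; 3) combine region-wise VF sensitivity with the structural parameters to train machine learning models (such as Random Forest); 4) use SHAP to identify key predictors.
Result: Across 451 eyes, the Random Forest model combining structural and functional features performed best (AUC = 0.87), outperforming models using only structural (AUC = 0.82) or only functional (AUC = 0.78) features. Key predictors include inferior MRW and RNFL thickness.
Insight: ONH structure (for example, inferior ONH morphology) plays a key role in PACG progression, and combining structural and functional data can significantly improve prediction, suggesting a new direction for clinical monitoring.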
Abstract: Purpose: To classify eyes as slow or fast glaucoma progressors in patients with primary angle closure glaucoma (PACG) using an integrated approach combining optic nerve head (ONH) structural features and sector-based visual field (VF) functional parameters. Methods: PACG patients with >5 reliable VF tests over >5 years were included. Progression was assessed in Zeiss Forum, with baseline VF within six months of OCT. Fast progression was VFI decline <-2.0% per year; slow progression >-2.0% per year. OCT volumes were AI-segmented to extract 31 ONH parameters. The Glaucoma Hemifield Test defined five regions per hemifield, aligned with RNFL distribution. Mean sensitivity per region was combined with structural parameters to train ML classifiers. Multiple models were tested, and SHAP identified key predictors. Main outcome measures: Classification of slow versus fast progressors using combined structural and functional data. Results: We analyzed 451 eyes from 299 patients. Mean VFI progression was -0.92% per year; 369 eyes progressed slowly and 82 rapidly. The Random Forest model combining structural and functional features achieved the best performance (AUC = 0.87, 2000 Monte Carlo iterations). SHAP identified six key predictors: inferior MRW, inferior and inferior-temporal RNFL thickness, nasal-temporal LC curvature, superior nasal VF sensitivity, and inferior RNFL and GCL+IPL thickness. Models using only structural or functional features performed worse with AUC of 0.82 and 0.78, respectively. Conclusions: Combining ONH structural and VF functional parameters significantly improves classification of progression risk in PACG. Inferior ONH features, MRW and RNFL thickness, were the most predictive, highlighting the critical role of ONH morphology in monitoring disease progression.
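The fast/slow labeling rule stated in the Methods (VFI decline below or above -2.0% per year) can be sketched as follows. The study assessed progression in Zeiss Forum; the ordinary least-squares slope used here is an assumed stand-in, and `vfi_slope` and `label_progression` are hypothetical names. The abstract does not say how a slope of exactly -2.0 is handled, so this sketch labels it "slow".

```python
def vfi_slope(years, vfi):
    """Least-squares slope of VFI (%) against time (years). Assumed estimator."""
    n = len(years)
    mean_t = sum(years) / n
    mean_v = sum(vfi) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, vfi))
    var = sum((t - mean_t) ** 2 for t in years)
    return cov / var


def label_progression(years, vfi, threshold=-2.0):
    """Return 'fast' when VFI declines faster than the threshold (% per year)."""
    return "fast" if vfi_slope(years, vfi) < threshold else "slow"


# Illustrative (made-up) visual field series over six annual visits.
visits = [0, 1, 2, 3, 4, 5]
fast_eye = [98, 95, 92, 89, 86, 83]            # loses 3% per year
slow_eye = [98, 97.5, 97, 96.5, 96, 95.5]      # loses 0.5% per year
labels = [label_progression(visits, fast_eye), label_progression(visits, slow_eye)]
```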
eess.IV [Back]
[85] Scalable Event-Based Video Streaming for Machines with MoQ
Andrew C. Freeman
Main category: eess.IV
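TL;DR: The paper proposes a new low-latency event streaming format based on the Media Over QUIC protocol draft, designed for neuromorphic event sensors, addressing event-data transmission needs that traditional video streaming techniques do not meet.
Details
Motivation: Traditional video streaming techniques (such as lossy compression and rate-adaptive streaming) were designed for human video consumption, whereas the asynchronous pixel data produced by neuromorphic "event" sensors calls for a transmission scheme optimized for computer vision. Research so far has focused mainly on application development and neglected the crucial problem of data transmission. Contribution: 1. Proposes a new low-latency event streaming format based on the Media Over QUIC protocol draft; 2. Addresses the transmission of event sensor data, filling a research gap.
Method: Analyzes the technical issues of existing event video systems and, building on the latest additions to the Media Over QUIC protocol draft, designs a scalable event streaming format.
Result: The proposed format provides a transmission solution for event sensor data aimed at low-latency, high-throughput machine vision applications.
Insight: Event stream transmission is an underexplored area in computer vision applications; new streaming protocols could significantly improve the performance and data-transmission efficiency of machine vision systems.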
Abstract: Lossy compression and rate-adaptive streaming are a mainstay in traditional video streams. However, a new class of neuromorphic "event" sensors records video with asynchronous pixel samples rather than image frames. These sensors are designed for computer vision applications, rather than human video consumption. Until now, researchers have focused their efforts primarily on application development, ignoring the crucial problem of data transmission. We survey the landscape of event-based video systems, discuss the technical issues with our recent scalable event streaming work, and propose a new low-latency event streaming format based on the latest additions to the Media Over QUIC protocol draft.
[86] Zero-shot Volumetric CT Super-Resolution using 3D Gaussian Splatting with Upsampled 2D X-ray Projection Priors
Jeonghyun Noh,Hyun-Jic Oh,Byungju Chae,Won-Ki Jeong
Main category: eess.IV
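TL;DR: This paper proposes a zero-shot 3D CT super-resolution framework that uses 2D X-ray projection priors generated by a diffusion model, together with 3D Gaussian splatting and negative alpha blending, to significantly improve 3D CT reconstruction quality.
Details
Motivation: Acquiring high-resolution CT is limited by radiation risk; existing supervised deep learning methods require paired data, while zero-shot methods struggle to recover fine detail. The paper aims to use abundant 2D X-ray data as an external prior for 3D CT reconstruction. Contribution: 1. Proposes a zero-shot 3D CT super-resolution framework combining a diffusion model with 3D Gaussian splatting; 2. Introduces negative alpha blending (NAB-GS), which allows negative density values and enhances reconstruction of high-frequency structures.
Method: 1. Use a diffusion model to generate high-resolution 2D X-ray projections as priors; 2. Apply a per-projection adaptive sampling strategy; 3. Reconstruct the 3D volume with 3D Gaussian splatting and NAB-GS.
Result: The method's effectiveness is validated on two datasets, with quantitative and qualitative results superior to existing methods.
Insight: Using 2D X-ray data as a prior can effectively improve 3D CT super-resolution quality, and negative alpha blending offers a new approach to density representation.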
Abstract: Computed tomography (CT) is widely used in clinical diagnosis, but acquiring high-resolution (HR) CT is limited by radiation exposure risks. Deep learning-based super-resolution (SR) methods have been studied to reconstruct HR from low-resolution (LR) inputs. While supervised SR approaches have shown promising results, they require large-scale paired LR-HR volume datasets that are often unavailable. In contrast, zero-shot methods alleviate the need for paired data by using only a single LR input, but typically struggle to recover fine anatomical details due to limited internal information. To overcome these, we propose a novel zero-shot 3D CT SR framework that leverages upsampled 2D X-ray projection priors generated by a diffusion model. Exploiting the abundance of HR 2D X-ray data, we train a diffusion model on large-scale 2D X-ray projection and introduce a per-projection adaptive sampling strategy. It selects the generative process for each projection, thus providing HR projections as strong external priors for 3D CT reconstruction. These projections serve as inputs to 3D Gaussian splatting for reconstructing a 3D CT volume. Furthermore, we propose negative alpha blending (NAB-GS) that allows negative values in Gaussian density representation. NAB-GS enables residual learning between LR and diffusion-based projections, thereby enhancing high-frequency structure reconstruction. Experiments on two datasets show that our method achieves superior quantitative and qualitative results for 3D CT SR.
[87] Pathology-Informed Latent Diffusion Model for Anomaly Detection in Lymph Node Metastasis
Jiamu Wang,Keunho Byeon,Jinsol Song,Anh Nguyen,Sangjeong Ahn,Sung Hak Lee,Jin Tae Kwak
Main category: eess.IV
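TL;DR: The paper proposes a pathology-informed anomaly detection method that combines a vision-language model with a diffusion model, using histopathology prompts for unsupervised anomaly detection; it avoids dependence on annotated data and is validated on gastric and breast lymph node data.
Details
Motivation: Anomaly detection in digital pathology typically needs large annotated datasets, and data scarcity limits supervised learning. Unsupervised anomaly detection avoids this limitation, but more effective methods are still needed to distinguish normal from abnormal tissue. Contribution: Proposes a pathology-informed latent diffusion model (AnoPILaD) that guides the reconstruction process with histopathology prompts to perform unsupervised anomaly detection.
Method: Combines a vision-language model with a diffusion model, using pathology-related keywords as prompts to distinguish normal from abnormal tissue during reconstruction. Experiments are conducted on gastric and breast lymph node datasets.
Result: Experiments indicate that the method performs well in unsupervised anomaly detection and shows potential for cross-organ generalization.
Insight: Combining pathology prior knowledge with a diffusion model can improve unsupervised anomaly detection, providing digital pathology with an efficient solution that does not require annotated data.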
Abstract: Anomaly detection is an emerging approach in digital pathology for its ability to efficiently and effectively utilize data for disease diagnosis. While supervised learning approaches deliver high accuracy, they rely on extensively annotated datasets, suffering from data scarcity in digital pathology. Unsupervised anomaly detection, however, offers a viable alternative by identifying deviations from normal tissue distributions without requiring exhaustive annotations. Recently, denoising diffusion probabilistic models have gained popularity in unsupervised anomaly detection, achieving promising performance in both natural and medical imaging datasets. Building on this, we incorporate a vision-language model with a diffusion model for unsupervised anomaly detection in digital pathology, utilizing histopathology prompts during reconstruction. Our approach employs a set of pathology-related keywords associated with normal tissues to guide the reconstruction process, facilitating the differentiation between normal and abnormal tissues. To evaluate the effectiveness of the proposed method, we conduct experiments on a gastric lymph node dataset from a local hospital and assess its generalization ability under domain shift using a public breast lymph node dataset. The experimental results highlight the potential of the proposed method for unsupervised anomaly detection across various organs in digital pathology. Code: https://github.com/QuIIL/AnoPILaD.
[88] Explainable Knowledge Distillation for Efficient Medical Image Classification
Aqib Nazir Mir,Danish Raza Rizvi
Main category: eess.IV
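TL;DR: The paper explores an efficient medical image classification approach based on knowledge distillation, in which high-capacity teacher models (VGG19 and lightweight Vision Transformers) guide the training of a student model, combined with Score-CAM visualizations to improve interpretability.
Details
Motivation: Resource-constrained clinical settings need efficient and explainable medical AI models that balance classification performance and computational efficiency. Contribution: Proposes an approach combining knowledge distillation with visual explanation, yielding an accurate and lightweight medical image classification model.
Method: Uses VGG19 and lightweight Vision Transformers as teacher models to guide the training of a student model (derived from the OFA-595 supernet), and analyzes the models' attention regions with Score-CAM visualizations.
Result: The student model maintains high classification performance while significantly reducing parameter count and inference time, making it suitable for resource-constrained clinical settings.
Insight: Combining model efficiency with explainability is key to building practical, trustworthy medical AI solutions.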
Abstract: This study comprehensively explores knowledge distillation frameworks for COVID-19 and lung cancer classification using chest X-ray (CXR) images. We employ high-capacity teacher models, including VGG19 and lightweight Vision Transformers (Visformer-S and AutoFormer-V2-T), to guide the training of a compact, hardware-aware student model derived from the OFA-595 supernet. Our approach leverages hybrid supervision, combining ground-truth labels with teacher models’ soft targets to balance accuracy and computational efficiency. We validate our models on two benchmark datasets: COVID-QU-Ex and LCS25000, covering multiple classes, including COVID-19, healthy, non-COVID pneumonia, lung, and colon cancer. To interpret the spatial focus of the models, we employ Score-CAM-based visualizations, which provide insight into the reasoning process of both teacher and student networks. The results demonstrate that the distilled student model maintains high classification performance with significantly reduced parameters and inference time, making it an optimal choice in resource-constrained clinical environments. Our work underscores the importance of combining model efficiency with explainability for practical, trustworthy medical AI solutions.
[89] DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation
Uğurcan Akyüz,Deniz Katircioglu-Öztürk,Emre K. Süslü,Burhan Keleş,Mete C. Kaya,Gamze Durhan,Meltem G. Akpınar,Figen B. Demirkazık,Gözde B. Akar
Main category: eess.IV
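TL;DR: The paper proposes the DoSReMC framework, which fine-tunes only the batch normalization and fully connected layers to improve the generalization of mammography classification models across domains, addressing domain shift.
Details
Motivation: Existing deep learning models lose performance in cross-domain mammography classification, limiting the safe and equitable clinical use of AI. Contribution: Proposes the DoSReMC framework, shows that batch normalization layers are a primary source of domain dependence, and offers a lightweight solution that fine-tunes only the BN and FC layers.
Method: Fine-tunes the BN and FC layers and combines this with adversarial training to improve cross-domain generalization, while preserving the pretrained convolutional parameters.
Result: DoSReMC's effectiveness is validated on multiple large-scale FFDM datasets, with significantly improved cross-domain classification performance.
Insight: Batch normalization layers are a key factor in domain dependence; targeted fine-tuning and adversarial training are effective strategies for improving cross-domain generalization.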
Abstract: Numerous deep learning-based solutions have been developed for the automatic recognition of breast cancer using mammography images. However, their performance often declines when applied to data from different domains, primarily due to domain shift - the variation in data distributions between source and target domains. This performance drop limits the safe and equitable deployment of AI in real-world clinical settings. In this study, we present DoSReMC (Domain Shift Resilient Mammography Classification), a batch normalization (BN) adaptation framework designed to enhance cross-domain generalization without retraining the entire model. Using three large-scale full-field digital mammography (FFDM) datasets - including HCTP, a newly introduced, pathologically confirmed in-house dataset - we conduct a systematic cross-domain evaluation with convolutional neural networks (CNNs). Our results demonstrate that BN layers are a primary source of domain dependence: they perform effectively when training and testing occur within the same domain, and they significantly impair model generalization under domain shift. DoSReMC addresses this limitation by fine-tuning only the BN and fully connected (FC) layers, while preserving pretrained convolutional filters. We further integrate this targeted adaptation with an adversarial training scheme, yielding additional improvements in cross-domain generalizability. DoSReMC can be readily incorporated into existing AI pipelines and applied across diverse clinical environments, providing a practical pathway toward more robust and generalizable mammography classification systems.
[90] Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising
Jin Ye,Jingran Wang,Fengchao Xiong,Jingzhou Chen,Yuntao Qian
Main category: eess.IV
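TL;DR: The paper proposes a Deep Equilibrium (DEQ) based convolutional sparse coding framework (DECSC) for hyperspectral image denoising that combines local spatial-spectral correlations, nonlocal spatial self-similarity, and global spatial consistency.
Details
Motivation: Hyperspectral images (HSIs) are often degraded by complex noise patterns, and traditional unfolding-based methods lack convergence guarantees. DEQ models emulate infinite-depth networks by solving a fixed-point problem, which fits the optimization more naturally. Contribution: 1. Proposes the DECSC framework, unifying local, nonlocal, and global information; 2. Introduces shared 2D convolution and unshared 3D convolution within the CSC framework; 3. Embeds a transformer block and a detail enhancement module.
Method: 1. Recast the proximal gradient descent of the CSC model as a fixed-point problem within the DEQ framework; 2. Combine 2D and 3D convolutional sparse representations; 3. Add a transformer block and a detail enhancement module.
Result: Experiments show that DECSC outperforms existing methods in denoising performance.
Insight: The infinite-depth nature of the DEQ framework fits the optimization problem better and allows multiple image priors to be combined effectively to improve denoising.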
Abstract: Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Ensuring the physical property of the denoised HSIs is vital for robust HSI denoising, giving rise to deep unfolding-based methods. However, these methods map the optimization of a physical model to a learnable network with a predefined depth, which lacks convergence guarantees. In contrast, Deep Equilibrium (DEQ) models treat the hidden layers of deep networks as the solution to a fixed-point problem and model them as infinite-depth networks, naturally consistent with the optimization. Under the framework of DEQ, we propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency for robust HSI denoising. Within the convolutional sparse coding (CSC) framework, we enforce shared 2D convolutional sparse representation to ensure global spatial consistency across bands, while unshared 3D convolutional sparse representation captures local spatial-spectral details. To further exploit nonlocal self-similarities, a transformer block is embedded after the 2D CSC. Additionally, a detail enhancement module is integrated with the 3D CSC to promote image detail preservation. We formulate the proximal gradient descent of the CSC model as a fixed-point problem and transform the iterative updates into a learnable network architecture within the framework of DEQ. Experimental results demonstrate that our DECSC method achieves superior denoising performance compared to state-of-the-art methods.
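The idea of treating proximal gradient descent as a fixed-point problem can be illustrated on plain (non-convolutional) sparse coding. This is only a sketch of the underlying iteration, not DECSC: it has no learned parameters, no convolutions, and no implicit differentiation, and the names `ista_step`, `solve_fixed_point`, `D`, `step`, and `lam` are hypothetical. It assumes `step` is no larger than the reciprocal of the largest eigenvalue of D^T D, so that the iteration converges.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return [max(abs(a) - t, 0.0) * (1.0 if a > 0 else -1.0) for a in v]


def ista_step(x, D, y, step, lam):
    """One proximal gradient update for 0.5 * ||D x - y||^2 + lam * ||x||_1."""
    rows, cols = len(D), len(D[0])
    residual = [sum(D[i][j] * x[j] for j in range(cols)) - y[i] for i in range(rows)]
    grad = [sum(D[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
    return soft_threshold([x[j] - step * grad[j] for j in range(cols)], step * lam)


def solve_fixed_point(D, y, step, lam, tol=1e-10, max_iter=10000):
    """Iterate x <- f(x) until x stops changing, i.e. until x is (nearly) a fixed point."""
    x = [0.0] * len(D[0])
    for _ in range(max_iter):
        x_next = ista_step(x, D, y, step, lam)
        if max(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x


# Small illustrative dictionary and observation (made-up values).
D = [[1.0, 0.5], [0.0, 1.0]]
y = [3.0, 0.2]
x_star = solve_fixed_point(D, y, step=0.5, lam=0.5)
```

The returned `x_star` satisfies `ista_step(x_star, ...) ≈ x_star`, which is the equilibrium condition a DEQ layer solves for instead of unrolling a fixed number of updates.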
[91] Label Uncertainty for Ultrasound Segmentation
Malini Shivaram,Gautam Rajendrakumar Gare,Laura Hutchins,Jacob Duplantis,Thomas Deiss,Thales Nogueira Gomes,Thong Tran,Keyur H. Patel,Thomas H Fox,Amita Krishnan,Deva Ramanan,Bennett DeBoisblanc,Ricardo Rodriguez,John Galeotti
Main category: eess.IV
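TL;DR: The paper proposes a new approach that handles label uncertainty in medical image segmentation by incorporating expert-supplied per-pixel confidence values, improving segmentation performance and accuracy on downstream clinical tasks.
Details
Motivation: In medical imaging, inter-observer variability among radiologists (especially in subjective modalities such as lung ultrasound) introduces label uncertainty, and conventional methods that treat annotations as absolute ground truth have limited effectiveness. Contribution: 1) Proposes an annotation protocol that incorporates per-pixel confidence values; 2) shows that using confidence values during training improves segmentation performance; 3) demonstrates the practical value of this improvement on clinical tasks (such as S/F oxygenation ratio estimation and patient readmission prediction).
Method: Designs an annotation protocol that captures radiologists' confidence in each labeled region and uses these values during training. Experiments compare the effect of different confidence thresholds on model performance.
Result: A simple approach that binarizes labels at a 60% confidence threshold works well, clearly outperforming the naive 50% threshold, with gains in both segmentation performance and downstream clinical tasks.
Insight: Label confidence is a valuable signal that, when used properly, can significantly improve the reliability and clinical utility of AI in medical imaging; training on high-confidence labels is more effective.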
Abstract: In medical imaging, inter-observer variability among radiologists often introduces label uncertainty, particularly in modalities where visual interpretation is subjective. Lung ultrasound (LUS) is a prime example-it frequently presents a mixture of highly ambiguous regions and clearly discernible structures, making consistent annotation challenging even for experienced clinicians. In this work, we introduce a novel approach to both labeling and training AI models using expert-supplied, per-pixel confidence values. Rather than treating annotations as absolute ground truth, we design a data annotation protocol that captures the confidence that radiologists have in each labeled region, modeling the inherent aleatoric uncertainty present in real-world clinical data. We demonstrate that incorporating these confidence values during training leads to improved segmentation performance. More importantly, we show that this enhanced segmentation quality translates into better performance on downstream clinically-critical tasks-specifically, estimating S/F oxygenation ratio values, classifying S/F ratio change, and predicting 30-day patient readmission. While we empirically evaluate many methods for exposing the uncertainty to the learning model, we find that a simple approach that trains a model on binarized labels obtained with a (60%) confidence threshold works well. Importantly, high thresholds work far better than a naive approach of a 50% threshold, indicating that training on very confident pixels is far more effective. Our study systematically investigates the impact of training with varying confidence thresholds, comparing not only segmentation metrics but also downstream clinical outcomes. These results suggest that label confidence is a valuable signal that, when properly leveraged, can significantly enhance the reliability and clinical utility of AI in medical imaging.
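The thresholding step mentioned in this abstract (binarizing per-pixel confidence at 60% rather than 50%) can be sketched as below. The function name `binarize_labels`, the nested-list mask layout, and the choice of `>=` at the boundary are assumptions for illustration; the abstract does not specify them.

```python
def binarize_labels(confidence, threshold=0.6):
    """Turn a per-pixel confidence map (values in [0, 1]) into a binary mask.

    Pixels at or above the threshold become foreground (1), the rest background (0).
    """
    return [[1 if c >= threshold else 0 for c in row] for row in confidence]


# Illustrative 2x2 confidence map (made-up values).
confidence_map = [
    [0.55, 0.60],
    [0.90, 0.10],
]
mask_60 = binarize_labels(confidence_map)                  # stricter threshold
mask_50 = binarize_labels(confidence_map, threshold=0.5)   # naive threshold
```

The two masks differ only at the 0.55 pixel, which the stricter threshold excludes from the training target.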
[92] Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset
Alexandra Bernadotte,Elfimov Nikita,Mikhail Shutov,Ivan Menshikov
Main category: eess.IV
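TL;DR: This paper proposes HessNet, a lightweight Hessian-based neural network for 3D segmentation of brain vessels on a small training dataset, with low resource requirements and high accuracy.
Details
Motivation: Brain vessel segmentation is essential in medical imaging, but existing approaches (manual segmentation or classical filters) lack accuracy, and publicly available annotated datasets are scarce. Neural networks are powerful but depend on large amounts of annotated data. Contribution: 1. Proposes the lightweight network HessNet, with only 6000 parameters and able to run on a CPU; 2. Achieves state-of-the-art segmentation accuracy on a small dataset; 3. Creates a semi-manually annotated brain vessel dataset based on the IXI dataset.
Method: Uses a Hessian-assisted semi-supervised learning approach focused on segmenting tubular structures, with a lightweight design (HessNet) that reduces resource requirements.
Result: Achieves high-accuracy vessel segmentation on a minimal training set and builds a semi-manually annotated dataset of 200 MRA images, with efficient annotation.
Insight: Hessian matrices perform well for segmenting multi-scale structures such as vessels; lightweight network design is practically valuable in small-data settings.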
Abstract: Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image segmentation, but their development depends on well-annotated training datasets. However, there is a notable lack of publicly available MRA datasets with detailed brain vessel annotations. To address this gap, we propose a novel semi-supervised learning lightweight neural network with Hessian matrices on board for 3D segmentation of complex structures such as tubular structures, which we named HessNet. The solution is a Hessian-based neural network with only 6000 parameters. HessNet can run on the CPU and significantly reduces the resource requirements for training neural networks. The accuracy of vessel segmentation on a minimal training dataset reaches state-of-the-art results. It helps us create a large, semi-manually annotated brain vessel dataset of brain MRA images based on the IXI dataset (annotated 200 images). Annotation was performed by three experts under the supervision of three neurovascular surgeons after applying HessNet. It provides high accuracy of vessel segmentation and allows experts to focus only on the most complex important cases. The dataset is available at https://git.scinalytics.com/terilat/VesselDatasetPartly.
cs.CR [Back]
[93] Retrieval-Augmented Review Generation for Poisoning Recommender Systems
Shiyi Yang,Xinshu Li,Guanglin Zhou,Chen Wang,Xiwei Xu,Liming Zhu,Lina Yao
Main category: cs.CR
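TL;DR: The paper proposes a new attack framework named RAGAN that generates high-quality fake user reviews through retrieval augmentation, substantially improving the effectiveness of poisoning attacks on recommender systems while maintaining imperceptibility.
Details
Motivation: Recommender systems are vulnerable to data poisoning attacks, but attackers face resource constraints and poor effectiveness when generating high-quality, imperceptible fake reviews. Contribution: Proposes the RAGAN framework, which uses retrieval-augmented multimodal foundation models and a text style transfer strategy to generate high-quality, imperceptible fake user reviews.
Method: Combines in-context learning, a demonstration retrieval algorithm, and a text style transfer strategy; a jailbreaker generates fake user profiles, which are collaboratively optimized to improve attack imperceptibility and transferability.
Result: Experiments on multiple real-world datasets show that RAGAN achieves state-of-the-art poisoning attack performance.
Insight: Multimodal foundation models and retrieval augmentation can significantly increase attack effectiveness and imperceptibility, posing new challenges for recommender system defenses.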
Abstract: Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles often need to remain imperceptible. However, generating such high-quality profiles with the restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility under the practical setting. To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment the naive ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can gain insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized on an instructional agent and a guardian to improve the attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves state-of-the-art poisoning attack performance.
cs.DB [Back]
[94] AmbiSQL: Interactive Ambiguity Detection and Resolution for Text-to-SQL
Zhongjun Ding,Yin Lin,Tianjing Zeng
Main category: cs.DB
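TL;DR: AmbiSQL is an interactive system for detecting and resolving ambiguity in Text-to-SQL; it guides users to clarify their intent through multiple-choice questions, significantly improving SQL generation accuracy.
Details
Motivation: Existing Text-to-SQL systems (especially LLM-based ones) handle ambiguity poorly, which leads to misinterpreted user intent and incorrect SQL. AmbiSQL aims to address this. Contribution: Proposes a fine-grained ambiguity taxonomy and an interactive system that rewrites questions based on user feedback, improving SQL generation accuracy and ambiguity detection precision.
Method: The system first detects ambiguities in the query (based on the taxonomy), then interacts with the user through multiple-choice questions to clarify intent, and finally rewrites the question to generate the correct SQL.
Result: On an ambiguous query dataset, AmbiSQL reaches 87.2% precision in ambiguity detection and, when integrated with Text-to-SQL systems, improves SQL exact match accuracy by 50%.
Insight: Interactive clarification is an effective way to resolve ambiguity in Text-to-SQL, and user feedback is crucial to system performance.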
Abstract: Text-to-SQL systems translate natural language questions into SQL queries, providing substantial value for non-expert users. While large language models (LLMs) show promising results for this task, they remain error-prone. Query ambiguity has been recognized as a major obstacle for LLM-based Text-to-SQL systems, leading to misinterpretation of user intent and inaccurate SQL generation. We demonstrate AmbiSQL, an interactive system that automatically detects query ambiguities and guides users through intuitive multiple-choice questions to clarify their intent. Our approach introduces a fine-grained ambiguity taxonomy for identifying ambiguities that affect database element mapping and LLM reasoning, then incorporates user feedback to rewrite ambiguous questions. Evaluation on an ambiguous query dataset shows that AmbiSQL achieves 87.2% precision in ambiguity detection and improves SQL exact match accuracy by 50% when integrated with Text-to-SQL systems. Our demonstration showcases the significant performance gains and highlights the system’s practical usability. Code repo and demonstration are available at: https://github.com/JustinzjDing/AmbiSQL.
cs.MM [Back]
[95] Robust Symbolic Reasoning for Visual Narratives via Hierarchical and Semantically Normalized Knowledge Graphs
Yi-Chun Chen
Main category: cs.MM
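TL;DR: The paper proposes a semantic normalization framework for hierarchical narrative knowledge graphs that uses lexical similarity and embedding-based clustering to reduce inconsistency and redundancy, improving symbolic reasoning over visual narratives.
Details
Motivation: Understanding visual narratives (such as comics) requires structured representations, but existing symbolic narrative graphs often suffer from inconsistent and redundant annotations that hurt reasoning and generalization, so a method to unify semantic representations is needed. Contribution: Proposes a semantic normalization framework that reduces annotation noise through hierarchical and semantically normalized graphs, improving the robustness and consistency of symbolic reasoning.
Method: Building on cognitive models, uses lexical similarity and embedding clustering to semantically normalize actions and events in narrative graphs, with experiments on the Manga109 dataset.
Result: In preliminary evaluations on narrative reasoning tasks such as action retrieval, character grounding, and event summarization, semantic normalization improves graph coherence and robustness while maintaining symbolic transparency.
Insight: Semantic normalization is a key step toward multimodal narrative understanding and supports scalable, cognitively inspired graph models.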
Abstract: Understanding visual narratives such as comics requires structured representations that capture events, characters, and their relations across multiple levels of story organization. However, symbolic narrative graphs often suffer from inconsistency and redundancy, where similar actions or events are labeled differently across annotations or contexts. Such variance limits the effectiveness of reasoning and generalization. This paper introduces a semantic normalization framework for hierarchical narrative knowledge graphs. Building on cognitively grounded models of narrative comprehension, we propose methods that consolidate semantically related actions and events using lexical similarity and embedding-based clustering. The normalization process reduces annotation noise, aligns symbolic categories across narrative levels, and preserves interpretability. We demonstrate the framework on annotated manga stories from the Manga109 dataset, applying normalization to panel-, event-, and story-level graphs. Preliminary evaluations across narrative reasoning tasks, such as action retrieval, character grounding, and event summarization, show that semantic normalization improves coherence and robustness, while maintaining symbolic transparency. These findings suggest that normalization is a key step toward scalable, cognitively inspired graph models for multimodal narrative understanding.
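The lexical-similarity half of the normalization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy grouping strategy, the 0.75 threshold, the function names, and the example labels are all hypothetical, and the embedding-based clustering step mentioned in the abstract is omitted.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two action labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize_labels(labels, threshold=0.75):
    """Greedily map each label to the first canonical label it resembles.

    Returns a dict {original_label: canonical_label}. The threshold is an
    illustrative value, not one taken from the paper.
    """
    canonical = []  # representative labels found so far
    mapping = {}
    for label in labels:
        for rep in canonical:
            if lexical_similarity(label, rep) >= threshold:
                mapping[label] = rep
                break
        else:
            # No sufficiently similar representative: start a new group.
            canonical.append(label)
            mapping[label] = label
    return mapping


# Hypothetical annotations of the same actions written in different forms.
labels = ["walk", "walks", "walked", "punch", "punches", "smile"]
mapping = normalize_labels(labels)
```

Here the inflected forms collapse onto "walk" and "punch", while "smile" stays in its own group. Purely lexical matching cannot merge synonyms such as "hit" and "punch", which is presumably where an embedding-based step would help.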
[96] adder-viz: Real-Time Visualization Software for Transcoding Event Video
Andrew C. Freeman,Luke Reinkensmeyer
Main category: cs.MM
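TL;DR: This paper introduces an improved version of the adder-viz software for visualizing real-time event video transcoding, with gains in flexibility and speed.
Details
Motivation: Existing representations used in neuromorphic event video research are limited in flexibility and speed; the authors improve their visualization tool to support more efficient real-time processing.
Contribution: The main contribution is an improved adder-viz application that visualizes real-time event video transcode processes and applications in the loop.
Method: Building on the authors' previously proposed unified ADΔER representation, the adder-viz software is improved to visualize transcode processes in real time.
Result: The software is released as open source under the MIT license and can be used directly for event video processing and visualization.
Insight: Real-time performance and flexibility are key directions for event video processing, and open-source tooling helps the community explore the area further.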
Abstract: Recent years have brought about a surge in neuromorphic “event” video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADΔER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.
cs.AI
[97] Don’t Think Twice! Over-Reasoning Impairs Confidence Calibration
Romain Lacombe,Kerrie Wu,Eddie Dilworth
Main category: cs.AI
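TL;DR: The paper finds that over-reasoning impairs confidence calibration in large language models, while search-augmented generation substantially improves it.
Details
Motivation: The study examines how large language models can calibrate confidence on knowledge-intensive tasks and avoid overconfidence.
Contribution: Shows that increasing the reasoning budget impairs calibration and identifies search-augmented generation as the key to improving it.
Method: Using the ClimateX dataset, extended to human and planetary health, the authors evaluate how reasoning capability and budget affect confidence assessment and compare pure reasoning with search-augmented generation.
Result: Recent reasoning LLMs reach 48.7% accuracy in assessing expert confidence, and longer reasoning budgets lead to systematic overconfidence; search-augmented generation reaches 89.3% accuracy.
Insight: Information access, rather than reasoning depth or inference budget, may be the critical bottleneck for confidence calibration.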
Abstract: Large Language Models deployed as question answering tools require robust calibration to avoid overconfidence. We systematically evaluate how reasoning capabilities and budget affect confidence assessment accuracy, using the ClimateX dataset (Lacombe et al., 2023) and expanding it to human and planetary health. Our key finding challenges the “test-time scaling” paradigm: while recent reasoning LLMs achieve 48.7% accuracy in assessing expert confidence, increasing reasoning budgets consistently impairs rather than improves calibration. Extended reasoning leads to systematic overconfidence that worsens with longer thinking budgets, producing diminishing and negative returns beyond modest computational investments. Conversely, search-augmented generation dramatically outperforms pure reasoning, achieving 89.3% accuracy by retrieving relevant evidence. Our results suggest that information access, rather than reasoning depth or inference budget, may be the critical bottleneck for improved confidence calibration of knowledge-intensive tasks.
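To make "accuracy in assessing expert confidence" and "systematic overconfidence" concrete, the sketch below scores model-assigned confidence labels against expert labels. It assumes a four-level ordinal confidence scale and uses invented example data; the scale encoding, the function name, and the signed-bias metric are illustrative assumptions rather than the paper's evaluation code.

```python
# Assumed ordinal encoding of confidence labels (illustrative only).
LEVELS = {"low": 0, "medium": 1, "high": 2, "very high": 3}


def calibration_report(predicted, expert):
    """Return (accuracy, mean signed bias) for paired confidence labels.

    accuracy: fraction of statements where the model's label matches the expert's.
    bias:     mean of (model level - expert level); positive means overconfident.
    """
    pairs = [(LEVELS[p], LEVELS[e]) for p, e in zip(predicted, expert)]
    n = len(pairs)
    accuracy = sum(p == e for p, e in pairs) / n
    bias = sum(p - e for p, e in pairs) / n
    return accuracy, bias


# Invented labels for four statements.
predicted = ["high", "very high", "medium", "high"]
expert = ["high", "high", "medium", "low"]
accuracy, bias = calibration_report(predicted, expert)
# accuracy is 0.5 (two exact matches), bias is 0.75 (overconfident on average)
```

Tracking the signed bias separately from accuracy shows whether errors lean toward overconfidence, which is the failure mode the abstract attributes to longer reasoning budgets.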
[98] DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization
Jinning Yang,Wen Shi
Main category: cs.AI
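TL;DR: DiagECG is a framework that combines time-series and language modeling: it discretizes ECG signals into symbolic tokens so that large language models can process 12-lead ECG signals and perform clinical text generation tasks.
Details
Motivation: Existing automated approaches to cardiovascular diagnostics generalize poorly across clinical tasks and offer limited support for open-ended reasoning. DiagECG aims to address this by letting an LLM handle ECG and text in a unified manner.
Contribution: 1. Proposes an ECG discretization method that converts continuous signals into symbolic tokens; 2. Uses pretraining and instruction tuning so that the LLM can handle both ECG and natural language inputs; 3. Achieves strong performance across tasks and generalizes to out-of-distribution settings.
Method: 1. Discretizes ECG embeddings with a lead-independent encoder and a quantization module; 2. Models temporal dynamics through an autoregressive ECG forecasting pretraining task; 3. Applies instruction tuning on ECG question answering and diagnostic report generation.
Result: Experiments show that DiagECG performs strongly across tasks and generalizes well to out-of-distribution settings.
Insight: Combining symbolic ECG representations with LLMs offers a new approach to medical reasoning and shows the potential of unified cross-modal modeling.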
Abstract: Electrocardiography plays a central role in cardiovascular diagnostics, yet existing automated approaches often struggle to generalize across clinical tasks and offer limited support for open-ended reasoning. We present DiagECG, a novel framework that integrates time-series and language modeling by enabling large language models to process 12-lead ECG signals for clinical text generation tasks. Our approach discretizes continuous ECG embeddings into symbolic tokens using a lead-independent encoder and quantization module. These tokens are then used to extend the vocabulary of the LLM, allowing the model to handle both ECG and natural language inputs in a unified manner. To bridge the modality gap, we pretrain the model on an autoregressive ECG forecasting task, enabling the LLM to model temporal dynamics using its native language modeling capabilities. Finally, we perform instruction tuning on both ECG question answering and diagnostic report generation. Without modifying the core model, DiagECG achieves strong performance across tasks while maintaining generalization to out-of-distribution settings. Extensive experiments demonstrate the effectiveness of each component and highlight the potential of integrating symbolic ECG representations into LLMs for medical reasoning.
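One common way to turn continuous embeddings into discrete tokens is nearest-neighbour lookup in a codebook (vector quantization). The sketch below illustrates that idea only. The abstract does not describe DiagECG's quantization module in detail, so the codebook values, the 2-dimensional embeddings, and the `vocab_offset` of 32000 (standing in for the size of an existing text vocabulary) are hypothetical.

```python
def quantize(embeddings, codebook, vocab_offset=0):
    """Map each embedding to the id of its nearest codebook vector.

    vocab_offset shifts the ids so they sit after an existing text vocabulary,
    mimicking how new tokens could extend a language model's vocabulary.
    """
    tokens = []
    for vec in embeddings:
        # Squared Euclidean distance to every codebook entry.
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        tokens.append(vocab_offset + dists.index(min(dists)))
    return tokens


# Hypothetical 3-entry codebook and four embedding vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
embeddings = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]
tokens = quantize(embeddings, codebook, vocab_offset=32000)
# tokens == [32000, 32001, 32002, 32001]
```

Once a signal is expressed as token ids like these, forecasting the next stretch of the signal becomes ordinary next-token prediction, which is consistent with the autoregressive pretraining task the abstract describes.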
[99] Language-Guided Tuning: Enhancing Numeric Optimization with Textual Feedback
Yuxing Lu,Yucheng Hu,Nan Sun,Xukai Zhao
Main category: cs.AI
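TL;DR: LGT is a configuration optimization framework in which multi-agent large language models (LLMs) reason in natural language, using textual gradients to add semantic understanding; it reports substantial improvements over traditional optimization methods.
Details
Motivation: Configuration optimization in machine learning typically lacks dynamic adaptability and semantic reasoning, and traditional approaches treat the configuration dimensions independently and are not interpretable.
Contribution: Proposes the LGT framework, which uses multi-agent LLMs and textual gradients to optimize configurations with semantic reasoning, improving both performance and interpretability.
Method: LGT coordinates three agents (Advisor, Evaluator, Optimizer) in a self-improving feedback loop that combines numerical optimization with textual feedback.
Result: Evaluated on six datasets, LGT substantially outperforms traditional methods while maintaining high interpretability.
Insight: Textual gradients add a semantic complement to numerical optimization, and the multi-agent framework may extend to more complex optimization tasks.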
Abstract: Configuration optimization remains a critical bottleneck in machine learning, requiring coordinated tuning across model architecture, training strategy, feature engineering, and hyperparameters. Traditional approaches treat these dimensions independently and lack interpretability, while recent automated methods struggle with dynamic adaptability and semantic reasoning about optimization decisions. We introduce Language-Guided Tuning (LGT), a novel framework that employs multi-agent Large Language Models to intelligently optimize configurations through natural language reasoning. We apply textual gradients - qualitative feedback signals that complement numerical optimization by providing semantic understanding of training dynamics and configuration interdependencies. LGT coordinates three specialized agents: an Advisor that proposes configuration changes, an Evaluator that assesses progress, and an Optimizer that refines the decision-making process, creating a self-improving feedback loop. Through comprehensive evaluation on six diverse datasets, LGT demonstrates substantial improvements over traditional optimization methods, achieving performance gains while maintaining high interpretability.
[100] See it. Say it. Sorted: Agentic System for Compositional Diagram Generation
Hantao Zhang,Jingyang Liu,Ed Li
Main category: cs.AI
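TL;DR: The paper proposes a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to convert hand-drawn sketches into precise, editable vector graphics (SVG). The system refines its output through an iterative loop and outperforms current frontier closed-source image generation models.
Details
Motivation: Existing diffusion models struggle with the spatial precision, alignment, and symbolic structure that flowcharts require. To address this, the authors propose a training-free system that combines a VLM and LLMs to generate precise diagrams from sketches.
Contribution: 1. Proposes a training-free agentic system that combines a VLM and LLMs to generate editable SVG programs. 2. Achieves stable improvement through an iterative loop in which a Critic VLM proposes edits, LLMs generate updates, and a Judge VLM selects the best candidate. 3. Validates the method on 10 flowchart sketches, where it outperforms GPT-5 and Gemini-2.5-Pro.
Method: 1. A Critic VLM proposes a small set of qualitative, relational edits. 2. Multiple LLMs generate SVG updates with different strategies (conservative to aggressive, alternative, focused). 3. A Judge VLM selects the best candidate, ensuring stable improvement. The whole process supports human-in-the-loop corrections.
Result: On 10 flowchart sketches, the method reconstructs layout and structure more faithfully than GPT-5 and Gemini-2.5-Pro and composes primitives accurately without inserting unwanted text.
Insight: 1. An agentic system combining a VLM and LLMs can generate precise diagrams from sketches without training. 2. Qualitative reasoning is more stable than numerical estimation and better preserves global constraints such as alignment and connectivity. 3. Programmatic SVG output is easy to integrate into other tools through APIs.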
Abstract: We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
eess.AS
[101] A Chinese Heart Failure Status Speech Database with Universal and Personalised Classification
Yue Pan,Liwei Liu,Changxin Li,Xinyao Wang,Yili Xia,Hanyue Zhang,Ming Chu
Main category: eess.AS
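TL;DR: The paper builds the first Chinese speech database of heart failure patients, confirms that Chinese speech is effective for heart failure detection, and proposes an adaptive frequency filter (AFF) for frequency importance analysis.
Details
Motivation: Speech is a low-cost, non-intrusive data source for heart failure detection, but whether Chinese speech carries heart-failure-related information had not been studied.
Contribution: 1. Builds the first Chinese heart failure speech database; 2. Confirms the effectiveness of Chinese speech for heart failure detection; 3. Proposes a personalised classification approach and an adaptive frequency filter (AFF).
Method: 1. Uses standard ‘patient-wise’ and personalised ‘pair-wise’ classification; 2. Proposes an adaptive frequency filter (AFF) to analyse frequency importance.
Result: Chinese speech proves effective for heart failure detection, and individual differences are a key contributor to inaccuracy.
Insight: The personalised pair-wise approach can serve as an ideal speaker-decoupled baseline for future research, and the adaptive frequency filter offers a new approach to frequency analysis.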
Abstract: Speech is a cost-effective and non-intrusive data source for identifying acute and chronic heart failure (HF). However, there is a lack of research on whether Chinese syllables contain HF-related information, as observed in other well-studied languages. This study presents the first Chinese speech database of HF patients, featuring paired recordings taken before and after hospitalisation. The findings confirm the effectiveness of the Chinese language in HF detection using both standard ‘patient-wise’ and personalised ‘pair-wise’ classification approaches, with the latter serving as an ideal speaker-decoupled baseline for future research. Statistical tests and classification results highlight individual differences as key contributors to inaccuracy. Additionally, an adaptive frequency filter (AFF) is proposed for frequency importance analysis. The data and demonstrations are published at https://github.com/panyue1998/Voice_HF.
cs.HC
[102] “Does the cafe entrance look accessible? Where is the door?” Towards Geospatial AI Agents for Visual Inquiries
Jon E. Froehlich,Jared Hwang,Zeyu Wang,John S. O’Meara,Xia Su,William Huang,Yang Zhang,Alex Fiannaca,Philip Nelson,Shaun Kane
Main category: cs.HC
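TL;DR: The paper proposes the concept of Geo-Visual Agents: multimodal AI agents that answer nuanced visual-spatial questions by analyzing geospatial images (such as streetscapes, place-based photos, and satellite imagery) together with GIS data.
Details
Motivation: Existing interactive digital maps rely on pre-structured GIS data and cannot answer geo-visual questions such as “Does the cafe entrance look accessible? Where is the door?”
Contribution: Presents a vision for Geo-Visual Agents that combine multimodal AI with geospatial imagery to respond more flexibly to visual-spatial questions.
Method: Integrates geospatial images, including streetscapes, place-based photos, and aerial imagery, with traditional GIS data, and uses multimodal AI for analysis and interaction.
Result: The paper provides three exemplars illustrating how such agents could handle visual-spatial questions and discusses key challenges and opportunities for future work.
Insight: Geo-Visual Agents address the limits of traditional GIS data on visual-spatial questions and point to a new direction for geospatial AI applications.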
Abstract: Interactive digital maps have revolutionized how people travel and learn about the world; however, they rely on pre-existing structured data in GIS databases (e.g., road networks, POI indices), limiting their ability to address geo-visual questions related to what the world looks like. We introduce our vision for Geo-Visual Agents–multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos) combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work.
cs.AR
[103] Scalable FPGA Framework for Real-Time Denoising in High-Throughput Imaging: A DRAM-Optimized Pipeline using High-Level Synthesis
Weichien Liao
Main category: cs.AR
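TL;DR: The paper presents a scalable FPGA-based preprocessing pipeline for real-time image denoising, implemented with High-Level Synthesis (HLS) and optimized for DRAM-backed buffering, aimed at high-throughput imaging workflows.
Details
Motivation: High-throughput imaging workflows such as PRISM generate data at rates that exceed conventional real-time processing capabilities, so a low-latency denoising solution is needed.
Contribution: 1. Proposes a DRAM-optimized FPGA pipeline; 2. Implements a denoising kernel that operates directly on streamed image data; 3. Reduces dataset size and the computational load of downstream analysis.
Method: The FPGA pipeline is implemented with HLS, uses burst-mode AXI4 interfaces to minimize latency, and performs frame subtraction and averaging directly on the streamed data.
Result: Validated under PRISM-scale acquisition, the kernel operates below the inter-frame interval, making it suitable for spectroscopy and microscopy imaging workflows.
Insight: Combining an FPGA, HLS, and DRAM-optimized buffering is an efficient route to high-throughput real-time image processing and yields a modular solution for similar settings.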
Abstract: High-throughput imaging workflows, such as Parallel Rapid Imaging with Spectroscopic Mapping (PRISM), generate data at rates that exceed conventional real-time processing capabilities. We present a scalable FPGA-based preprocessing pipeline for real-time denoising, implemented via High-Level Synthesis (HLS) and optimized for DRAM-backed buffering. Our architecture performs frame subtraction and averaging directly on streamed image data, minimizing latency through burst-mode AXI4 interfaces. The resulting kernel operates below the inter-frame interval, enabling inline denoising and reducing dataset size for downstream CPU/GPU analysis. Validated under PRISM-scale acquisition, this modular FPGA framework offers a practical solution for latency-sensitive imaging workflows in spectroscopy and microscopy.
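The dataflow of "frame subtraction and averaging on streamed image data" can be modelled in software as a generator that consumes frames one at a time. This Python sketch is a behavioural model only, not HLS code, and it makes assumptions the abstract does not confirm: that the subtracted term is a fixed background frame, that averaging is over non-overlapping windows, and that frames are flat lists of pixel values.

```python
def denoise_stream(frames, background, window):
    """Yield one averaged, background-subtracted frame per `window` input frames.

    Frames are flat lists of pixel values. Only a single accumulator frame is
    kept in memory, mirroring how a streaming kernel would buffer partial sums.
    """
    acc = [0] * len(background)
    count = 0
    for frame in frames:
        for i, (pixel, bg) in enumerate(zip(frame, background)):
            acc[i] += pixel - bg  # frame subtraction, accumulated in place
        count += 1
        if count == window:
            yield [total / window for total in acc]  # window average
            acc = [0] * len(background)
            count = 0


# Hypothetical 4-pixel frames and background.
background = [10, 10, 10, 10]
frames = [
    [12, 11, 10, 14],
    [14, 13, 10, 12],
    [20, 10, 10, 10],
    [22, 10, 12, 10],
]
averaged = list(denoise_stream(frames, background, window=2))
# averaged == [[3.0, 2.0, 0.0, 3.0], [11.0, 0.0, 1.0, 0.0]]
```

With a window of 2, four input frames become two output frames, which illustrates how averaging reduces the dataset size handed to downstream CPU/GPU analysis.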
cs.LG
[104] Classification errors distort findings in automated speech processing: examples and solutions from child-development research
Lucas Gautheron,Evan Kidd,Anton Malko,Marvin Lavechin,Alejandrina Cristia
Main category: cs.LG
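TL;DR: The paper examines how errors in automated speech classification distort findings in child language development research and proposes a Bayesian calibration approach to reduce the resulting bias.
Details
Motivation: With the spread of wearable recorders, researchers increasingly rely on automated speech analysis, but classification errors in these methods can substantially affect scientific inference. Little has been written on this problem so far.
Contribution: 1) Shows that automated classification errors can significantly distort findings in child language development research; 2) Proposes a Bayesian calibration approach for recovering unbiased effect size estimates.
Method: The authors use a Bayesian approach to study the effect of classification errors on statistical inference, using two widely used speech classifiers as examples: LENA and the Voice Type Classifier from the ACLEW system.
Result: Automated annotations underestimated the negative effect of siblings on adult input by 20–80%, potentially placing it below statistical significance thresholds. Bayesian calibration can be effective but is not a fool-proof solution.
Insight: Any automated tool that involves event detection and classification with non-zero error rates may face similar problems. Bayesian methods offer a promising direction but need further refinement.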
Abstract: With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children’s experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children’s language experience and the association between children’s production and their input. In both the most commonly used LENA, and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20–80%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
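A toy calculation can show how classification errors shrink an estimated group difference. The sketch below is not the paper's Bayesian method: it is a deterministic point-estimate correction that assumes the classifier's recall and false-positive rate are known exactly, and every number in it (segment counts, error rates) is invented for illustration.

```python
def observed_adult_count(true_adult, true_other, recall, false_positive_rate):
    """Expected number of segments the classifier labels as adult speech."""
    return recall * true_adult + false_positive_rate * true_other


def corrected_adult_count(observed, total, recall, false_positive_rate):
    """Invert the error model: observed = (recall - fpr) * adult + fpr * total."""
    return (observed - false_positive_rate * total) / (recall - false_positive_rate)


RECALL, FP_RATE = 0.7, 0.1  # assumed, and assumed known exactly

# Invented true segment counts per recording for two groups of children.
no_siblings = {"adult": 100, "other": 50}
siblings = {"adult": 60, "other": 150}  # more non-adult speech to misclassify

true_effect = siblings["adult"] - no_siblings["adult"]  # -40

obs_no = observed_adult_count(no_siblings["adult"], no_siblings["other"], RECALL, FP_RATE)
obs_sib = observed_adult_count(siblings["adult"], siblings["other"], RECALL, FP_RATE)
observed_effect = obs_sib - obs_no  # about -18: the effect looks much smaller

corrected_effect = (
    corrected_adult_count(obs_sib, 210, RECALL, FP_RATE)
    - corrected_adult_count(obs_no, 150, RECALL, FP_RATE)
)  # about -40: the true effect is recovered under these idealized assumptions
```

In practice error rates are themselves uncertain and may differ between groups, so a point correction like this can be misleading; handling that uncertainty is one motivation for a Bayesian calibration of the kind the abstract describes.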
[105] Intern-S1: A Scientific Multimodal Foundation Model
Lei Bai,Zhongrui Cai,Maosong Cao,Weihan Cao,Chiyu Chen,Haojiong Chen,Kai Chen,Pengcheng Chen,Ying Chen,Yongkang Chen,Yu Cheng,Yu Cheng,Pei Chu,Tao Chu,Erfei Cui,Ganqu Cui,Long Cui,Ziyun Cui,Nianchen Deng,Ning Ding,Nanqin Dong,Peijie Dong,Shihan Dou,Sinan Du,Haodong Duan,Caihua Fan,Ben Gao,Changjiang Gao,Jianfei Gao,Songyang Gao,Yang Gao,Zhangwei Gao,Jiaye Ge,Qiming Ge,Lixin Gu,Yuzhe Gu,Aijia Guo,Qipeng Guo,Xu Guo,Conghui He,Junjun He,Yili Hong,Siyuan Hou,Caiyu Hu,Hanglei Hu,Jucheng Hu,Ming Hu,Zhouqi Hua,Haian Huang,Junhao Huang,Xu Huang,Zixian Huang,Zhe Jiang,Lingkai Kong,Linyang Li,Peiji Li,Pengze Li,Shuaibin Li,Tianbin Li,Wei Li,Yuqiang Li,Dahua Lin,Junyao Lin,Tianyi Lin,Zhishan Lin,Hongwei Liu,Jiangning Liu,Jiyao Liu,Junnan Liu,Kai Liu,Kaiwen Liu,Kuikun Liu,Shichun Liu,Shudong Liu,Wei Liu,Xinyao Liu,Yuhong Liu,Zhan Liu,Yinquan Lu,Haijun Lv,Hongxia Lv,Huijie Lv,Qidang Lv,Ying Lv,Chengqi Lyu,Chenglong Ma,Jianpeng Ma,Ren Ma,Runmin Ma,Runyuan Ma,Xinzhu Ma,Yichuan Ma,Zihan Ma,Sixuan Mi,Junzhi Ning,Wenchang Ning,Xinle Pang,Jiahui Peng,Runyu Peng,Yu Qiao,Jiantao Qiu,Xiaoye Qu,Yuan Qu,Yuchen Ren,Fukai Shang,Wenqi Shao,Junhao Shen,Shuaike Shen,Chunfeng Song,Demin Song,Diping Song,Chenlin Su,Weijie Su,Weigao Sun,Yu Sun,Qian Tan,Cheng Tang,Huanze Tang,Kexian Tang,Shixiang Tang,Jian Tong,Aoran Wang,Bin Wang,Dong Wang,Lintao Wang,Rui Wang,Weiyun Wang,Wenhai Wang,Yi Wang,Ziyi Wang,Ling-I Wu,Wen Wu,Yue Wu,Zijian Wu,Linchen Xiao,Shuhao Xing,Chao Xu,Huihui Xu,Jun Xu,Ruiliang Xu,Wanghan Xu,GanLin Yang,Yuming Yang,Haochen Ye,Jin Ye,Shenglong Ye,Jia Yu,Jiashuo Yu,Jing Yu,Fei Yuan,Bo Zhang,Chao Zhang,Chen Zhang,Hongjie Zhang,Jin Zhang,Qiaosheng Zhang,Qiuyinzhe Zhang,Songyang Zhang,Taolin Zhang,Wenlong Zhang,Wenwei Zhang,Yechen Zhang,Ziyang Zhang,Haiteng Zhao,Qian Zhao,Xiangyu Zhao,Xiangyu Zhao,Bowen Zhou,Dongzhan Zhou,Peiheng Zhou,Yuhao Zhou,Yunhua Zhou,Dongsheng Zhu,Lin Zhu,Yicheng Zou
Main category: cs.LG
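TL;DR: This work presents Intern-S1, a scientific multimodal foundation model that aims to close the performance gap for open-source models in high-value scientific domains. Built on a Mixture-of-Experts (MoE) architecture with large-scale continual pre-training, followed by offline and online reinforcement learning with Mixture-of-Rewards (MoR), Intern-S1 significantly outperforms open-source models on scientific tasks and surpasses closed-source state-of-the-art models on professional tasks.
Details
Motivation: Open-source models underperform in scientific domains. The authors build a high-performance multimodal foundation model to narrow the gap with closed-source models and advance the use of AI in scientific research.
Contribution: 1. Proposes Intern-S1, a multimodal MoE model with scientific expertise, with 28 billion activated parameters and 241 billion total parameters;
2. Proposes the Mixture-of-Rewards (MoR) method for RL training on more than 1000 tasks simultaneously;
3. Achieves substantial gains on scientific tasks, outperforming open-source models and surpassing closed-source state-of-the-art models on professional tasks.
Method: 1. An MoE architecture with 28 billion activated parameters;
2. Continual pre-training on 5T tokens, including over 2.5T tokens from scientific domains;
3. Offline and then online reinforcement learning, using MoR to train on more than 1000 tasks simultaneously.
Result: 1. Competitive performance on general reasoning tasks among open-source models;
2. On scientific tasks such as molecular synthesis planning and reaction condition prediction, significantly outperforms open-source models and surpasses closed-source state-of-the-art models.
Insight: Specialized models for scientific domains need large-scale data and efficient training methods such as MoR, and combining multimodality with an MoE architecture shows promise for complex scientific tasks.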
Abstract: In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving a substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
[106] Probability Density from Latent Diffusion Models for Out-of-Distribution Detection
Joonas Järve,Karl Kaspar Haavel,Meelis Kull
Main category: cs.LG
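TL;DR: The paper studies latent diffusion models for out-of-distribution (OOD) detection. It shows that, under the assumption of uniformly distributed OOD data, the likelihood is the optimal OOD detector, and it tests a likelihood-based detector in representation space.
Details
Motivation: Although the likelihood of a generative model is in theory the optimal OOD score, it often fails in practice. The authors ask whether this failure stems from poor density estimation in pixel space or whether it also occurs in representation space.
Contribution: Shows that the likelihood is the optimal OOD detector under the assumption of uniformly distributed OOD data, and trains a diffusion model in representation space to test likelihood-based OOD detection there.
Method: The authors train a Variational Diffusion Model not on images but on the representation space of a pre-trained ResNet-18, then use the likelihood as the OOD score.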
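Result: The likelihood-based detector in representation space is compared against state-of-the-art methods from the OpenOOD suite; the abstract does not state the outcome of this comparison.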
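Insight: Representation space may be better suited to OOD detection than pixel space because it avoids the difficulty of density estimation in pixel space.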
Abstract: Despite rapid advances in AI, safety remains the main bottleneck to deploying machine-learning systems. A critical safety component is out-of-distribution detection: given an input, decide whether it comes from the same distribution as the training data. In generative models, the most natural OOD score is the data likelihood. Actually, under the assumption of uniformly distributed OOD data, the likelihood is even the optimal OOD detector, as we show in this work. However, earlier work reported that likelihood often fails in practice, raising doubts about its usefulness. We explore whether, in practice, the representation space also suffers from the inability to learn good density estimation for OOD detection, or if it is merely a problem of the pixel space typically used in generative models. To test this, we trained a Variational Diffusion Model not on images, but on the representation space of a pre-trained ResNet-18 to assess the performance of our likelihood-based detector in comparison to state-of-the-art methods from the OpenOOD suite.
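The idea of using a density model's likelihood as an OOD score can be sketched with a much simpler density than the one in the paper. The example below swaps the Variational Diffusion Model for a diagonal Gaussian fitted to feature vectors, and uses tiny invented 2-dimensional "representations" in place of ResNet-18 features; the thresholding rule is likewise an illustrative assumption.

```python
import math


def fit_diag_gaussian(features):
    """Fit per-dimension mean and variance (a stand-in for a learned density)."""
    n, d = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(d)]
    variances = [
        sum((f[j] - means[j]) ** 2 for f in features) / n + 1e-6  # avoid zero variance
        for j in range(d)
    ]
    return means, variances


def log_likelihood(x, means, variances):
    """Log density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xj - m) ** 2 / v)
        for xj, m, v in zip(x, means, variances)
    )


def is_ood(x, model, threshold):
    """Flag inputs whose log-likelihood falls below the threshold."""
    return log_likelihood(x, *model) < threshold


# Invented in-distribution features.
train = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
model = fit_diag_gaussian(train)
# Assumed rule: anything less likely than the least likely training point is OOD.
threshold = min(log_likelihood(f, *model) for f in train)

near = is_ood([0.05, 0.95], model, threshold)  # point close to the training mean
far = is_ood([3.0, -2.0], model, threshold)    # point far from the training data
```

Here `near` is False and `far` is True. The detector is only as good as the density estimate behind it, which is the question the abstract raises when comparing pixel space with representation space.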