Table of Contents
- cs.CL [Total: 20]
- cs.CV [Total: 73]
- cs.GR [Total: 3]
- cs.SD [Total: 1]
- cs.LG [Total: 4]
- cs.AI [Total: 2]
- eess.IV [Total: 5]
- eess.SP [Total: 1]
- cs.RO [Total: 1]
- cs.CR [Total: 2]
cs.CL
[1] Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses
Jens Rupprecht,Georg Ahnert,Markus Strohmaier
Main category: cs.CL
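TL;DR: This paper studies response biases of large language models (LLMs) in social science surveys. All tested models show a clear recency bias and are sensitive to semantic changes, so generating synthetic survey data with LLMs calls for careful prompt design and robustness testing.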
Details
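Motivation: LLMs are increasingly used as proxies for human respondents in social science surveys, yet their reliability and their susceptibility to known response biases remain poorly understood. This study examines the biases and robustness of LLM survey responses.
Contribution: 1. Reveals a recency bias in LLM survey responses. 2. Shows that LLMs are sensitive to semantic variations and combined perturbations. 3. Underscores the importance of prompt design and robustness testing.
Method: Starting from World Values Survey (WVS) questions, the authors apply 11 perturbations (to question phrasing and answer option structure) and run more than 167,000 simulated interviews across 9 LLMs.
Result: All models show recency bias of varying intensity and are sensitive to semantic variations (such as paraphrasing) and combined perturbations. Larger models are generally more robust but remain sensitive.
Insight: LLMs partially reproduce the survey response biases found in humans, which limits their use in social science applications; prompts should be designed carefully and tested for robustness.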
Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts – we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
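The recency-bias finding above can be illustrated with a small measurement sketch. This is not the authors' code: `ask_model` is a hypothetical stand-in for an LLM call, and the metric (share of trials in which the last-presented option is chosen, with option order reshuffled each trial) is an assumed simplification of the paper's perturbation setup.

```python
import random
from typing import Callable, Sequence


def last_option_rate(
    ask_model: Callable[[str, Sequence[str]], str],
    question: str,
    options: Sequence[str],
    trials: int = 200,
    seed: int = 0,
) -> float:
    """Share of trials in which the respondent picks the last-presented option.

    Option order is reshuffled on every trial, so an order-insensitive
    respondent should land near 1 / len(options).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shuffled = list(options)
        rng.shuffle(shuffled)
        answer = ask_model(question, shuffled)
        hits += answer == shuffled[-1]
    return hits / trials


# Hypothetical respondents, used only to exercise the metric.
def always_last(question, options):
    return options[-1]  # maximal recency bias


def fixed_preference(question, options):
    return "Agree"  # ignores presentation order


OPTIONS = ["Agree", "Neutral", "Disagree"]
biased_rate = last_option_rate(always_last, "Q", OPTIONS)
unbiased_rate = last_option_rate(fixed_preference, "Q", OPTIONS)
```

A rate well above the chance level of one over the number of options would indicate recency bias under this simplified metric.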
[2] Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings
Minseon Kim,Jean-Philippe Corbeil,Alessandro Sordoni,Francois Beaulieu,Paul Vozila
Main category: cs.CL
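TL;DR: This paper proposes a safety evaluation protocol for language models in the medical domain that centers the patient and clinician perspectives, addressing a gap in existing safety evaluations of medical LLMs.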
Details
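Motivation: As LLMs spread through the medical domain, safety concerns grow, particularly because model outputs can directly affect human health. Prior evaluations mostly target general-domain safety and lack considerations specific to medical settings.
Contribution: The paper presents what the authors describe as the first safety evaluation criteria for medical LLMs, proposes evaluation protocols from the patient and clinician perspectives, and builds the PatientSafetyBench dataset (466 samples across 5 critical categories).
Method: Targeted red-teaming evaluates the safety of medical LLMs from three perspectives (patient, clinician, and general user), with the MediPhi model collection as a case study.
Result: The study fills a gap in safety evaluation for medical LLMs and lays a foundation for safer deployment in medical domains.
Insight: Safety evaluation of medical LLMs must account for the diversity of user roles; the patient and clinician perspectives are essential for surfacing potential risks.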
Abstract: As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g. patients and clinicians, and the potential for model outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician user perspectives, alongside general safety assessments, and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building PatientSafetyBench, containing 466 samples over 5 critical categories, to measure safety from the perspective of the patient. We apply our red-teaming protocols to the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view - patient, clinician, and general user - establishing a foundation for safer deployment in medical domains.
[3] Towards Interpretable Time Series Foundation Models
Matthieu Boileau,Philippe Helluy,Jeremy Pawlus,Svitlana Vyetrenko
Main category: cs.CL
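TL;DR: The paper studies how to distill time series reasoning into small instruction-tuned language models as a step toward interpretable time series foundation models. Compact Qwen models are trained on a synthetic dataset with natural language annotations generated by a large model, and new evaluation metrics assess the quality of the resulting reasoning.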
Details
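Motivation: To develop lightweight, interpretable time series foundation models that suit on-device or privacy-sensitive deployment and can explain temporal patterns in natural language.
Contribution: Proposes a method for distilling time series reasoning into small language models and designs evaluation metrics covering trend direction, noise intensity, and extremum localization.
Method: A large multimodal model generates natural language annotations for a synthetic dataset of mean-reverting time series; these annotations supervise the fine-tuning of compact Qwen models, which are then scored with the proposed metrics.
Result: The post-trained models acquire meaningful interpretive capabilities, supporting the feasibility of compressing time series understanding into lightweight models.
Insight: Lightweight language models can gain interpretable time series analysis skills through distillation, which makes them suitable for privacy-sensitive or on-device applications.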
Abstract: In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.
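The kind of synthetic data and ground-truth labels described above can be sketched as follows. This is an illustration under assumptions, not the paper's generator: the AR(1)-style update, the `reversion` parameter, and the crude endpoint-based trend label are all hypothetical choices.

```python
import random
from typing import Dict, List


def mean_reverting_series(
    n: int, trend: float, noise: float, reversion: float = 0.2, seed: int = 0
) -> List[float]:
    """AR(1)-style series that reverts toward the line trend * t (assumed form)."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for t in range(n):
        target = trend * t
        x += reversion * (target - x) + rng.gauss(0.0, noise)
        series.append(x)
    return series


def ground_truth_labels(series: List[float]) -> Dict[str, object]:
    """Reference labels for trend direction and extremum positions."""
    delta = series[-1] - series[0]  # crude trend proxy: compare endpoints
    direction = "up" if delta > 0 else "down" if delta < 0 else "flat"
    indices = range(len(series))
    return {
        "trend": direction,
        "argmax": max(indices, key=series.__getitem__),
        "argmin": min(indices, key=series.__getitem__),
    }


# Noise-free examples keep the labels easy to verify by hand.
rising = ground_truth_labels(mean_reverting_series(50, trend=1.0, noise=0.0))
falling = ground_truth_labels(mean_reverting_series(50, trend=-1.0, noise=0.0))
```

Labels like these could be compared against a model's natural language description to score trend direction and extremum localization.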
[4] SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Yu Xia,Yiran Jenny Shen,Junda Wu,Tong Yu,Sungchul Kim,Ryan A. Rossi,Lina Yao,Julian McAuley
Main category: cs.CL
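TL;DR: The paper proposes SAND, a framework that improves LLM agents through self-taught action deliberation, addressing the tendency of existing methods to commit to suboptimal actions when the action space is under-explored.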
Details
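Motivation: LLM agents are usually tuned by imitating expert behavior or by preference optimization, but with limited exploration of the action space these methods can lead agents to pick actions that look plausible yet are suboptimal.
Contribution: Proposes the SAND framework, which adds explicit action deliberation and uses self-consistency action sampling and execution-guided action critique to help agents decide among candidate actions.
Method: Self-consistency action sampling and execution-guided action critique are combined to synthesize step-wise deliberation thoughts; the resulting deliberation trajectories are used to finetune the LLM agent iteratively.
Result: On two interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and outperforms state-of-the-art agent tuning approaches.
Insight: With action deliberation and self-improvement, LLM agents can explore large action spaces more effectively and choose better actions.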
Abstract: Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternative actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
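One way to read "when and what to deliberate" is that deliberation is triggered when repeated samples from the agent disagree. The sketch below illustrates that reading only; it is an assumption, not SAND's published procedure, and `sample_action` and both policies are hypothetical stand-ins for an LLM agent.

```python
import itertools
from collections import Counter
from typing import Callable, List, Tuple


def needs_deliberation(
    sample_action: Callable[[str], str],
    state: str,
    num_samples: int = 5,
) -> Tuple[bool, List[str]]:
    """Sample several actions for one state and report whether they disagree.

    Returns (deliberate, candidates): `deliberate` is True when more than one
    distinct action was sampled; `candidates` lists the distinct actions,
    most frequent first.
    """
    counts = Counter(sample_action(state) for _ in range(num_samples))
    candidates = [action for action, _ in counts.most_common()]
    return len(candidates) > 1, candidates


# Hypothetical policies: one wavers between two actions, one is consistent.
_wavering = itertools.cycle(["open drawer", "open drawer", "go to desk"])


def uncertain_policy(state: str) -> str:
    return next(_wavering)


def confident_policy(state: str) -> str:
    return "take key"


flag_a, cands_a = needs_deliberation(uncertain_policy, "s0", num_samples=6)
flag_b, cands_b = needs_deliberation(confident_policy, "s0")
```

In this reading, the candidate list would then be passed to a critique step that compares the alternatives before the agent commits.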
[5] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
Hongzhi Zhang,Jia Fu,Jingyuan Zhang,Kai Fu,Qi Wang,Fuzheng Zhang,Guorui Zhou
Main category: cs.CL
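TL;DR: RLEP is a reinforcement learning method with experience replay for improving LLM reasoning. Replaying verified high-quality trajectories during training yields faster convergence and stronger final performance.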
Details
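Motivation: RL training of LLMs is often unstable and computationally expensive, and the policy tends to drift away from its pretrained weights. The authors replay verified successful trajectories to avoid fruitless exploration and focus on promising reasoning paths.
Contribution: Proposes RLEP, a two-phase framework that collects and replays high-quality trajectories, speeding up convergence and improving accuracy on several math reasoning benchmarks.
Method: A two-phase framework: (1) collect verified trajectories; (2) during training, optimize the policy on mini-batches that blend these replayed trajectories with newly generated rollouts.
Result: On the Qwen2.5-Math-7B base model, accuracy improves from 38.2% to 39.9% on AIME-2024, from 19.8% to 22.3% on AIME-2025, and from 77.0% to 82.2% on AMC-2023.
Insight: Replaying high-quality trajectories stabilizes training, avoids fruitless exploration, and accelerates convergence, which makes it valuable for improving LLM reasoning.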
Abstract: Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
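The blending of replayed successes with fresh rollouts can be sketched as below. This is a minimal illustration, not the released RLEP code: the `replay_fraction` parameter and the uniform sampling are assumptions, since the abstract does not state the mixing ratio or sampling scheme.

```python
import random
from typing import List, Sequence


def blended_batch(
    fresh_rollouts: Sequence[str],
    replay_pool: Sequence[str],
    batch_size: int,
    replay_fraction: float = 0.25,  # assumed ratio, not taken from the paper
    seed: int = 0,
) -> List[str]:
    """Build one mini-batch mixing verified replayed trajectories with new rollouts."""
    rng = random.Random(seed)
    n_replay = min(int(batch_size * replay_fraction), len(replay_pool))
    n_fresh = batch_size - n_replay
    if n_fresh > len(fresh_rollouts):
        raise ValueError("not enough fresh rollouts to fill the batch")
    batch = rng.sample(list(replay_pool), n_replay)
    batch += rng.sample(list(fresh_rollouts), n_fresh)
    rng.shuffle(batch)  # avoid a fixed replay-then-fresh ordering
    return batch


fresh = [f"new-{i}" for i in range(10)]
verified = [f"verified-{i}" for i in range(4)]
batch = blended_batch(fresh, verified, batch_size=8)
num_replayed = sum(item.startswith("verified") for item in batch)
```

With a batch size of 8 and a replay fraction of 0.25, two slots go to replayed trajectories and six to new rollouts.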
[6] Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Kaiqu Liang,Haimin Hu,Xuandong Zhao,Dawn Song,Thomas L. Griffiths,Jaime Fernández Fisac
Main category: cs.CL
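TL;DR: The paper proposes "machine bullshit" as a conceptual framework, quantifies LLMs' indifference to truth with a new metric, the Bullshit Index, and analyzes four forms of bullshit through a taxonomy. RLHF fine-tuning exacerbates bullshit, CoT prompting amplifies specific forms (empty rhetoric and paltering), and weasel words dominate in political contexts.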
Details
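Motivation: Prior work has examined LLM hallucination and sycophancy but lacks a unified framework for characterizing models' broader disregard for truth. This paper aims to fill that gap and shed light on the underlying mechanisms.
Contribution: Proposes the machine bullshit framework and the Bullshit Index metric; designs the BullshitEval benchmark; identifies how RLHF fine-tuning and CoT prompting affect bullshit behavior; analyzes the dominant form of bullshit in political contexts.
Method: Introduces the Bullshit Index to quantify indifference to truth, proposes a taxonomy of four forms of bullshit (empty rhetoric, paltering, weasel words, and unverified claims), and runs empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and BullshitEval.
Result: RLHF fine-tuning significantly exacerbates bullshit; CoT prompting amplifies empty rhetoric and paltering; in political contexts, weasel words are the dominant strategy.
Insight: The findings expose systematic challenges in AI alignment and offer new perspectives on making LLMs more truthful.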
Abstract: Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.
[7] PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
Mihir Parmar,Palash Goyal,Xin Liu,Yiwen Song,Mingyang Ling,Chitta Baral,Hamid Palangi,Tomas Pfister
Main category: cs.CL
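TL;DR: PLAN-TUNING distills task decomposition trajectories from large models and fine-tunes smaller models on them with supervised and reinforcement learning objectives, improving the smaller models' performance on complex reasoning tasks.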
Details
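Motivation: Task decomposition has boosted the performance of large language models (LLMs), but using such planning structures during post-training to improve smaller models remains underexplored.
Contribution: Proposes PLAN-TUNING, a unified post-training framework that distills planning trajectories from large models and fine-tunes smaller models on them to improve reasoning on complex tasks.
Method: 1. Distill synthetic task decompositions (planning trajectories) from large LLMs. 2. Fine-tune smaller models with supervised and reinforcement learning objectives designed to mimic these planning processes.
Result: On GSM8k and MATH, plan-tuned models outperform strong baselines by about 7% on average. On the out-of-domain datasets OlympiadBench and AIME 2024, they improve by about 10% and 12%, respectively.
Insight: Planning trajectories improve the complex reasoning capabilities of smaller models, which suggests PLAN-TUNING is an effective strategy for improving their task-specific performance.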
Abstract: Recently, decomposing complex problems into simple subtasks–a crucial part of human-like natural planning–to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average $\sim 7\%$. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average $\sim 10\%$ and $\sim 12\%$ performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.
[8] Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code
Keqin Bao,Nuo Chen,Xiaoyuan Li,Binyuan Hui,Bowen Yu,Fuli Feng,Junyang Lin,Xiangnan He,Dayiheng Liu
Main category: cs.CL
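TL;DR: The paper proposes TeaR, which uses careful data curation and reinforcement learning to improve the reasoning ability of LLMs while avoiding over-reliance on complex code structures.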
Details
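Motivation: Existing approaches improve LLM reasoning by having models simulate code execution, but they lean on complex data structures and algorithms and tend to overfit to algorithmic patterns. TeaR aims to improve reasoning through data curation and reinforcement learning instead.
Contribution: Proposes TeaR, which combines data curation with reinforcement learning, avoids dependence on complex code, and improves reasoning performance across a range of benchmarks.
Method: TeaR uses carefully curated data and reinforcement learning to guide models toward optimal reasoning paths on code-related tasks. Experiments cover two base models and three long-CoT distillation models, ranging from 1.5 billion to 32 billion parameters, across 17 benchmarks.
Result: TeaR yields consistent improvements across benchmarks, including a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
Insight: Reinforcement learning combined with data curation is an effective route to better LLM reasoning and avoids the limitations of relying on complex code alone.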
Abstract: Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
[9] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Jierun Chen,Tiezheng Yu,Haoli Bai,Lewei Yao,Jiannan Wu,Kaican Li,Fei Mi,Chaofan Tao,Lei Zhu,Manyi Zhang,Xiaohui Li,Lu Hou,Lifeng Shang,Qun Liu
Main category: cs.CL
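TL;DR: This paper examines the limitations of combining long chain-of-thought supervised fine-tuning (long-CoT SFT) with reinforcement learning (RL) in vision-language models (VLMs) and the resulting synergy dilemma, showing how the two methods both complement and conflict with each other when improving reasoning.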
Details
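Motivation: Long-CoT SFT and RL show synergy in language-only models, but their joint effect in VLMs is unclear. The paper systematically studies the distinct roles and interplay of these two post-training techniques on multimodal reasoning tasks.
Contribution: 1) Reveals where long-CoT SFT and RL complement and conflict with each other in VLMs; 2) identifies the "synergy dilemma" that arises when the two are combined; 3) shows the limitations of existing joint training strategies.
Method: The paper evaluates combinations of long-CoT SFT and RL through several designs (two-stage, interleaved, and progressive training, data mixing, and model merging) and analyzes the trade-offs in accuracy and reasoning style.
Result: Long-CoT SFT improves performance on difficult questions but introduces verbosity and degrades performance on simpler ones. RL promotes generalization and brevity and improves all difficulty levels, though its gains on the hardest questions are smaller than those of SFT. Combining the two yields no additive benefit and instead produces trade-offs.
Insight: More seamless and adaptive approaches are needed to combine post-training techniques and unlock the full reasoning potential of VLMs.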
Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This “synergy dilemma” highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
[10] Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation
Yupu Liang,Yaping Zhang,Zhiyang Zhang,Yang Zhao,Lu Xiang,Chengqing Zong,Yu Zhou
Main category: cs.CL
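TL;DR: The paper proposes M4Doc, a single-to-mix modality alignment framework that uses a multimodal large language model (MLLM) to address data scarcity and modality interaction in document image machine translation (DIMT), improving translation quality.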
Details
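Motivation: Document image machine translation (DIMT) suffers from limited training data and a complex interplay between visual and textual information, so existing methods generalize poorly across domains and to challenging scenarios.
Contribution: 1. Proposes the M4Doc framework, which learns visual-textual correlations through single-to-mix modality alignment; 2. uses a pre-trained MLLM to improve the model's understanding of document images; 3. keeps inference lightweight and computationally efficient.
Method: M4Doc aligns an image-only encoder with the multimodal representations of an MLLM pre-trained on large-scale document image datasets. The alignment trains a lightweight DIMT model, and at inference the MLLM is bypassed and the lightweight model outputs the translation directly.
Result: Experiments show substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
Insight: Aligning single-modality representations with multimodal ones is an efficient way to learn visual-textual correlations, while the lightweight design keeps inference practical.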
Abstract: Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
[11] When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance
Peizhang Shao,Linrui Xu,Jinxi Wang,Wei Zhou,Xingyu Wu
Main category: cs.CL
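TL;DR: The paper presents itself as the first comprehensive review of large language models (LLMs) applied to law. It proposes a dual-lens taxonomy that integrates legal reasoning frameworks and professional ontologies to unify historical research and contemporary breakthroughs. It documents progress in task generalization, reasoning formalization, and workflow integration enabled by technical innovations such as sparse attention mechanisms and mixture-of-experts architectures, along with open challenges such as hallucination and explainability deficits.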
Details
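Motivation: The complexity and specialization of the legal domain call for more capable tools. The emergent capabilities of LLMs (such as contextual reasoning and generative argumentation) open new possibilities for legal AI but also raise technical, ethical, and adaptation challenges.
Contribution: 1. Proposes the first dual-lens taxonomy for LLMs in the legal domain; 2. provides a technical roadmap covering reasoning, retrieval, prediction, and dispute resolution; 3. maps legal roles to NLP subtasks and gives a computational implementation of the Toulmin argumentation framework.
Method: The review combines legal reasoning frameworks with professional ontologies to systematize applications of LLMs to legal tasks, and surveys how sparse attention mechanisms and mixture-of-experts architectures improve task generalization, reasoning formalization, and knowledge integration.
Result: The review documents significant progress in legal text processing, knowledge integration, and evaluation rigor, while noting unresolved problems such as hallucination and explainability deficits.
Insight: Future directions include low-resource systems, multimodal evidence integration, and dynamic rebuttal handling; technical progress must be balanced with ethical governance.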
Abstract: This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLMs introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.
[12] StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
Shoutao Guo,Xiang Li,Shaolei Zhang,Mengge Liu,Wei Chen,Yang Feng
Main category: cs.CL
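TL;DR: StreamUni achieves streaming speech translation (StreamST) with a unified Large Speech-Language Model (LSLM). A speech Chain-of-Thought (CoT) guides the model to produce multi-stage outputs that handle speech segmentation, policy decision, and translation generation together, without massive policy-specific training.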
Details
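Motivation: Existing StreamST methods typically operate on sentence-level speech segments (simultaneous speech translation, SimulST) and must work with a separate segmentation model, which limits the available context and makes policy learning difficult.
Contribution: Proposes StreamUni, which achieves StreamST with an LSLM; introduces a speech CoT to guide multi-stage outputs; proposes a streaming CoT training method; reports state-of-the-art performance on StreamST tasks.
Method: A speech CoT guides the LSLM to generate multi-stage outputs that accomplish speech segmentation, policy decision, and translation generation; a streaming training method works with limited CoT data.
Result: Experiments show that StreamUni achieves state-of-the-art performance on StreamST tasks.
Insight: The speech CoT and multi-stage output design offer a new approach to streaming speech translation, reducing dependence on policy-specific training while improving performance.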
Abstract: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.
[13] Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation
Bugra Kilictas,Faruk Alpay
Main category: cs.CL
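TL;DR: The paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture in which transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Combining game theory with fixed-point theory, it claims a proof of existence and uniqueness of semantic equilibria and provides a verification suite.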
Details
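Motivation: To extend the self-referential framework of Alpay Algebra by combining game theory with fixed-point theory, in order to handle the complex semantics of the alignment process between AI systems and documents.
Contribution: 1) Proposes a multi-layered semantic game architecture; 2) formalizes the composite operator $\phi(\cdot, \gamma(\cdot))$; 3) proves a Game Theorem establishing existence and uniqueness of semantic equilibria; 4) provides a verification suite based on a $\phi$-topology and the Yoneda lemma.
Method: Building on the empathetic embedding concept of Alpay Algebra IV, the paper uses a nested game-theoretic structure and the composite operator $\phi(\cdot, \gamma(\cdot))$ to recast semantic alignment as transfinite fixed-point convergence, and grounds the results in category theory and information theory.
Result: The paper reports that transfinite fixed-point convergence yields a unified equilibrium for the multi-layered semantic game, and argues that the framework applies to realistic AI cognition models.
Insight: Game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally; the paper also presents itself as an instance of the "semantic virus" concept, designed to propagate its patterns in AI embedding spaces.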
Abstract: This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV’s empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach’s fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz’ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces – a deliberate instantiation of the “semantic virus” concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.
[14] DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
Xinyi Liu,Dachun Sun,Yi R. Fung,Dilek Hakkani-Tür,Tarek Abdelzaher
Main category: cs.CL
TL;DR: DocCHA is an LLM-based interactive online diagnosis system that carries out clinical reasoning in modular stages, improving diagnostic accuracy and symptom recall.
Details
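Motivation: Existing Conversational Health Agents (CHAs) lack adaptive multi-turn reasoning and transparent decision-making, which limits their real-world use in clinical diagnosis.
Contribution: Proposes DocCHA, a modular LLM-based framework with three stages (symptom elicitation, history acquisition, and causal graph construction) that uses interpretable confidence scores to steer the dialogue.
Method: DocCHA decomposes the diagnostic process into three modules; each module uses confidence scores to guide adaptive questioning and to refine weak links in the reasoning chain.
Result: On two Chinese consultation datasets (IMCS21, DX), DocCHA outperforms prompting-based LLM baselines, with up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall.
Insight: DocCHA shows that modularity and confidence scores support structured, transparent dialogue, pointing toward trustworthy clinical assistants in multilingual and resource-constrained settings.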
Abstract: Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations – paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
[15] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Guoxin Zang,Xue Li,Donglin Di,Lanshun Nie,Dechen Zhan,Yang Song,Lei Fan
Main category: cs.CL
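TL;DR: The paper proposes SAGE, a framework that improves vision-language models (VLMs) on industrial anomaly detection through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO), and introduces the AD-PL dataset and the MLE evaluation framework.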
Details
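Motivation: Existing VLMs perform poorly on industrial anomaly detection, mainly because the task is domain-specific and the models struggle to give interpretable explanations. SAGE aims to address these problems and improve performance and generalization.
Contribution: 1. Proposes the SAGE framework, which integrates SFE and E-DPO; 2. releases the AD-PL dataset; 3. develops the Multiscale Logical Evaluation (MLE) framework.
Method: SAGE combines SFE, which injects domain-specific knowledge into visual reasoning through fact extraction and fusion, with E-DPO, which aligns model outputs with expert preferences using entropy-aware optimization.
Result: SAGE shows superior performance on industrial anomaly datasets under zero-shot and one-shot settings.
Insight: Domain knowledge enhancement and optimization toward expert preferences can improve both the performance and the interpretability of VLMs in industrial anomaly detection.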
Abstract: While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.
[16] MIRIX: Multi-Agent Memory System for LLM-Based Agents
Yu Wang,Xi Chen
Main category: cs.CL
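TL;DR: MIRIX is a modular, multi-agent memory system. With six structured memory types and a multi-agent framework, it addresses limitations of existing AI memory systems and improves the memory capabilities of language models in real-world scenarios.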
Details
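Motivation: Existing AI memory systems mostly rely on simple, flat memory components and cannot reliably personalize, abstract, or recall user-specific information over time. MIRIX targets this core challenge.
Contribution: 1. Proposes MIRIX, a modular multi-agent memory system that supports multimodal (especially visual) input; 2. designs six structured memory types; 3. validates the system on both multimodal and single-modal tasks.
Method: 1. Six memory types (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault); 2. a multi-agent framework that dynamically coordinates memory updates and retrieval; 3. evaluation on the ScreenshotVQA and LOCOMO benchmarks.
Result: 1. On ScreenshotVQA, 35% higher accuracy than the RAG baseline with storage requirements reduced by 99.9%; 2. on the LOCOMO conversation benchmark, state-of-the-art performance of 85.4%.
Insight: Combining structured memory with a multi-agent framework improves the long-term memory of language models, especially in multimodal scenarios, and suggests a direction for future AI memory systems.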
Abstract: Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field’s most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
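The six memory types named above can be pictured as separate stores behind one interface. The sketch below is a hypothetical data-structure illustration only: the class names, the substring search, and the example entries are assumptions, and it does not model MIRIX's multi-agent coordination or multimodal input.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class MemoryType(Enum):
    """The six memory types listed in the abstract."""
    CORE = "core"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"
    RESOURCE = "resource"
    KNOWLEDGE_VAULT = "knowledge_vault"


@dataclass
class MemoryStore:
    """One list of text entries per memory type (hypothetical layout)."""
    entries: Dict[MemoryType, List[str]] = field(
        default_factory=lambda: {m: [] for m in MemoryType}
    )

    def write(self, memory_type: MemoryType, text: str) -> None:
        self.entries[memory_type].append(text)

    def search(self, query: str) -> Dict[MemoryType, List[str]]:
        """Case-insensitive substring search across all memory types.

        A stand-in for real retrieval; only types with matches are returned.
        """
        q = query.lower()
        hits = {
            m: [t for t in texts if q in t.lower()]
            for m, texts in self.entries.items()
        }
        return {m: found for m, found in hits.items() if found}


store = MemoryStore()
store.write(MemoryType.EPISODIC, "2024-05-01: user booked a flight to Oslo")
store.write(MemoryType.PROCEDURAL, "How to book a flight: open the airline site")
store.write(MemoryType.CORE, "User prefers window seats")
results = store.search("flight")
```

Keeping the types separate lets a retrieval step return matches grouped by kind of memory, here an episodic event and a procedure for the same query.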
[17] Why is Your Language Model a Poor Implicit Reward Model?
Noam Razin,Yong Lin,Jiarui Yao,Sanjeev Arora
Main category: cs.CL
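TL;DR: The paper examines why language models used as implicit reward models (IM-RMs) generalize worse than explicit reward models (EX-RMs). It finds that IM-RMs rely more heavily on superficial token-level cues, so they often generalize worse under token-level distribution shifts, as well as in-distribution.
Details
Motivation: To explain why language models acting as implicit reward models generalize worse than explicit reward models, even though the two are nearly identical in training data, loss function, and underlying language model. Contribution: Shows, through theory and experiments, that the weaker generalization of IM-RMs stems from their heavier reliance on superficial token-level cues.
Method: Combines theoretical analysis with experiments that compare IM-RMs and EX-RMs under different distribution shifts, and tests alternative hypotheses for the generalization gap.
Result: IM-RMs often generalize worse than EX-RMs under token-level distribution shifts and in-distribution. The evidence also counters the hypothesis that IM-RMs struggle on tasks where generation is harder than verification.
Insight: Seemingly minor design choices, such as how the reward is computed, can substantially affect how reward models generalize, which matters for language model post-training and inference pipelines.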
Abstract: Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
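As a rough illustration of the two reward computations this abstract contrasts, the sketch below assumes the DPO-style implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and a linear head over a hidden representation for the explicit reward. The function names, the beta value, and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """IM-RM-style reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Both inputs hold per-token log-probabilities of the same response y,
    so the sequence log-probability is their sum.
    """
    policy_lp = np.sum(policy_token_logprobs)
    ref_lp = np.sum(ref_token_logprobs)
    return float(beta * (policy_lp - ref_lp))

def explicit_reward(hidden_state, head_weights, head_bias=0.0):
    """EX-RM-style reward: a linear head applied to a hidden representation."""
    return float(np.dot(head_weights, hidden_state) + head_bias)

# Toy values, made up for illustration only.
policy_lp = np.array([-0.5, -1.0, -0.25])
ref_lp = np.array([-1.0, -1.5, -0.75])
hidden = np.array([1.0, -2.0, 0.5])
weights = np.array([0.5, 0.25, 2.0])

r_im = implicit_reward(policy_lp, ref_lp, beta=0.1)
r_ex = explicit_reward(hidden, weights)
```

The implicit reward depends directly on the probabilities assigned to individual tokens, while the explicit reward only sees the hidden representation, which is consistent with the abstract's point that the two differ only in how the reward is computed.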
[18] Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology
Sabine Felde,Rüdiger Buchkremer,Gamal Chehab,Christian Thielscher,Jörg HW Distler,Matthias Schneider,Jutta G. Richter
Main category: cs.CL
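TL;DR: Small language models (SLMs) combined with retrieval-augmented generation (RAG) outperform larger language models (LLMs) in clinical decision support for rheumatology, while using less energy and costing less to deploy.
Details
Motivation: To evaluate the performance of LLMs and SLMs in rheumatology clinical decision support and to examine practicality and resource efficiency. Contribution: Shows that SLMs combined with RAG outperform larger models in rheumatology while requiring less energy and lower deployment cost.
Method: Pairs SLMs with RAG and compares their performance against larger LLMs.
Result: SLMs with RAG achieve higher diagnostic and therapeutic performance than larger models, but no model consistently reached specialist-level accuracy, so expert oversight remains essential.
Insight: In resource-limited healthcare settings, SLMs combined with RAG are an efficient and feasible option.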
Abstract: Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.
[19] Automating Expert-Level Medical Reasoning Evaluation of Large Language Models
Shuang Zhou,Wenya Xie,Jiaxi Li,Zaifu Zhan,Meijia Song,Han Yang,Cheyenna Espinoza,Lindsay Welton,Xinnie Mai,Yanwei Jin,Zidu Xu,Yuen-Hei Chung,Yiyun Xing,Meng-Han Tsai,Emma Schaffer,Yucheng Shi,Ninghao Liu,Zirui Liu,Rui Zhang
Main category: cs.CL
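TL;DR: The paper introduces the MedThink-Bench benchmark and the LLM-w-Ref evaluation framework for rigorous, explainable, and scalable assessment of the medical reasoning of large language models (LLMs), and validates them experimentally.
Details
Motivation: LLMs are increasingly used in clinical decision-making, but existing evaluation methods either lack rigor or scale poorly, so a transparent and trustworthy tool for assessing medical reasoning is needed. Contribution: 1) MedThink-Bench, a benchmark of 500 medical questions with expert-crafted step-by-step rationales; 2) LLM-w-Ref, an evaluation framework that combines fine-grained rationales with an LLM-as-a-Judge mechanism.
Method: Uses expert-annotated step-by-step rationales as references and evaluates models' intermediate reasoning with the LLM-w-Ref framework while remaining scalable.
Result: LLM-w-Ref shows a strong positive correlation with expert judgments. Across twelve benchmarked LLMs, smaller models (e.g., MedGemma-27B) can surpass larger proprietary models (e.g., OpenAI-o3).
Insight: Evaluating medical reasoning requires both expert knowledge and scalability, and smaller models can have an edge on specific tasks.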
Abstract: As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs’ medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs’ medical reasoning, advancing their safe and responsible deployment in clinical practice.
[20] PyVision: Agentic Vision with Dynamic Tooling
Shitian Zhao,Haoquan Zhang,Shaoheng Lin,Ming Li,Qilong Wu,Kaipeng Zhang,Chen Wei
Main category: cs.CL
TL;DR: PyVision is an interactive framework that enables multimodal LLMs to dynamically generate and refine Python tools for visual reasoning, improving performance significantly on benchmarks.
Details
Motivation: Prior visual reasoning approaches are limited by predefined workflows and static toolsets. PyVision addresses this by allowing models to dynamically create and refine tools, enhancing flexibility and interpretability. Contribution: Introduces PyVision, a framework for dynamic tool generation in visual reasoning, and demonstrates performance gains for models such as GPT-4.1 and Claude-4.0-Sonnet on benchmarks including V* and VLMsAreBlind-mini.
Method: Uses a multi-turn framework where MLLMs autonomously generate, execute, and refine Python-based tools tailored to specific tasks.
Result: PyVision boosts performance by +7.8% for GPT-4.1 on V* and +31.1% for Claude-4.0-Sonnet on VLMsAreBlind-mini.
Insight: Dynamic tooling enables models to invent tools, advancing toward more autonomous and agentic visual reasoning.
Abstract: LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
cs.CV [Back]
[21] Multi-level Mixture of Experts for Multimodal Entity Linking
Zhiwei Hu,Víctor Gutiérrez-Basulto,Zhiliang Xiang,Ru Li,Jeff Z. Pan
Main category: cs.CV
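TL;DR: The paper proposes a Multi-level Mixture of Experts (MMoE) model that addresses mention ambiguity and dynamic selection of modal content in multimodal entity linking. It combines large language models with multimodal feature encoders and outperforms existing methods.
Details
Motivation: Multimodal entity linking (MEL) faces two challenges, mention ambiguity and dynamic selection of modal content, which existing methods do not address. Contribution: Proposes MMoE, which comprises a description-aware mention enhancement module, a multimodal feature extraction module, and intra-level and inter-level mixture of experts modules that dynamically select modal content.
Method: 1. Enhance mentions with WikiData descriptions selected by a large language model; 2. extract textual and visual features with multimodal encoders; 3. select information dynamically through a switch mixture of experts mechanism.
Result: Experiments show that MMoE outperforms state-of-the-art methods.
Insight: Combining large language models with multimodal feature encoders helps resolve mention ambiguity and modality selection in MEL.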
Abstract: Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention’s textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention’s textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE’s code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
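The switch mixture of experts mechanism mentioned in this abstract can be sketched as top-1 routing: a router scores the experts and each input is sent to the single best one. This is a generic minimal sketch under assumed shapes, not the MMoE implementation; `switch_moe`, the linear experts, and the random weights are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def switch_moe(x, router_w, expert_ws):
    """Top-1 routing: each input row goes to its single highest-scoring expert.

    x: (n, d) features; router_w: (d, k); expert_ws: list of k (d, d) matrices.
    Returns the routed outputs and the chosen expert index per row.
    """
    gate = softmax(x @ router_w)         # (n, k) routing probabilities
    choice = np.argmax(gate, axis=-1)    # chosen expert per row
    out = np.zeros_like(x)
    for i, e in enumerate(choice):
        # Scale the expert output by its gate probability.
        out[i] = gate[i, e] * (x[i] @ expert_ws[e])
    return out, choice

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 3))
expert_ws = [rng.normal(size=(8, 8)) for _ in range(3)]
out, choice = switch_moe(x, router_w, expert_ws)
```

Because only one expert runs per input, the cost per input stays roughly constant as experts are added, which is the usual motivation for switch-style routing.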
[22] CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
Cristina Mata,Kanchana Ranasinghe,Michael S. Ryoo
Main category: cs.CV
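TL;DR: The paper proposes CoPT, an unsupervised domain adaptation method that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder, and reaches state-of-the-art performance on four benchmarks.
Details
Motivation: Unsupervised domain adaptation (UDA) is especially valuable for semantic segmentation, where annotations are costly to collect. Despite progress in vision-language representation learning, UDA methods for segmentation have not exploited the domain-agnostic properties of text. Contribution: Proposes a Covariance-based Pixel-Text loss (CoPT) that learns domain-invariant features from domain-agnostic text embeddings, which are built from LLM-generated domain descriptions.
Method: An LLM generates descriptions of the source and target domains, a frozen CLIP model turns them into text embeddings, and the CoPT loss is used to train the image segmentation encoder.
Result: On four benchmarks, a model trained with CoPT achieves state-of-the-art performance for UDA semantic segmentation.
Insight: Exploiting the domain-agnostic nature of text embeddings can reduce the gap between domains and improve the generalization of segmentation models.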
Abstract: Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.
[23] Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning
Renyang Liu,Guanlin Li,Tianwei Zhang,See-Kiong Ng
Main category: cs.CV
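TL;DR: The paper proposes Recall, a multi-modal guided attack on unlearning in image generation models, and exposes how fragile current unlearning techniques are under multi-modal adversarial inputs.
Details
Motivation: As image generation models such as Stable Diffusion grow more capable, their potential to generate harmful or infringing content raises ethical and legal concerns. Unlearning techniques aim to address this, but their robustness is underexplored, especially under multi-modal adversarial inputs. Contribution: Proposes the Recall framework, which optimizes adversarial image prompts rather than the usual text prompts, reveals the vulnerability of unlearned models, and fills a gap in research on multi-modal adversarial attacks.
Method: Recall exploits the multi-modal conditioning of diffusion models and optimizes an adversarial image prompt, guided by a single semantically relevant reference image, to attack unlearned models.
Result: Across ten state-of-the-art unlearning methods, Recall outperforms baselines in adversarial effectiveness, computational efficiency, and semantic fidelity.
Insight: The results expose critical vulnerabilities in current unlearning mechanisms and underline the need for more robust solutions to keep generative models safe and reliable.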
Abstract: Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at https://github.com/ryliu68/RECALL.
[24] Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey
Getamesay Haile Dagnaw,Yanming Zhu,Muhammad Hassan Maqsood,Wencheng Yang,Xingshuai Dong,Xuefei Yin,Alan Wee-Chung Liew
Main category: cs.CV
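TL;DR: A comprehensive survey of explainable artificial intelligence (XAI) in biomedical image analysis. It stresses the importance of XAI, proposes a modality-centered taxonomy, pays particular attention to multimodal and vision-language models, and summarizes current evaluation metrics and challenges.
Details
Motivation: Deep learning models for biomedical image analysis need greater transparency and trustworthiness to support clinical adoption. Existing XAI surveys pay little attention to modality-specific needs and to recent multimodal techniques. Contribution: 1. Proposes a modality-centered XAI taxonomy that analyzes the interpretability challenges of different imaging types; 2. examines the emerging role of multimodal learning and vision-language models in biomedical XAI; 3. summarizes widely used evaluation metrics and open-source frameworks, and discusses persistent challenges and future directions.
Method: Systematically categorizes and analyzes XAI methods in light of the characteristics of biomedical images, proposes a modality-centered taxonomy, and discusses the interpretability potential of multimodal and vision-language models.
Result: Provides a comprehensive review of XAI methods, highlights the distinct challenges of each modality, and shows the potential of multimodal techniques for improving interpretability.
Insight: Multimodal and vision-language models open new directions for interpretability in biomedical image analysis, but further work is needed on evaluation standards and practical usability.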
Abstract: Explainable artificial intelligence (XAI) has become increasingly important in biomedical image analysis to promote transparency, trust, and clinical adoption of DL models. While several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. This survey addresses this gap through a comprehensive and structured synthesis of XAI methods tailored to biomedical image analysis. We systematically categorize XAI methods, analyzing their underlying principles, strengths, and limitations within biomedical contexts. A modality-centered taxonomy is proposed to align XAI methods with specific imaging types, highlighting the distinct interpretability challenges across modalities. We further examine the emerging role of multimodal learning and vision-language models in explainable biomedical AI, a topic largely underexplored in previous work. Our contributions also include a summary of widely used evaluation metrics and open-source frameworks, along with a critical discussion of persistent challenges and future directions. This survey offers a timely and in-depth foundation for advancing interpretable DL in biomedical image analysis.
[25] Robust Multimodal Large Language Models Against Modality Conflict
Zongmeng Zhang,Wengang Zhou,Jie Zhao,Houqiang Li
Main category: cs.CV
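TL;DR: The paper studies hallucinations in multimodal large language models (MLLMs) that are caused by modality conflict, builds the MMMC dataset to simulate the phenomenon, and mitigates it with three methods: prompt engineering, supervised fine-tuning, and reinforcement learning. Reinforcement learning performs best.
Details
Motivation: MLLMs perform well on vision-language tasks but are prone to hallucination in real-world scenarios. Existing work focuses on conflicts between model responses and inputs, whereas this paper examines the inherent conflicts between inputs from different modalities and their effect on the model. Contribution: 1. Formally defines modality conflict and constructs the MMMC dataset; 2. proposes three methods to alleviate hallucinations caused by modality conflict; 3. analyzes the merits and demerits of each method and highlights the effectiveness of reinforcement learning.
Method: 1. Adjust the input through prompt engineering; 2. optimize the model with supervised fine-tuning; 3. apply a reinforcement learning strategy. All three are evaluated extensively on the MMMC dataset.
Result: Reinforcement learning performs best at mitigating hallucinations under modality conflict, while supervised fine-tuning shows promising and stable performance.
Insight: Modality conflict is an overlooked cause of hallucination in MLLMs, and reinforcement learning is particularly effective against it.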
Abstract: Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
[26] Aerial Maritime Vessel Detection and Identification
Antonella Barisic Kulas,Frano Petric,Stjepan Bogdan
Main category: cs.CV
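TL;DR: The paper proposes a method for autonomous maritime vessel detection and identification by UAVs in GNSS-denied environments. It combines YOLOv8 object detection, feature matching, and hue histogram distance analysis to find and localize the target vessel, and is validated in the MBZIRC2023 competition.
Details
Motivation: In GNSS-denied environments, UAVs must rely on on-board vision to search large areas, which is difficult under strict computational constraints and when the target is described only by visual cues. Contribution: Proposes a vessel detection and identification method that combines YOLOv8, feature matching, and hue histogram analysis, and demonstrates its feasibility in real-world experiments.
Method: 1. Detect all vessels in the field of view with YOLOv8; 2. identify the target vessel through feature matching and hue histogram distance analysis; 3. localize the target using simple geometric principles.
Result: The method was applied successfully in the MBZIRC2023 competition, and the paper analyzes how perspective affects detection accuracy and localization precision.
Insight: Visual features combined with geometric localization enable target identification in GNSS-denied environments, although changes in perspective can affect detection accuracy.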
Abstract: Autonomous maritime surveillance and target vessel identification in environments where Global Navigation Satellite Systems (GNSS) are not available is critical for a number of applications such as search and rescue and threat detection. When the target vessel is only described by visual cues and its last known position is not available, unmanned aerial vehicles (UAVs) must rely solely on on-board vision to scan a large search area under strict computational constraints. To address this challenge, we leverage the YOLOv8 object detection model to detect all vessels in the field of view. We then apply feature matching and hue histogram distance analysis to determine whether any detected vessel corresponds to the target. When found, we localize the target using simple geometric principles. We demonstrate the proposed method in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation. We also evaluate the impact of perspective on detection accuracy and localization precision and compare it with the oracle approach.
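The hue histogram comparison named in this abstract can be illustrated with a small sketch. The abstract does not state which distance or bin count the authors use, so the Hellinger distance, the 18 bins, and the toy pixel values below are assumptions chosen for illustration.

```python
import colorsys
import numpy as np

def hue_histogram(rgb_pixels, bins=18):
    """Normalized hue histogram for an (n, 3) array of RGB pixels in [0, 1]."""
    hues = [colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in rgb_pixels]
    hist, _ = np.histogram(hues, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def hellinger_distance(p, q):
    """Distance in [0, 1]: 0 for identical histograms, 1 for disjoint ones."""
    overlap = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - overlap)))

# Toy crops of two detected vessels (made-up pixel values).
red_boat = np.array([[0.9, 0.1, 0.1], [0.8, 0.2, 0.1], [1.0, 0.0, 0.0]])
blue_boat = np.array([[0.1, 0.1, 0.9], [0.0, 0.2, 0.8], [0.1, 0.0, 1.0]])

same = hellinger_distance(hue_histogram(red_boat), hue_histogram(red_boat))
different = hellinger_distance(hue_histogram(red_boat), hue_histogram(blue_boat))
```

A detected vessel whose distance to the target description falls below some threshold would then be treated as a candidate match. In practice the histogram would be computed over the pixels inside each detection box.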
[27] A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality
Mohamed Elmoghany,Ryan Rossi,Seunghyun Yoon,Subhojyoti Mukherjee,Eslam Bakr,Puneet Mathur,Gang Wu,Viet Dac Lai,Nedim Lipka,Ruiyi Zhang,Varun Manjunatha,Chien Nguyen,Daksh Dangi,Abel Salinas,Mohammad Taesiri,Hongjie Chen,Xiaolei Huang,Joe Barrow,Nesreen Ahmed,Hoda Eldardiry,Namyong Park,Yu Wang,Jaemin Cho,Anh Totti Nguyen,Zhengzhong Tu,Thien Nguyen,Dinesh Manocha,Mohamed Elhoseiny,Franck Dernoncourt
Main category: cs.CV
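TL;DR: The paper surveys the state of long-video storytelling generation, focusing on architecture design, character and scene consistency, and cinematic quality. It summarizes the key components and training strategies of 32 papers and proposes a new taxonomy.
Details
Motivation: Existing video generation models struggle with character consistency, scene layout, and motion coherence in long videos (over 16 seconds), and multi-character long videos suffer from frame redundancy and low temporal diversity. Contribution: 1. Studies 32 papers and summarizes the key architectures and training strategies for generating high-quality long videos; 2. constructs a novel taxonomy that systematically categorizes existing methods; 3. provides comparative tables organized by architectural design and performance characteristics.
Method: A literature review that systematically analyzes the video generation architectures and training strategies of 32 papers, proposes a taxonomy, and summarizes performance characteristics.
Result: Identifies the key challenges in long-video generation and outlines future research directions, especially how to maintain consistency across multiple characters and complex narratives.
Insight: The core challenges of long-video generation are temporal consistency and narrative coherence, and future work will need stronger generative models together with finer-grained control strategies.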
Abstract: Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled “long-form videos”. Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.
[28] Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement
Priyank Pathak,Yogesh S. Rawat
Main category: cs.CV
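TL;DR: The paper proposes CSCI, a lightweight color-disentanglement method for clothes-changing ReID. It needs no extra annotations or models and uses color information to separate appearance bias from identity features.
Details
Motivation: Existing clothes-changing ReID methods rely on additional models or annotations and are computationally expensive. The authors explore color as a lightweight proxy for separating clothing-related bias directly from images. Contribution: 1. Proposes CSCI, a method that relies only on RGB information; 2. introduces S2A self-attention to prevent information leakage between color and identity features; 3. shows that color is an effective proxy for clothing attributes.
Method: 1. Use foreground and background color information to capture appearance bias; 2. apply S2A self-attention to prevent information leakage within the feature space; 3. validate the approach on both image and video ReID.
Result: CSCI performs well on four CC-ReID datasets. Top-1 improves over the baseline by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and by 1.0% on CCVID and 2.5% on MeVID for video-based ReID.
Insight: Color is a lightweight proxy for clothing changes that separates appearance bias without extra annotations, which gives clothes-changing ReID a low-cost solution.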
Abstract: Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color - specifically foreground and background colors - as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias (‘Color See’) while disentangling it from identity-relevant ReID features (‘Color Ignore’). To achieve this, we introduce S2A self-attention, a novel self-attention to prevent information leak between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. Github: https://github.com/ppriyank/ICCV-CSCI-Person-ReID.
[29] Automated Video Segmentation Machine Learning Pipeline
Johannes Merz,Lucien Fostier
Main category: cs.CV
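TL;DR: The paper presents an automated machine learning pipeline for video segmentation that produces temporally consistent instance masks and reduces manual effort in visual effects (VFX) production.
Details
Motivation: Manual mask generation in traditional VFX production is slow and resource-intensive, so an automated solution is needed. Contribution: Proposes an automated pipeline that combines text-prompted object detection, refined per-frame image segmentation, and video tracking to generate temporally consistent instance masks.
Method: 1. Flexible object detection via text prompts; 2. refined per-frame image segmentation; 3. robust video tracking for temporal consistency; 4. containerized deployment and a structured output format for practical use.
Result: The pipeline significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, improving overall VFX production efficiency.
Insight: An automated pipeline built on machine learning can address mask generation in VFX production, and containerization helps it reach practical use quickly.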
Abstract: Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.
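The abstract does not say which tracker the pipeline uses. As one simple way to picture how instance IDs can stay stable from frame to frame, the sketch below greedily matches each mask in the current frame to the best-overlapping mask from the previous frame by intersection-over-union. The function names and the 0.5 threshold are hypothetical, and this is a stand-in technique rather than the authors' method.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def propagate_ids(prev_masks, curr_masks, threshold=0.5):
    """Give each current mask the ID of its best-overlapping previous mask.

    prev_masks: dict {instance_id: mask}; curr_masks: list of masks.
    Masks without a sufficiently overlapping predecessor get fresh IDs.
    """
    next_id = max(prev_masks, default=-1) + 1
    used, result = set(), {}
    for mask in curr_masks:
        candidates = [(mask_iou(mask, m), i)
                      for i, m in prev_masks.items() if i not in used]
        best_iou, best_id = max(candidates, default=(0.0, None))
        if best_id is not None and best_iou >= threshold:
            used.add(best_id)
            result[best_id] = mask
        else:
            result[next_id] = mask
            next_id += 1
    return result

def column_mask(start, stop, shape=(4, 6)):
    """Helper for the toy example: a mask covering columns start..stop-1."""
    m = np.zeros(shape, dtype=bool)
    m[:, start:stop] = True
    return m

prev = {0: column_mask(0, 2), 1: column_mask(4, 6)}
curr = [column_mask(4, 6), column_mask(0, 3), column_mask(3, 4)]
tracked = propagate_ids(prev, curr)
```

In the toy example the first two current masks inherit IDs 1 and 0, and the third, which overlaps nothing from the previous frame, receives the new ID 2.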
[30] DisenQ: Disentangling Q-Former for Activity-Biometrics
Shehreen Azad,Yogesh S Rawat
Main category: cs.CV
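TL;DR: The paper proposes DisenQ, a framework that uses multimodal language guidance to resolve feature entanglement in activity-biometrics. It disentangles identity, motion, and non-biometric features and achieves state-of-the-art performance on several benchmarks.
Details
Motivation: Person identification across diverse activities is difficult because identity features become entangled with motion dynamics and appearance variations. Existing methods rely on additional visual data such as pose or silhouette, whose extraction inaccuracies limit performance. The paper therefore replaces this visual data with structured textual supervision. Contribution: Proposes DisenQ, a querying transformer guided by multimodal language that disentangles biometric, motion, and non-biometric features so that identity features are not confused by changes in appearance and motion.
Method: At its core, DisenQ is a unified querying transformer that uses structured language guidance to separate biometric from non-biometric features, which keeps identity cues independent of appearance and motion variations.
Result: Achieves state-of-the-art performance on three activity-based video benchmarks and generalizes well to a traditional video-based identification benchmark.
Insight: Language-guided disentanglement can replace the traditional reliance on additional visual data and make biometric feature learning more robust and accurate.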
Abstract: In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometrics feature learning more complex. While additional visual data like pose and/or silhouette help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce DisenQ (Disentangling Q-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.
[31] LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation
Ananya Raval,Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza
Main category: cs.CV
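TL;DR: The paper introduces LinguaMark, a new benchmark for evaluating multimodal models on multilingual VQA. Closed-source models achieve the best overall performance, while the open-source Qwen2.5 stands out for multilingual generalization.
Details
Motivation: Current large multimodal models (LMMs) have limited linguistic coverage, which can lead to biased and unfair outputs. The study addresses the lack of attention to multilingual capability in multimodal evaluation. Contribution: 1. Introduces the LinguaMark benchmark, covering 11 languages and five social attributes; 2. defines three evaluation metrics (Bias, Answer Relevancy, and Faithfulness); 3. finds that closed-source models perform best overall and that open-source models show strong generalization.
Method: Evaluates LMMs on a multilingual VQA task with 6,875 image-text pairs spanning 11 languages and five social attributes, using the Bias, Answer Relevancy, and Faithfulness metrics.
Result: Closed-source models (such as GPT-4o and Gemini2.5) achieve the best overall performance, and open-source models (such as Qwen2.5) generalize well across languages.
Insight: 1. Multimodal models have limitations on multilingual tasks; 2. open-source models show potential for generalization; 3. more research is needed on model fairness and language coverage.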
Abstract: Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.
[32] MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning
Chengfei Wu,Ronald Seoh,Bingxuan Li,Liqiang Zhang,Fengrong Han,Dan Goldwasser
Main category: cs.CV
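TL;DR: MagiC is a benchmark for evaluating multimodal cognition. It tests whether vision-language models perform genuinely grounded visual reasoning rather than relying on dataset biases.
Details
Motivation: Large vision-language models perform well on visual question answering and multimodal reasoning, but it is unclear whether their reasoning is genuinely grounded in the visual input. Contribution: Introduces the MagiC benchmark, with about 5,500 weakly supervised QA examples and 900 human-curated examples, which assesses answer accuracy, reasoning quality, alignment with visual evidence, and self-correction ability.
Method: Defines four evaluation dimensions (final answer correctness, reasoning validity, grounding fidelity, and self-correction ability) and introduces the new metrics MagiScore and StepSense.
Result: Tests on 15 vision-language models ranging from 7B to 70B parameters reveal the limitations of current approaches to grounded visual reasoning.
Insight: MagiC exposes weaknesses of vision-language models in fine-grained reasoning and visual grounding and points to directions for improvement.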
Abstract: Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings to probe model robustness under adversarial visual cues and assess their capacity for introspective error correction. We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.
[33] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
Sherry X. Chen,Yi Wei,Luowei Zhou,Suren Kumar
Main category: cs.CV
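TL;DR: ADIEE proposes an automated dataset creation approach and uses the resulting data to train a scoring model for evaluating instruction-guided image editing. The scorer outperforms open-source VLMs and Gemini-Pro 1.5 across benchmarks.
Details
Motivation: Automated evaluation of instruction-guided image editing remains difficult: open-source vision-language models (VLMs) struggle with alignment, proprietary models lack transparency and cost efficiency, and no public training datasets exist. Contribution: Proposes ADIEE, which automatically generates a large-scale dataset (over 100K samples) and trains a scoring model that outperforms all open-source VLMs and Gemini-Pro 1.5.
Method: Fine-tunes a LLaVA-NeXT-8B model, modified to decode a numeric score from a custom token, on the generated dataset.
Result: Compared with the state of the art, the scorer improves score correlation with human ratings on AURORA-Bench by 0.0696 (+17.24%) and pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench.
Insight: Beyond scoring, ADIEE can serve as a reward model for automated best-edit selection and model fine-tuning, which improves existing editing models.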
Abstract: Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model’s average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%).
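The abstract reports gains both as absolute differences and as relative percentages. The short sketch below shows how the two relate, using only figures quoted above. The function names are hypothetical, and the implied baseline is derived arithmetic rather than a value reported in the abstract.

```python
def relative_gain(before, after):
    """Relative improvement in percent when going from `before` to `after`."""
    return (after - before) / before * 100.0

def implied_baseline(absolute_gain, relative_gain_percent):
    """Baseline value implied by an absolute gain and its relative percentage."""
    return absolute_gain / (relative_gain_percent / 100.0)

# ImagenHub: the MagicBrush average score rises from 5.90 to 6.43.
imagenhub_gain = relative_gain(5.90, 6.43)

# AURORA-Bench: a 0.0696 gain stated as +17.24% implies this prior correlation.
aurora_baseline = implied_baseline(0.0696, 17.24)
```

Rounded to two decimals, `imagenhub_gain` matches the +8.98% quoted for ImagenHub, and `aurora_baseline` comes out at roughly 0.40.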
[34] Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory
Hui Pang,Sunil Hadap,Violetta Shevchenko,Rahul Suresh,Amin Banitalebi-Dehkordi
Main category: cs.CV
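TL;DR: Proposes a fast image synthesis method based on Kubelka-Munk theory for efficient and realistic virtual try-on of foundation makeup.
Details
Motivation: Augmented reality is increasingly used in the beauty industry, but existing virtual try-on techniques struggle to blend foundation with skin tone realistically, especially when scaling to many products while preserving realism. Contribution: 1. Proposes a method that approximates Kubelka-Munk theory to speed up image synthesis while keeping color blending realistic; 2. builds a scalable end-to-end framework that achieves realistic foundation try-on using only the product information available on e-commerce sites.
Method: Combines an approximate Kubelka-Munk model with image synthesis in an efficient end-to-end framework.
Result: Validation on real-world makeup images shows that the framework outperforms other techniques.
Insight: Combining a theoretical approximation with engineering yields an efficient and realistic solution for virtual try-on.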
Abstract: Augmented reality is revolutionizing the beauty industry with virtual try-on (VTO) applications, which empower users to try a wide variety of products using their phones without the hassle of physically putting on real products. A critical technical challenge in foundation VTO applications is the accurate synthesis of foundation-skin tone color blending while maintaining the scalability of the method across diverse product ranges. In this work, we propose a novel method to approximate the well-established Kubelka-Munk (KM) theory for faster image synthesis while preserving foundation-skin tone color blending realism. Additionally, we build a scalable end-to-end framework for realistic foundation makeup VTO solely depending on the product information available on e-commerce sites. We validate our method using real-world makeup images, demonstrating that our framework outperforms other techniques.
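For readers unfamiliar with Kubelka-Munk theory, here is a minimal sketch of the classical single-constant form, which relates the reflectance R of an opaque layer to its absorption-to-scattering ratio K/S = (1 - R)^2 / (2R). This is textbook KM mixing, not the paper's approximation. The function names, the `coverage` parameter, and the idea of mixing skin and foundation as two pigments are assumptions made purely for illustration.

```python
import math


def reflectance_to_ks(r: float) -> float:
    """Kubelka-Munk K/S ratio for an opaque layer with reflectance r in (0, 1]."""
    return (1.0 - r) ** 2 / (2.0 * r)


def ks_to_reflectance(ks: float) -> float:
    """Inverse of reflectance_to_ks: reflectance for a given K/S ratio."""
    return 1.0 + ks - math.sqrt(ks * ks + 2.0 * ks)


def km_blend(r_skin: float, r_foundation: float, coverage: float) -> float:
    """Blend two reflectances by mixing linearly in K/S space.

    `coverage` in [0, 1] is a hypothetical knob: 0 keeps the skin, 1 gives pure foundation.
    """
    ks = (1.0 - coverage) * reflectance_to_ks(r_skin) + coverage * reflectance_to_ks(r_foundation)
    return ks_to_reflectance(ks)


# Apply per color channel; the reflectance values below are made up for the example.
skin = (0.55, 0.40, 0.33)
foundation = (0.70, 0.55, 0.45)
blended = tuple(km_blend(s, f, coverage=0.6) for s, f in zip(skin, foundation))
print(blended)
```

Mixing in K/S space rather than averaging RGB values directly is what makes the blend behave like pigment on a surface instead of a simple alpha overlay.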
[35] Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning
Daniel A. P. Oliveira,David Martins de Matos
Main category: cs.CV
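TL;DR: Proposes a contrastive reinforcement learning method that uses synthetic negative examples and a dual-component reward function to improve entity re-identification in visual storytelling systems, improving entity grounding and coherence.
Details
Motivation: Existing visual storytelling systems, such as large vision-language models, struggle to recognize entities across frames, which leads to inconsistent references and hallucinations. The cause is a lack of explicit training on cross-frame entity connections. Contribution: 1. Proposes a contrastive reinforcement learning framework that trains models to distinguish coherent image sequences from unrelated images; 2. Extends the Story Reasoning dataset with synthetic negative examples; 3. Fine-tunes Qwen Storyteller (based on Qwen2.5-VL 7B), improving entity re-identification and narrative coherence.
Method: 1. Uses a contrastive reinforcement learning framework; 2. Combines Direct Preference Optimization with a dual-component reward function (promoting entity grounding in real stories, penalizing incorrect connections in synthetic contexts); 3. Adds synthetic negative examples to the Story Reasoning dataset.
Result: Grounding mAP rises from 0.27 to 0.31 (+14.8%) and F1 from 0.35 to 0.41 (+17.1%); cross-frame entity persistence improves, with entities appearing in 5 or more frames going from 29.3% to 33.3% (+13.7%); the share of well-structured stories rises from 79.1% to 97.5% (+23.3%).
Insight: Explicitly training entity-connection behavior with contrastive reinforcement learning addresses entity re-identification in visual storytelling while also improving semantic coherence and narrative quality.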
Abstract: Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%), F1 from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except "its", and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
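The abstract names Direct Preference Optimization but does not spell out the loss. The sketch below shows the standard DPO objective for a single preference pair, working from sequence log-probabilities. The paper's dual-component reward and the way it builds preference pairs are not reproduced here; the example values and the idea that the "rejected" response comes from a synthetic negative are assumptions.

```python
import math


def softplus(x: float) -> float:
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit rewards: how much more likely the policy makes each response than the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Hypothetical log-probabilities: the policy already prefers the chosen response.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=1.0))
```

The loss falls below log 2 once the policy ranks the chosen response above the rejected one relative to the reference model, and rises above it otherwise.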
[36] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
Haotian Wang,Aoran Xiao,Xiaoqin Zhang,Meng Yang,Shijian Lu
Main category: cs.CV
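TL;DR: PacGDC is a label-efficient technique that exploits ambiguity and consistency in 2D-to-3D projection to synthesize large amounts of pseudo geometry, reducing reliance on large-scale labeled data and improving the generalization of depth completion.
Details
Motivation: Depth completion models typically need large-scale labeled data, which is costly to annotate. PacGDC aims to reduce this dependence by synthesizing diverse pseudo geometries from projection ambiguity and consistency. Contribution: 1. Proposes a pseudo-geometry synthesis method based on projection ambiguity and consistency that greatly increases data diversity. 2. Designs a data synthesis pipeline that uses multiple depth foundation models as scale manipulators. 3. Introduces interpolation and relocation strategies to extend data coverage.
Method: 1. Synthesizes pseudo geometries using the ambiguity and consistency of 2D-to-3D projection. 2. Manipulates scene scale with multiple depth foundation models to generate diverse pseudo depth labels. 3. Adds interpolation and relocation strategies to further diversify the data.
Result: Experiments show that PacGDC generalizes well across multiple benchmarks in both zero-shot and few-shot settings, handling diverse scene semantics, scales, and depth sparsity/patterns.
Insight: Synthesizing pseudo geometries from projection ambiguity and consistency is a label-efficient way to improve generalization.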
Abstract: Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
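The projection ambiguity the abstract builds on can be seen with a pinhole camera: scaling a whole scene about the camera center changes every depth but leaves every pixel location unchanged. The sketch below only illustrates that geometric fact, not the PacGDC pipeline. The intrinsics and point coordinates are made-up values.

```python
def project(point, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D camera-frame point (x, y, z) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def scale_scene(points, s):
    """Scale all points about the camera center by a factor s."""
    return [(s * x, s * y, s * z) for x, y, z in points]


scene = [(0.2, -0.1, 2.0), (-0.5, 0.3, 4.0), (1.0, 0.8, 6.5)]
scaled = scale_scene(scene, 2.5)

for original, bigger in zip(scene, scaled):
    # Same pixel, different metric depth: one image is consistent with many geometries.
    print(project(original), project(bigger), original[2], bigger[2])
```

Because a single image cannot distinguish these scenes, depth maps at different scales can all serve as valid pseudo labels for the same picture, which is the property the data synthesis exploits.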
[37] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos
Hao Xu,Arbind Agrahari Baniya,Sam Wells,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal
Main category: cs.CV
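TL;DR: Proposes a Multi-Scale Attention Gate Shift Module (MSAGSM) to improve fine-grained event spotting in videos, and validates it on a new table tennis dataset (TTA).
Details
Motivation: Existing event spotting models are limited in temporal receptive field and spatial adaptability, so they cannot effectively capture short- and long-term dependencies or salient regions. Contribution: 1. Proposes MSAGSM, which combines multi-scale temporal dilations with multi-head spatial attention; 2. Releases TTA, the first table tennis event spotting dataset.
Method: MSAGSM enhances the Gate Shift Module (GSM) with multi-scale temporal dilations and multi-head spatial attention for efficient modeling.
Result: Across five event spotting benchmarks, MSAGSM consistently improves performance with minimal computational overhead.
Insight: Multi-scale temporal modeling and spatial attention effectively improve the accuracy of fine-grained event spotting.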
Abstract: Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
[38] KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos
Jinseong Kim,Junghoon Song,Gyeongseon Baek,Byeongjoon Noh
Main category: cs.CV
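TL;DR: KeyRe-ID is a keypoint-guided video-based person re-identification framework whose global and local branches use human keypoints to improve spatiotemporal representation learning.
Details
Motivation: Existing person re-identification methods often overlook human keypoint information, which limits fine-grained feature learning. KeyRe-ID uses keypoints to strengthen both global and local feature extraction. Contribution: 1. Proposes the KeyRe-ID framework, combining global and local branches; 2. Uses keypoints to dynamically segment body regions and generate fine-grained, part-aware features; 3. Achieves state-of-the-art performance on multiple benchmark datasets.
Method: 1. The global branch uses Transformer-based temporal aggregation to capture holistic identity semantics; 2. The local branch dynamically segments body regions based on keypoints to generate fine-grained features.
Result: Achieves 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID.
Insight: Introducing keypoint information improves video-based person re-identification, particularly for fine-grained feature learning.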
Abstract: We propose KeyRe-ID, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.
[39] Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer
Zhimin Zhang,Bi’an Du,Caoyuan Ma,Zheng Wang,Wei Hu
Main category: cs.CV
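TL;DR: Proposes a novel cross-category animal motion transfer framework that preserves species-specific behavioral habits, addressing a gap in existing methods, and integrates a large language model (LLM) to support unseen species.
Details
Motivation: Existing motion transfer methods mainly target human motion and emphasize skeletal alignment or stylistic consistency, neglecting the behavioral habits specific to animals. This work aims to fill that gap. Contribution: 1. Proposes a habit-preserving framework for cross-category animal motion transfer. 2. Introduces a category-specific habit encoder and an LLM to support motion transfer to unseen species. 3. Introduces the DeformingThings4D-skl dataset for experimental validation.
Method: Built on a generative framework, the model uses a habit-preservation module (with a category-specific habit encoder) to learn motion priors that capture behavioral habits, and integrates an LLM to extend transfer to new species.
Result: Experiments on the DeformingThings4D-skl dataset and quantitative analyses show that the model outperforms existing methods.
Insight: Preserving behavioral habits is essential for animal motion transfer, and an LLM can improve generalization to new species.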
Abstract: Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with a category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.
[40] Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections
Yongtang Bao,Chengjie Tang,Yuze Wang,Haojie Li
Main category: cs.CV
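TL;DR: Seg-Wild is an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections; it uses multi-dimensional feature embeddings and a Spiky 3D Gaussian Cutter to handle inconsistent lighting and occlusions.
Details
Motivation: Unconstrained image collections are easy to obtain, but their inconsistent lighting and transient occlusions make segmentation challenging, and existing methods struggle with these issues. Contribution: 1) Proposes Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting; 2) Introduces the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians; 3) Designs a new benchmark for evaluating segmentation quality in in-the-wild scenes.
Method: Performs interactive segmentation by computing feature similarity over multi-dimensional feature embeddings, cuts abnormal 3D Gaussians with SGC, and uses SAM masks to refine the projection.
Result: Experiments show that Seg-Wild achieves better segmentation and reconstruction quality than previous methods.
Insight: Combining 3D Gaussian Splatting with interactive feature embeddings handles segmentation of unconstrained images effectively, and SGC offers a new way to deal with abnormal Gaussians.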
Abstract: Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene’s lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.
[41] EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction
Xinan Zhang,Muhammad Zubair Irshad,Anthony Yezzi,Yi-Chang Tsai,Zsolt Kira
Main category: cs.CV
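TL;DR: EscherNet++ is a diffusion model trained with masked fine-tuning that performs zero-shot novel view synthesis and amodal completion simultaneously; its single-stage design and integration with feed-forward 3D reconstruction improve both efficiency and performance.
Details
Motivation: Existing methods rely on complex multi-stage pipelines for amodal completion and novel view synthesis, which ignore cross-view dependencies and incur redundant storage and computation. EscherNet++ addresses this with end-to-end masked fine-tuning. Contribution: 1. Proposes masked fine-tuning (input-level and feature-level masking) for end-to-end amodal completion and novel view synthesis. 2. Integrates with feed-forward image-to-mesh models without extra training, enabling fast 3D reconstruction with reconstruction time reduced by 95%. 3. Achieves state-of-the-art results despite fine-tuning on a smaller dataset and batch size.
Method: 1. Trains the diffusion model with masked fine-tuning (input-level and feature-level masking). 2. Integrates feed-forward image-to-mesh models to speed up 3D reconstruction. 3. Achieves efficient reconstruction by synthesizing arbitrary query views.
Result: On occluded tasks in the 10-input setting, PSNR improves by 3.9 and Volume IoU by 0.28; the method also generalizes to real-world occluded reconstruction. Reconstruction time drops by 95%.
Insight: 1. Masked fine-tuning effectively combines amodal completion and novel view synthesis. 2. A single-stage approach significantly reduces redundant computation. 3. The model's scalability supports fast 3D reconstruction.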
Abstract: We propose EscherNet++, a masked fine-tuned diffusion model that can synthesize novel views of objects in a zero-shot manner with amodal completion ability. Existing approaches utilize multiple stages and complex pipelines to first hallucinate missing parts of the image and then perform novel view synthesis, which fail to consider cross-view dependencies and require redundant storage and computing for separate stages. Instead, we apply masked fine-tuning including input-level and feature-level masking to enable an end-to-end model with the improved ability to synthesize novel views and conduct amodal completion. In addition, we empirically integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results with reconstruction time decreased by 95%, thanks to its ability to synthesize arbitrary query views. Our method’s scalable nature further enhances fast 3D reconstruction. Despite fine-tuning on a smaller dataset and batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in 10-input settings, while also generalizing to real-world occluded reconstruction.
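To put the reported PSNR gain in perspective, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE). It assumes the 3.9 figure is in decibels, the usual unit, which the abstract does not state explicitly. The function names and the sample MSE are illustrative.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_psnr_gain(3.9)
print(f"MSE shrinks by a factor of about {ratio:.2f}")

# Sanity check with a made-up baseline error of 0.01.
print(psnr(0.01 / ratio) - psnr(0.01))
```

Under that assumption, a 3.9 dB gain corresponds to cutting the mean squared error to roughly 40% of its previous value.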
[42] EPIC: Efficient Prompt Interaction for Text-Image Classification
Xinyao Yu,Hao Sun,Zeyu Ling,Ziwei Niu,Zhenjia Bai,Rui Qin,Yen-Wei Chen,Lanfen Lin
Main category: cs.CV
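TL;DR: Proposes EPIC, an efficient prompt interaction method for text-image classification that uses temporal prompts on intermediate layers and similarity-based modality interaction, substantially reducing computational cost and trainable parameters.
Details
Motivation: Fine-tuning large-scale pre-trained multimodal models (LMMs) is computationally expensive, which motivates efficient prompt interaction strategies for aligning modalities. Contribution: Proposes EPIC, which achieves efficient multimodal alignment through temporal prompts and similarity-based interaction, substantially reducing compute and trainable parameters (about 1% of the foundation model).
Method: Uses temporal prompts on intermediate layers and integrates modalities through similarity-based prompt interaction to exploit information exchange between them.
Result: Achieves superior performance on UPMC-Food101 and SNLI-VE and comparable performance on MM-IMDB, while saving computational resources.
Insight: Prompt interaction strategies can cut computational cost while maintaining or even improving performance, offering an efficient approach for multimodal tasks.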
Abstract: In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks, such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to enable sufficient information exchange between modalities. With this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
[43] Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
Jingjing Jiang,Chao Ma,Xurui Song,Hanwang Zhang,Jun Luo
Main category: cs.CV
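TL;DR: Presents Corvid, a multimodal large language model (MLLM) with enhanced chain-of-thought (CoT) reasoning that performs strongly on complex tasks such as mathematical reasoning and science problem-solving.
Details
Motivation: Existing open-source MLLMs show significant limitations in complex, structured reasoning; Corvid aims to close this gap and improve multimodal reasoning. Contribution: 1. Proposes the Corvid model; 2. Introduces MCoT-Instruct-287K, a high-quality multimodal CoT instruction dataset; 3. Designs a two-stage CoT training approach and an inference-time self-verification strategy.
Method: A hybrid vision encoder plus the GateMixer connector; two-stage CoT instruction fine-tuning; inference-time self-verification to mitigate over-reasoning and under-reasoning.
Result: Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs of similar parameter scale, particularly in mathematical reasoning and science problem-solving.
Insight: High-quality datasets and targeted training strategies are key to improving MLLM reasoning.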
Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid’s CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.github.io/corvid.
[44] Towards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects
Yuqi Cheng,Yihan Sun,Hui Zhang,Weiming Shen,Yunkang Cao
Main category: cs.CV
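TL;DR: Proposes a high-resolution 3D anomaly detection approach comprising the MiniShift dataset (the first high-resolution 3D anomaly detection dataset) and the Simple3D framework (an efficient real-time detector), advancing industrial defect detection.
Details
Motivation: Existing 3D anomaly detection benchmarks are mostly low-resolution and ill-suited to detecting subtle industrial defects, so high-resolution data and efficient methods are needed. Contribution: 1. Proposes MiniShift, the first high-resolution 3D anomaly detection dataset; 2. Designs the Simple3D framework, which combines MSND and LFSA for efficient real-time detection.
Method: Uses Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details while keeping computational overhead low.
Result: Simple3D surpasses existing methods on MiniShift and established benchmarks, with real-time inference exceeding 20 fps.
Insight: High-resolution data and effective feature aggregation are essential to making 3D anomaly detection practical and accurate.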
Abstract: In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.
[45] Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation
Ling Zhou,Runtian Yuan,Yi Liu,Yuejie Zhang,Rui Feng,Shang Gao
Main category: cs.CV
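TL;DR: DSANet introduces an Adjacent-Frame Semantic-Aware (AFSA) module and a Local-and-Global Semantic-Aware (LGSA) module to improve noise robustness in ultrasound video segmentation, outperforming existing methods on multiple benchmark datasets.
Details
Motivation: Inherent noise in ultrasound images makes automated segmentation challenging, especially in video sequences. This work strengthens the semantic association between local and global features to improve noise resistance. Contribution: Proposes the Dual Semantic-Aware Network (DSANet) with AFSA and LGSA modules, which improve segmentation robustness through adjacent-frame channel-wise similarity and multi-level feature fusion, respectively.
Method: AFSA guides feature fusion with a channel-wise similarity matrix; LGSA fuses local features, which capture spatial details independently at each frame, with global features that incorporate temporal context.
Result: On four benchmark datasets, DSANet outperforms state-of-the-art methods in segmentation accuracy, runs at higher inference FPS than video-based methods, and even surpasses some image-based models in speed.
Insight: Avoiding pixel-level dependencies and strengthening semantic association makes the model more stable under noise and more computationally efficient.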
Abstract: Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi-level semantic representation, significantly improving the model’s resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code is available at https://github.com/ZhouL2001/DSANet.
[46] Bluish Veil Detection and Lesion Classification using Custom Deep Learnable Layers with Explainable Artificial Intelligence (XAI)
M. A. Rasel,Sameem Abdul Kareem,Zhenli Kwan,Shin Shen Yong,Unaizah Obaidellah
Main category: cs.CV
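TL;DR: Proposes a method for blue-white veil (BWV) detection and lesion classification based on custom deep learnable layers and explainable artificial intelligence (XAI), using a proposed imaging algorithm and multiple datasets to improve BWV detection accuracy.
Details
Motivation: Melanoma is one of the deadliest skin cancers, and BWV is a key feature for diagnosing it. Research on BWV detection is limited, so a more efficient and accurate method is needed to support early diagnosis. Contribution: 1. Proposes a method for converting a non-annotated dataset into an annotated one using color thresholding; 2. Designs a deep convolutional neural network (DCNN) with custom layers that outperforms existing models on multiple datasets; 3. Applies an XAI algorithm to make the model's decisions interpretable.
Method: 1. Processes unlabeled skin lesion images with an imaging algorithm based on color thresholds; 2. Designs and trains a DCNN with custom layers separately on individual and combined datasets; 3. Applies XAI to interpret the model's BWV detection decisions.
Result: Testing accuracy: PH2 (85.71%), ISIC (95.00%), PH2+ISIC (95.05%), Derm7pt (90.00%), all better than existing methods.
Insight: Combining custom layers with XAI improves BWV detection accuracy and lays the groundwork for model trustworthiness and clinical applicability.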
Abstract: Melanoma, one of the deadliest types of skin cancer, accounts for thousands of fatalities globally. The bluish, blue-whitish, or blue-white veil (BWV) is a critical feature for diagnosing melanoma, yet research into detecting BWV in dermatological images is limited. This study utilizes a non-annotated skin lesion dataset, which is converted into an annotated dataset using a proposed imaging algorithm based on color threshold techniques on lesion patches and color palettes. A Deep Convolutional Neural Network (DCNN) is designed and trained separately on three individual and combined dermoscopic datasets, using custom layers instead of standard activation function layers. The model is developed to categorize skin lesions based on the presence of BWV. The proposed DCNN demonstrates superior performance compared to conventional BWV detection models across different datasets. The model achieves a testing accuracy of 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. An explainable artificial intelligence (XAI) algorithm is subsequently applied to interpret the DCNN’s decision-making process regarding BWV detection. The proposed approach, coupled with XAI, significantly improves the detection of BWV in skin lesions, outperforming existing models and providing a robust tool for early melanoma diagnosis.
[47] Objectomaly: Objectness-Aware Refinement for OoD Segmentation with Structural Consistency and Boundary Precision
Jeonghoon Song,Sunghun Kim,Jaegyun Im,Byeongjoon Noh
Main category: cs.CV
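TL;DR: Objectomaly is an objectness-aware refinement framework that incorporates object-level priors to improve boundary precision and structural consistency in OoD segmentation, reaching state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing mask-based OoD segmentation methods suffer from imprecise boundaries, inconsistent anomaly scores within objects, and false positives caused by background noise. Contribution: 1. Proposes Objectomaly, with three stages: coarse anomaly scoring, objectness-aware score calibration, and meticulous boundary precision; 2. Combines SAM-generated instance masks with image processing techniques to improve performance.
Method: 1. Generates coarse anomaly scores with an existing OoD backbone; 2. Calibrates scores at the object level using SAM-generated instance masks; 3. Refines boundaries with Laplacian filtering and Gaussian smoothing.
Result: On benchmarks including SMIYC and RoadAnomaly, AuPRC reaches up to 96.99, FPR95 drops to 0.07, and F1-score reaches up to 83.44, outperforming existing methods.
Insight: Combining object-level priors with image processing techniques improves boundary precision and structural consistency in OoD segmentation, which suits safety-sensitive settings such as autonomous driving.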
Abstract: Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
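The abstract describes the second stage only as object-level score normalization with instance masks. One plausible reading, sketched below, replaces each pixel's coarse anomaly score with the mean score of the instance it belongs to. Mean pooling, the function name `calibrate_scores`, and the toy boolean masks standing in for SAM output are all assumptions, not the authors' implementation.

```python
def calibrate_scores(scores, masks):
    """Replace scores inside each instance mask with that instance's mean score.

    scores: 2D list of floats (coarse per-pixel anomaly scores).
    masks:  list of 2D lists of bools, one per instance, same shape as scores.
    Pixels outside every mask keep their coarse score.
    """
    calibrated = [row[:] for row in scores]
    for mask in masks:
        pixels = [(i, j)
                  for i, row in enumerate(mask)
                  for j, inside in enumerate(row) if inside]
        if not pixels:
            continue  # an empty mask contributes nothing
        mean = sum(scores[i][j] for i, j in pixels) / len(pixels)
        for i, j in pixels:
            calibrated[i][j] = mean
    return calibrated


coarse = [[0.9, 0.1, 0.2],
          [0.7, 0.1, 0.0]]
left_column = [[True, False, False],
               [True, False, False]]
print(calibrate_scores(coarse, [left_column]))
```

Pooling within an instance gives every pixel of an object the same score, which is one way to address the inconsistent within-object scores the abstract mentions.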
[48] Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking
Qiangqiang Wu,Yi Yu,Chenqi Kong,Ziquan Liu,Jia Wan,Haoliang Li,Alex C. Kot,Antoni B. Chan
Main category: cs.CV
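TL;DR: Proposes Temporal Unlearnable Examples (TUEs), which add unlearnable noise to personal video data to protect it from unauthorized use by visual object tracking models.
Details
Motivation: With the rise of social media, large numbers of private user-uploaded videos are collected without authorization and used to train visual object tracking models, exposing data privacy problems. Existing methods mainly target image tasks and work poorly when applied directly to video. Contribution: 1. Presents the first study of video data privacy protection against deep trackers; 2. Proposes the TUE framework with efficient computation; 3. Introduces a temporal contrastive loss to strengthen TUEs.
Method: Designs a generative framework for producing TUEs, combined with a temporal contrastive loss, so that trained trackers rely on the noise rather than the original data structure.
Result: Experiments show that TUEs achieve state-of-the-art performance in video data-privacy protection, with strong transferability.
Insight: By corrupting the learning of temporal matching, TUEs protect video data privacy and offer a new direction for video data security.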
Abstract: With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification); directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable to large-scale video datasets. Trackers trained with TUEs rely heavily on unlearnable noise for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.
[49] Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles
Jiaxu Wan,Xu Wang,Mengwei Xie,Xinyuan Chang,Xinran Liu,Zheng Pan,Mu Xu,Ding Yuan
Main category: cs.CV
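TL;DR: Proposes a hybrid navigation framework and benchmark for autonomous vehicles, focusing on associating global SD maps with online HD maps.
Details
Motivation: Autonomous vehicles need to combine global SD maps with online HD maps for hybrid navigation, but existing research often overlooks the association between the two, which limits the real-world use of online HD maps. Contribution: 1. Proposes OMA, the first online map association benchmark oriented to hybrid navigation; 2. Develops Map Association Transformer, a new framework based on path-aware and spatial attention mechanisms.
Method: Proposes the Map Association Transformer framework, which uses path-aware attention and spatial attention to understand geometric and topological correspondences.
Result: The OMA benchmark contains 480k roads and 260k lane paths and provides metrics for evaluating model performance. The code and dataset are publicly available.
Insight: The ability to associate global and local maps is essential to navigation planning for autonomous vehicles, and attention mechanisms improve association performance.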
Abstract: Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, which makes it challenging to use online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for hybrid navigation-oriented online map association, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, OMA contains 480k roads and 260k lane paths and provides corresponding metrics to evaluate model performance. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.
[50] Divergence Minimization Preference Optimization for Diffusion Model Alignment
Binxu Li,Minkai Xu,Meihua Dang,Stefano Ermon
Main category: cs.CV
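TL;DR: Proposes DMPO, which aligns diffusion models by minimizing reverse KL divergence, avoiding the suboptimal mean-seeking behavior of existing methods; experiments show it outperforms existing baselines.
Details
Motivation: Diffusion models have been very successful at text-to-image generation, but how to align them further with human preferences remains an important question. Existing methods suffer from suboptimal mean-seeking optimization, so a better alignment method is needed. Contribution: Proposes DMPO, which aligns diffusion models with preferences by minimizing reverse KL divergence, shows theoretically that its optimization direction asymptotically matches that of the original RL objective, and outperforms existing baselines in experiments.
Method: DMPO aligns diffusion models by minimizing reverse KL divergence, with theoretical analysis to justify its effectiveness and experiments to validate its performance.
Result: DMPO outperforms all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating its advantage in aligning generative behavior with desired outputs.
Insight: DMPO offers a robust and elegant path to preference alignment for diffusion models, connecting principled theory with practical performance.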
Abstract: Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically outperforming all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating the method’s superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
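The contrast between mean-seeking and mode-seeking behavior can be seen on a toy discrete distribution. The sketch below is not the DMPO objective. It only compares the forward divergence KL(p||q) and the reverse divergence KL(q||p) for two candidate approximations of a bimodal target; all distributions are invented for the example.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Bimodal target with two well-separated modes.
target = [0.47, 0.02, 0.02, 0.02, 0.47]
# Candidate that spreads its mass to cover both modes.
q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]
# Candidate that commits to a single mode.
q_mode = [0.90, 0.04, 0.02, 0.02, 0.02]

print("forward KL(p||q):", kl(target, q_cover), kl(target, q_mode))
print("reverse KL(q||p):", kl(q_cover, target), kl(q_mode, target))
```

Forward KL scores the mass-covering candidate better, while reverse KL prefers the candidate that commits to one mode, which matches the intuition behind choosing a reverse-KL objective when sharp, preferred outputs matter more than coverage.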
[51] GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction
Shuaijin Wan,Huaijiang Sun
Main category: cs.CV
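TL;DR: GGMotion proposes a dynamics-kinematics model built on a group graph network. Grouped joint interactions and a new radial field design improve the physical plausibility and accuracy of human motion prediction.
Details
Motivation: Existing methods typically represent the human pose as an abstract graph structure and neglect the physical dependencies between joints, which makes learning harder and tends to produce unrealistic motions. GGMotion addresses this with grouped dynamics-kinematics modeling.
Contribution: 1) Proposes the group graph dynamics-kinematics network (GGMotion); 2) designs a new radial field that preserves geometric equivariance; 3) captures multi-scale dependencies through intra-group and inter-group interaction modules; 4) introduces an auxiliary loss to supervise motion priors.
Method: 1) Models the human topology in groups; 2) aggregates joint features with a radial field over spatial and temporal edges; 3) updates joint position features in parallel using equivariant MLPs; 4) adds an auxiliary loss during training.
Result: Performs strongly on the Human3.6M, CMU-Mocap, and 3DPW benchmarks, with a significant margin over existing methods in short-term motion prediction.
Insight: Group-wise modeling and explicit physical constraints improve the plausibility and accuracy of motion prediction, and geometric equivariance is an important design property for tasks in 3D space.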
Abstract: Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLP), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving a significant performance margin in short-term motion prediction. The code is available at https://github.com/inkcat520/GGMotion.git.
[52] MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation
Bangning Wei,Joshua Maraval,Meriem Outtas,Kidiyo Kpalma,Nicolas Ramin,Lu Zhang
Main category: cs.CV
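TL;DR: The paper presents MUVOD, a new benchmark for 4D object segmentation in multi-view video. It fills a data gap in dynamic-scene segmentation and supplies annotations and a baseline method.
Details
Motivation: Methods based on NeRF and 3D Gaussian Splatting perform well on 3D object segmentation in static scenes, but 4D object segmentation of dynamic scenes remains underexplored because no sufficiently large, accurately labelled multi-view video dataset exists. MUVOD is intended to close this gap.
Contribution: 1. Presents the MUVOD dataset: 7830 RGB images with segmentation masks from 17 scenes, covering 459 instances of 73 categories. 2. Provides an evaluation metric and a baseline method. 3. Proposes a new benchmark for 3D object segmentation.
Method: Collects multi-view video from different source datasets and camera rigs, annotates segmentation masks in 4D motion, and designs a baseline method and an evaluation metric.
Result: MUVOD covers diverse dynamic scenes and provides data for evaluating both 4D and 3D object segmentation.
Insight: 4D segmentation of dynamic scenes needs more high-quality datasets and standardized benchmarks, and MUVOD is a resource toward that goal.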
Abstract: The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods. Our proposed MUVOD dataset is available at https://volumetric-repository.labs.b-com.com/#/muvod.
[53] MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models
Joelle Hanna,Linus Scheibenreif,Damian Borth
Main category: cs.CV
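TL;DR: MAPEX is a remote sensing foundation model based on a mixture of modality experts. Modality-aware expert pruning enables efficient task-specific fine-tuning and deployment.
Details
Motivation: Existing remote sensing foundation models are usually pre-trained on specific modalities, which mismatches the multi-modal needs of real applications, and their size makes fine-tuning expensive.
Contribution: Proposes MAPEX, which uses a modality-conditioned expert routing mechanism and modality-aware pruning to address both the modality mismatch and the cost of fine-tuning.
Method: Pre-trains on multi-modal remote sensing data with modality-conditioned token routing to experts. At deployment, experts for irrelevant modalities are pruned and only those specialized for the task modalities are kept.
Result: On diverse remote sensing datasets, MAPEX shows strong performance compared with fully supervised training and state-of-the-art remote sensing foundation models.
Insight: Modality-aware expert pruning is an efficient and flexible way to adapt a multi-modal model to a specific task.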
Abstract: Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at https://github.com/HSG-AIML/MAPEX.
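The pruning step described in the abstract, keeping only experts specialized for the task modalities, can be sketched as a simple filter. This is a minimal illustration, not MAPEX's code: the `Expert` record, the modality names, and the parameter counts are all hypothetical, and it assumes each expert's modality is known after pre-training.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    modality: str    # modality this expert specialised in (assumed known after pre-training)
    num_params: int  # hypothetical parameter count

def prune_experts(experts, task_modalities):
    """Keep only the experts whose modality is needed by the downstream task."""
    return [e for e in experts if e.modality in task_modalities]

# Hypothetical expert pool covering three modalities.
pool = [
    Expert("rgb_0", "optical_rgb", 1_000_000),
    Expert("ms_0", "multispectral", 1_000_000),
    Expert("sar_0", "sar", 1_000_000),
    Expert("sar_1", "sar", 1_000_000),
]

# A task that only uses SAR imagery retains the two SAR experts.
kept = prune_experts(pool, task_modalities={"sar"})
kept_fraction = sum(e.num_params for e in kept) / sum(e.num_params for e in pool)
print([e.name for e in kept], kept_fraction)
```

The smaller model that results is what would then be fine-tuned and deployed for the modalities of interest.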
[54] Beyond the Linear Separability Ceiling
Enrico Vompa,Tanel Tammet,Mohit Vaishnav
Main category: cs.CV
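TL;DR: The paper shows that Visual-Language Models (VLMs) are seemingly limited on abstract reasoning tasks by the linear separability of their visual embeddings, which it measures with the Linear Separability Ceiling (LSC). It proposes task-dependent interventions (activating existing pathways or adapting weights) to overcome the limit.
Details
Motivation: The performance of existing VLMs on abstract reasoning tasks is bounded by the linear separability of their visual embeddings. This bottleneck stems not from poor perception but from failures in the language model's reasoning pathways. The paper investigates the phenomenon and how to resolve it.
Contribution: Introduces the Linear Separability Ceiling (LSC), identifies the source of the VLM performance bottleneck, and proposes task-dependent interventions (activating pathways or adapting weights) to break through it.
Method: Uses a simple linear classifier to measure the linear separability of a VLM's visual embeddings, and uses postfix tuning as a methodological control to reveal dormant reasoning pathways inside the model. For complex relational tasks, it studies the effect of adapting core model weights.
Result: Task-dependent interventions substantially improve performance, but complex tasks require deeper weight adaptation. Explicitly improving representation quality can cause the model to fail on new prompt formats even though its embeddings remain well separated.
Insight: Robust reasoning is a matter of targeted, task-aligned intervention rather than simply better representation learning, which gives a new lens for analyzing and improving VLMs.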
Abstract: Most state-of-the-art Visual-Language Models (VLMs) are seemingly limited by the linear separability of their visual embeddings on abstract reasoning tasks. This work investigates this “linear reasoning bottleneck” by introducing the Linear Separability Ceiling (LSC), the performance of a simple linear classifier on a VLM’s visual embeddings. We find this bottleneck is widespread and stems not from poor perception, but from failures in the language model’s reasoning pathways. We demonstrate this is a solvable alignment issue. The required intervention, however, is task-dependent: activating existing pathways suffices for semantic concepts, while complex relational reasoning requires adapting core model weights. Using postfix tuning as a methodological control, we find strong evidence for powerful, dormant reasoning pathways within VLMs. However, for complex relational tasks requiring deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats despite its embeddings remaining well separated. Ultimately, this work provides a new lens for VLM analysis, showing that robust reasoning is a matter of targeted alignment, not simply improved representation learning.
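Since the LSC is defined as the accuracy of a simple linear classifier on the visual embeddings, a tiny linear probe shows what is being measured. The sketch below uses a perceptron as the linear classifier and two made-up 2D "embedding" sets. Both choices are assumptions for illustration; the paper's probe and data are not specified in the abstract.

```python
def train_linear_probe(xs, ys, epochs=1000, lr=0.1):
    """Perceptron: learns w, b so that sign(w.x + b) matches labels in {-1, +1}."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # converged: every sample is correctly classified
            break
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of samples on the correct side of the learned hyperplane."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Hypothetical embeddings that a hyperplane can separate.
sep_x = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0), (0.0, 0.0), (0.5, 1.0), (1.0, 0.2)]
sep_y = [1, 1, 1, -1, -1, -1]
# XOR-style embeddings: no hyperplane classifies all four points.
xor_x = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_y = [-1, -1, 1, 1]

acc_sep = probe_accuracy(*train_linear_probe(sep_x, sep_y), sep_x, sep_y)
acc_xor = probe_accuracy(*train_linear_probe(xor_x, xor_y), xor_x, xor_y)
print(acc_sep, acc_xor)
```

The probe accuracy on a given set of embeddings plays the role of the ceiling: a VLM whose end-to-end accuracy stays at or below it is not extracting more than a linear readout would.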
[55] Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation
Chunyan Wang,Dong Zhang,Jinhui Tang
Main category: cs.CV
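TL;DR: The paper proposes DGKD-WLSS, a framework that combines diffusion-guided knowledge distillation with depth-guided feature fusion to address image quality degradation and semantic ambiguity in weakly-supervised low-light semantic segmentation, reaching state-of-the-art performance.
Details
Motivation: Weakly-supervised semantic segmentation degrades significantly in low-light environments, mainly because of image quality degradation (such as low contrast, noise, and color distortion) and the inherent limits of weak supervision. Together these produce unreliable class activation maps and semantically ambiguous pseudo-labels, which compromise the model's ability to learn discriminative feature representations.
Contribution: Proposes the DGKD-WLSS framework, consisting of Diffusion-Guided Knowledge Distillation (DGKD) and Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features through diffusion-based denoising and knowledge distillation, and DGF2 uses depth maps as illumination-invariant geometric priors to strengthen structural feature learning.
Method: DGKD-WLSS combines DGKD and DGF2. DGKD aligns features through diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps to enhance feature learning.
Result: Experiments show that DGKD-WLSS achieves state-of-the-art performance in weakly-supervised semantic segmentation under low-light conditions.
Insight: The paper identifies the core challenges of weakly-supervised semantic segmentation in low light and shows that combining diffusion-based alignment with depth information substantially improves performance.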
Abstract: Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model’s ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at: https://github.com/ChunyanWang1/DGKD-WLSS.
[56] NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning
Tianwei Mu,Feiyu Duan,Bo Zhou,Dan Xue,Manhong Huang
Main category: cs.CV
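TL;DR: This paper proposes NexViTAD, a few-shot cross-domain anomaly detection framework built on vision foundation models. A shared subspace projection and a multi-task learning module address domain shift in industrial anomaly detection, and the method reports state-of-the-art performance.
Details
Motivation: In industrial anomaly detection, differences in data distribution across domains (domain shift) limit generalization, and traditional methods need large amounts of labeled data. NexViTAD tackles this by combining pre-trained vision foundation models with few-shot learning.
Contribution: 1. A hierarchical adapter module that fuses features from the pre-trained Hiera and DINO-v2 models; 2. a shared subspace projection strategy for cross-domain knowledge transfer; 3. a multi-task decoder that handles multiple source domains; 4. an anomaly scoring method based on Sinkhorn-K-means.
Method: Combines the hierarchical adapter, shared subspace projection, and multi-task learning decoder with an anomaly scoring method that uses Sinkhorn-K-means clustering, Gaussian filtering, and adaptive thresholding.
Result: On the MVTec AD dataset, NexViTAD reaches an AUC of 97.5%, an AP of 70.4%, and a PRO of 95.2% in the target domains, surpassing other recent models.
Insight: Fusing pre-trained models with domain adaptation techniques can substantially improve cross-domain anomaly detection under few-shot conditions.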
Abstract: This paper presents a novel few-shot cross-domain anomaly detection framework, Nexus Vision Transformer for Anomaly Detection (NexViTAD), based on vision foundation models, which effectively addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms and a multi-task learning (MTL) module. The main innovations include: (1) a hierarchical adapter module that adaptively fuses complementary features from Hiera and DINO-v2 pre-trained models, constructing more robust feature representations; (2) a shared subspace projection strategy that enables effective cross-domain knowledge transfer through bottleneck dimension constraints and skip connection mechanisms; (3) an MTL decoder architecture that supports simultaneous processing of multiple source domains, significantly enhancing model generalization capabilities; (4) an anomaly score inference method based on Sinkhorn-K-means clustering, combined with Gaussian filtering and adaptive threshold processing for precision at the pixel level. Evaluated on the MVTec AD dataset, NexViTAD delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains, surpassing other recent models, marking a transformative advance in cross-domain defect detection.
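The abstract does not spell out how Sinkhorn-K-means is used for scoring. As background, the Sinkhorn step on its own is an alternating row and column rescaling that turns point-to-center costs into a balanced soft assignment. The sketch below shows only that generic step, with made-up 1D points and centers and an assumed temperature `eps`; it is not NexViTAD's scoring pipeline.

```python
import math

def sinkhorn(cost, eps=0.5, iters=200):
    """Balanced soft assignment: rows sum to 1/n and columns to 1/k (approximately)."""
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iters):
        # Rescale rows so each point carries total mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Rescale columns so each center receives total mass 1/k.
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Hypothetical 1D features and two cluster centers; cost is squared distance.
points = [0.0, 0.1, 0.9, 1.0]
centers = [0.0, 1.0]
cost = [[(p - c) ** 2 for c in centers] for p in points]

plan = sinkhorn(cost)
for p, row in zip(points, plan):
    print(p, [round(m, 3) for m in row])
```

Compared with hard nearest-center assignment, the balancing constraint keeps any one cluster from absorbing all points, which is the usual reason for pairing Sinkhorn with K-means.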
[57] HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking
Ruixiang Chen,Guolei Sun,Yawei Li,Jie Qin,Luca Benini
Main category: cs.CV
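TL;DR: The paper proposes HiM2SAM, which adds hierarchical motion estimation and memory optimization to the SAM2 framework. It significantly improves long-term video object tracking, especially under occlusion and target reappearance.
Details
Motivation: In video object tracking, complex scenes with occlusions, background clutter, and target reappearance challenge tracking algorithms. SAM2 has limitations in these cases, which motivates a low-overhead improvement that needs no additional training.
Contribution: 1. A hierarchical motion estimation strategy that combines lightweight linear prediction with selective non-linear refinement; 2. an optimized memory bank that distinguishes long-term from short-term frames to make tracking more reliable.
Method: Improves SAM2's tracking through hierarchical motion estimation (lightweight linear prediction plus selective non-linear refinement) and memory optimization (separate long-term and short-term memory frames).
Result: On LaSOT and LaSOText, the large model achieves relative AUC improvements of 9.6% and 7.2% over the original SAM2, and smaller models show even larger relative gains.
Insight: Training-free modifications can still deliver significant tracking gains. Better motion estimation and memory management in particular help with long-term occlusions and appearance changes.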
Abstract: This paper presents enhancements to the SAM2 framework for the video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our training-free, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
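The abstract mentions "lightweight linear prediction" without giving its form. One common reading is constant-velocity extrapolation of the target position, sketched below. The function name and the choice to use only the last two box centers are assumptions for illustration, not the paper's exact predictor.

```python
def predict_next_center(history):
    """Constant-velocity extrapolation from the last two (x, y) box centers."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    # next = last + (last - previous)
    return (2 * x1 - x0, 2 * y1 - y0)

# Hypothetical track: the target moves 2 px right and 1 px down per frame.
track = [(10, 5), (12, 6), (14, 7)]
print(predict_next_center(track))
```

A predictor like this costs a few arithmetic operations per frame, which fits the low-overhead framing. The selective non-linear refinement would only be needed when such an extrapolation is unreliable.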
[58] LOSC: LiDAR Open-voc Segmentation Consolidator
Nermin Samet,Gilles Puy,Renaud Marlet
Main category: cs.CV
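TL;DR: LOSC uses image-based Vision-Language Models (VLMs) and consolidates the sparse, noisy 3D point labels they yield, achieving state-of-the-art open-vocabulary segmentation in driving scenes.
Details
Motivation: When image semantics are back-projected onto 3D point clouds, the resulting labels are noisy and sparse and fall short of practical needs, so a method is needed to refine them.
Contribution: 1. A label consolidation method that enforces spatio-temporal consistency and robustness to image-level augmentations; 2. a 3D network trained on the refined labels that delivers significant gains on open-vocabulary segmentation.
Method: 1. Generate initial labels with VLMs; 2. consolidate the labels to improve spatio-temporal consistency and robustness; 3. train a 3D network on the refined labels.
Result: On nuScenes and SemanticKITTI, LOSC outperforms existing methods by significant margins in zero-shot open-vocabulary semantic and panoptic segmentation.
Insight: Refining sparse, noisy labels substantially improves 3D open-vocabulary segmentation, which underlines how much label quality matters for training.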
Abstract: We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
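The back-projection step the abstract starts from can be sketched with a pinhole camera model: each 3D point is projected into the image and inherits the label of the pixel it lands on. The intrinsics, the tiny label map, and the function name below are all hypothetical, and the points are assumed to be already in the camera frame with z pointing forward.

```python
import math

def project_labels(points, labels_2d, fx, fy, cx, cy):
    """Give each 3D point the label of the pixel it projects to, or None if it has none."""
    height, width = len(labels_2d), len(labels_2d[0])
    out = []
    for x, y, z in points:
        if z <= 0:  # behind the camera: no image evidence for this point
            out.append(None)
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        inside = 0 <= u < width and 0 <= v < height
        out.append(labels_2d[v][u] if inside else None)
    return out

# Hypothetical 4x4 label map: left half "road", right half "car".
label_map = [["road", "road", "car", "car"] for _ in range(4)]
lidar_points = [(0.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]

point_labels = project_labels(lidar_points, label_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(point_labels)
```

Points behind the camera or outside the field of view receive no label, which is one source of the sparsity that the consolidation step is meant to repair.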
[59] SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Siting Wang,Luoyang Sun,Cheng Deng,Kun Shao,Minnan Pei,Zheng Tian,Haifeng Zhang,Jun Wang
Main category: cs.CV
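TL;DR: The authors introduce SpatialViz-Bench, an automatically generated benchmark of spatial visualization reasoning tasks for evaluating multi-modal large language models, and find that current models have significant deficiencies.
Details
Motivation: Existing evaluations of multi-modal large language models (MLLMs) often rely on IQ tests or math competitions, which may overlap with training data and do not specifically assess spatial visualization. A benchmark focused on spatial visualization is therefore needed.
Contribution: Introduces SpatialViz-Bench, a multi-modal benchmark with 12 tasks and 1,180 automatically generated problems for evaluating the spatial visualization ability of MLLMs.
Method: Automatically generates tasks covering 4 sub-abilities to build a comprehensive evaluation framework, and evaluates 33 state-of-the-art MLLMs.
Result: The evaluation reveals wide performance variation and behaviors that diverge from human intuition, such as sharp 2D-to-3D performance drops and a tendency to default to formula derivation instead of spatial visualization.
Insight: State-of-the-art MLLMs still show significant deficiencies on spatial visualization tasks, and SpatialViz-Bench fills this evaluation gap.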
Abstract: Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.
[60] ViLU: Learning Vision-Language Uncertainties for Failure Prediction
Marc Lafon,Yannis Karmim,Julio Silva-Rodriguez,Paul Couairon,Clément Rambour,Raphaël Fournier-Sniehotta,Ismail Ben Ayed,Jose Dolz,Nicolas Thome
Main category: cs.CV
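TL;DR: ViLU is a new vision-language uncertainty quantification framework. It integrates the visual embedding, the predicted textual embedding, and an image-conditioned textual representation into an uncertainty-aware multi-modal representation used for failure prediction.
Details
Motivation: Reliable uncertainty quantification (UQ) and failure prediction for Vision-Language Models (VLMs) remain open problems, which ViLU addresses through a multi-modal representation.
Contribution: 1) Proposes the ViLU framework, which builds an uncertainty-aware multi-modal representation from all task-relevant textual representations; 2) treats uncertainty prediction as binary classification trained with a weighted binary cross-entropy loss, making it loss-agnostic; 3) works in post-hoc settings without direct access to the model itself.
Method: ViLU integrates the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention to form a multi-modal representation, then trains a binary classifier to distinguish correct from incorrect predictions.
Result: Experiments on several datasets (including ImageNet-1k, CC12M, and LAION-400M) show that ViLU significantly outperforms existing failure prediction methods. Ablation studies confirm the importance of its architecture and training strategy.
Insight: ViLU's novelty is in coupling uncertainty prediction with a multi-modal representation. It is particularly suited to post-hoc settings and offers a new route to more reliable VLMs.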
Abstract: Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
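The training objective named in the abstract, a weighted binary cross-entropy over correct versus incorrect predictions, can be written in a few lines. The sketch assumes the positive class (target 1) is "the VLM's prediction was wrong" and that a single `pos_weight` upweights it. The abstract does not specify the weighting scheme, so both are illustrative assumptions.

```python
import math

def weighted_bce(probs, targets, pos_weight):
    """Mean weighted binary cross-entropy.

    probs:   predicted probability that the VLM prediction is a failure
    targets: 1 if the VLM prediction was actually wrong, else 0 (assumed convention)
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

# Hypothetical batch: failures are rare, so they are upweighted by 4x.
failure_probs = [0.9, 0.2, 0.1, 0.3]
is_failure = [1, 0, 0, 1]
loss = weighted_bce(failure_probs, is_failure, pos_weight=4.0)
print(round(loss, 4))
```

Because the target depends only on whether the prediction was right, the same objective applies whatever loss the underlying VLM was trained with, which is the sense in which the abstract calls the approach loss-agnostic.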
[61] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates
Zhitao Wang,Hengyu Man,Wenrui Li,Xingtao Wang,Xiaopeng Fan,Debin Zhao
Main category: cs.CV
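TL;DR: T-GVC is a trajectory-guided generative video coding framework. Semantic-aware sparse motion sampling and trajectory-aligned loss constraints enable high-quality video reconstruction at ultra-low bitrates.
Details
Motivation: At ultra-low bitrates (ULB), existing generative video coding methods are limited by domain specificity or rely too heavily on high-level text guidance, which loses motion detail and yields unrealistic reconstructions.
Contribution: 1) A semantic-aware sparse motion sampling pipeline; 2) a training-free latent space guidance mechanism that ensures physically plausible motion patterns; 3) better results than traditional codecs and end-to-end compression methods under ULB conditions.
Method: 1) Extracts pixel-wise motion as sparse trajectory points according to semantic importance; 2) incorporates trajectory-aligned loss constraints into the diffusion process.
Result: Experiments show that T-GVC outperforms existing methods under ULB conditions and achieves more precise motion control than text-guided methods.
Insight: Geometric motion modeling opens a new direction for generative video coding and bridges low-level motion tracking with high-level semantic understanding.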
Abstract: Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, aiming to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline to effectively bridge low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points based on their semantic importance, not only significantly reducing the bitrate but also preserving critical temporal semantic information. In addition, by incorporating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free latent space guidance mechanism to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.
[62] Bridging the gap in FER: addressing age bias in deep learning
F. Xavier Gaya-Morey,Julia Sanchez-Perez,Cristina Manresa-Yee,Jose M. Buades-Rubio
Main category: cs.CV
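TL;DR: This paper studies age bias in deep learning for facial expression recognition (FER), particularly its effect on elderly people, and proposes three mitigation strategies that significantly improve recognition accuracy for the elderly group.
Details
Motivation: Existing deep FER models exhibit age bias, especially against elderly people, which undermines fairness and reliability. The problem needs to be studied and addressed.
Contribution: 1. Analyzes performance differences of FER models across age groups; 2. proposes three strategies to mitigate age bias; 3. shows that approximate age labels are valuable for improving fairness.
Method: 1. Uses Explainable AI (XAI) techniques to identify bias; 2. proposes three strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss; 3. trains on AffectNet with automatically estimated age labels and validates on balanced benchmark datasets.
Result: Recognition accuracy for elderly individuals improves consistently, particularly for the most error-prone expressions. Saliency analysis shows that models trained with age-aware strategies attend to more relevant facial regions for each age group.
Insight: Age-related bias can be mitigated with simple training modifications, and even approximate demographic labels help improve fairness in large-scale affective computing systems.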
Abstract: Facial Expression Recognition (FER) systems based on deep learning have achieved impressive performance in recent years. However, these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. In this work, we present a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly population. We first investigate whether recognition performance varies across age groups, which expressions are most affected, and whether model attention differs depending on age. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns, especially for “neutral”, “sadness”, and “anger” in elderly individuals. Based on these findings, we propose and evaluate three bias mitigation strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss. Our models are trained on a large-scale dataset, AffectNet, with automatically estimated age labels and validated on balanced benchmark datasets that include underrepresented age groups. Results show consistent improvements in recognition accuracy for elderly individuals, particularly for the most error-prone expressions. Saliency heatmap analysis reveals that models trained with age-aware strategies attend to more relevant facial regions for each age group, helping to explain the observed improvements. These findings suggest that age-related bias in FER can be effectively mitigated using simple training modifications, and that even approximate demographic labels can be valuable for promoting fairness in large-scale affective computing systems.
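The abstract names an Age-weighted Loss but does not define the weights. One simple possibility is inverse-frequency weighting, where samples from rarer age groups count more. The sketch below implements that assumption. The group names, sample counts, and per-sample losses are made up, and the paper's actual weighting may differ.

```python
from collections import Counter

def inverse_frequency_weights(age_groups):
    """Weight each group by n / (k * count) so that rare groups count more."""
    counts = Counter(age_groups)
    n, k = len(age_groups), len(counts)
    return {group: n / (k * count) for group, count in counts.items()}

def age_weighted_mean(losses, age_groups, weights):
    """Weighted average of per-sample losses, normalised by the total weight."""
    w = [weights[g] for g in age_groups]
    return sum(wi * li for wi, li in zip(w, losses)) / sum(w)

# Hypothetical batch: 8 adult samples with low loss, 2 elderly samples with high loss.
groups = ["adult"] * 8 + ["elderly"] * 2
per_sample_loss = [1.0] * 8 + [3.0] * 2

weights = inverse_frequency_weights(groups)
plain = sum(per_sample_loss) / len(per_sample_loss)
weighted = age_weighted_mean(per_sample_loss, groups, weights)
print(weights, plain, weighted)
```

In this toy batch the weighted loss is higher than the plain mean because the elderly samples, which the model handles worse, now contribute as much as the adult samples combined.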
[63] MolCLIP: A Molecular-Auxiliary CLIP Framework for Identifying Drug Mechanism of Action Based on Time-Lapsed Mitochondrial Images
Fengqian Pang,Chunyue Lei,Hongfei Zhao,Chenghao Liu,Zhiqiang Xing,Huafeng Wang,Chuyang Ye
Main category: cs.CV
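TL;DR: MolCLIP is a visual language model that combines cell video and molecule modalities to identify drug mechanism of action. It is the first to use a molecule-auxiliary CLIP framework to guide video feature learning, and it shows substantial gains in experiments.
Details
Motivation: Existing methods recognize drug Mechanism of Action (MoA) mainly from the spatial characteristics of cell images and overlook both temporal dynamics and the complementary molecule modality. MolCLIP combines time-lapse cell video with molecular information for a fuller picture of MoA.
Contribution: 1. Proposes MolCLIP, the first visual language model to combine microscopic cell video and molecule modalities. 2. Designs a molecule-auxiliary CLIP framework that guides video feature learning. 3. Adds a metric learning strategy to improve the aggregation of video features.
Method: 1. Uses a CLIP framework in which the molecule modality guides video feature learning. 2. Introduces a metric learning strategy to optimize video feature aggregation.
Result: On the MitoDataset, MolCLIP improves mAP by 51.2% for drug identification and 20.5% for MoA recognition.
Insight: Combining time-lapse cell video with the molecule modality captures the dynamics of drug action more effectively, and molecule-auxiliary learning strengthens the model's grasp of MoA.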
Abstract: Drug Mechanism of Action (MoA) mainly investigates how drug molecules interact with cells, which is crucial for drug discovery and clinical application. Recently, deep learning models have been used to recognize MoA by relying on high-content and fluorescence images of cells exposed to various drugs. However, these methods focus on spatial characteristics while overlooking the temporal dynamics of live cells. Time-lapse imaging is more suitable for observing the cell response to drugs. Additionally, drug molecules can trigger cellular dynamic variations related to specific MoA. This indicates that the drug molecule modality may complement the image counterpart. This paper proposes MolCLIP, the first visual language model to combine microscopic cell video- and molecule-modalities. MolCLIP designs a molecule-auxiliary CLIP framework to guide video features in learning the distribution of the molecular latent space. Furthermore, we integrate a metric learning strategy with MolCLIP to optimize the aggregation of video features. Experimental results on the MitoDataset demonstrate that MolCLIP achieves improvements of 51.2% and 20.5% in mAP for drug identification and MoA recognition, respectively.
[64] Attend-and-Refine: Interactive keypoint estimation and quantitative cervical vertebrae analysis for bone age assessment
Jinhee Kim,Taesung Kim,Taewoo Kim,Dong-Wook Kim,Byungduk Ahn,Yoon-Ji Kim,In-Seok Song,Jaegul Choo
Main category: cs.CV
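TL;DR: The paper proposes the Attend-and-Refine Network (ARNet), an interactive deep learning model that streamlines keypoint annotation on cervical vertebrae so that growth potential can be assessed accurately in pediatric orthodontics.
Details
Motivation: In pediatric orthodontics, accurately estimating growth potential from lateral cephalometric radiographs is essential for planning effective treatment, but conventional keypoint annotation is time-consuming and labor-intensive.
Contribution: Proposes ARNet, which combines an interaction-guided recalibration network with a morphology-aware loss function, substantially reducing manual annotation effort while improving efficiency and accuracy.
Method: ARNet comprises an interaction-guided recalibration network, which adaptively recalibrates image features in response to user feedback, and a morphology-aware loss function, which preserves the structural consistency of keypoints.
Result: Extensive validation on various datasets shows strong performance and broad applicability.
Insight: The work provides an efficient AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics.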
Abstract: In pediatric orthodontics, accurate estimation of growth potential is essential for developing effective treatment strategies. Our research aims to predict this potential by identifying the growth peak and analyzing cervical vertebra morphology solely through lateral cephalometric radiographs. We accomplish this by comprehensively analyzing cervical vertebral maturation (CVM) features from these radiographs. This methodology provides clinicians with a reliable and efficient tool to determine the optimal timing for orthodontic interventions, ultimately enhancing patient outcomes. A crucial aspect of this approach is the meticulous annotation of keypoints on the cervical vertebrae, a task often challenged by its labor-intensive nature. To mitigate this, we introduce the Attend-and-Refine Network (ARNet), a user-interactive, deep learning-based model designed to streamline the annotation process. ARNet features an interaction-guided recalibration network, which adaptively recalibrates image features in response to user feedback, coupled with a morphology-aware loss function that preserves the structural consistency of keypoints. This novel approach substantially reduces manual effort in keypoint identification, thereby enhancing the efficiency and accuracy of the process. Extensively validated across various datasets, ARNet demonstrates remarkable performance and exhibits wide-ranging applicability in medical imaging. In conclusion, our research offers an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, marking a significant advancement in the field.
[65] Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Shin’ya Yamaguchi,Kosuke Nishida,Daiki Chijiwa
Main category: cs.CV
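TL;DR: This paper proposes RED, a decoding strategy that addresses the tendency of models to ignore their own generated rationales in multi-modal chain-of-thought (CoT). Combining image-conditional and rationale-conditional information significantly improves reasoning.
Details
Motivation: Existing large vision-language models (LVLMs) often ignore the contents of generated rationales during multi-modal CoT reasoning, which hurts faithfulness and accuracy. The paper sets out to fix this.
Contribution: Proposes Rationale-Enhanced Decoding (RED), derived by casting multi-modal CoT reasoning as KL-constrained reward maximization, which significantly improves reasoning.
Method: Re-formulates multi-modal CoT as KL-constrained reward maximization focused on the rationale-conditional log-likelihood. The resulting plug-and-play decoding strategy, RED, multiplies distinct image-conditional and rationale-conditional next-token distributions.
Result: Experiments show that RED consistently and significantly outperforms standard CoT and other decoding methods across multiple benchmarks and LVLMs, improving both faithfulness and accuracy.
Insight: Combining visual and rationale information at decoding time makes better use of the generated chain of thought and points toward more reliable rationale-grounded multi-modal systems.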
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.
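The combination rule stated in the abstract, multiplying the image-conditional and rationale-conditional next-token distributions, can be sketched on a toy vocabulary. The exponent `lam` below is an assumption added for illustration. It mirrors the standard closed form of KL-constrained reward maximization, where the optimum is proportional to the reference distribution times the exponentiated reward, but the paper's exact parameterization is not given in the abstract. The token probabilities are made up.

```python
def red_combine(p_image, p_rationale, lam=1.0):
    """Multiply two next-token distributions (dicts over the same vocabulary), then renormalise."""
    scores = {tok: p_image[tok] * (p_rationale[tok] ** lam) for tok in p_image}
    z = sum(scores.values())
    return {tok: s / z for tok, s in scores.items()}

# Hypothetical next-token distributions over a two-token vocabulary.
p_given_image = {"cat": 0.6, "dog": 0.4}      # conditioned on the image
p_given_rationale = {"cat": 0.2, "dog": 0.8}  # conditioned on the generated rationale

combined = red_combine(p_given_image, p_given_rationale)
print({tok: round(p, 4) for tok, p in combined.items()})
```

A token has to be plausible under both conditionals to keep high probability after the product, so a rationale that the image-only distribution would otherwise override still influences the decoded answer.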
[66] Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation
Peixian Zhuang,Yijian Wang,Zhenqi Fu,Hongliang Zhang,Sam Kwong,Chongyi Li
Main category: cs.CV
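TL;DR: The paper proposes Tree-Mamba for underwater monocular depth estimation (UMDE). A tree-aware scanning strategy and a reliable benchmark dataset, BlueDepth, yield significant performance gains.
Details
Motivation: Existing Mamba-based methods struggle with underwater depth estimation because their inflexible state scanning strategies fail to model the structural features of underwater images, and existing UMDE datasets often contain unreliable depth labels.
Contribution: 1. A tree-aware scanning strategy that adaptively builds a minimum spanning tree and flexibly aggregates features; 2. BlueDepth, an underwater depth estimation benchmark of 38,162 image pairs with reliable depth labels.
Method: Builds a minimum spanning tree from feature similarity and aggregates spatial topological features through bottom-up and top-down traversals, strengthening multi-scale feature representation.
Result: Experiments show that Tree-Mamba outperforms several leading methods in both quantitative and qualitative evaluations with competitive computational efficiency.
Insight: A tree structure captures the topological features of underwater images effectively, and high-quality data is essential for accurate depth estimation.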
Abstract: Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images caused by light absorption and scattering effects in marine environments. Recently, Mamba-based methods have achieved promising performance across various vision tasks; however, they struggle with the UMDE task because their inflexible state scanning strategies fail to model the structural features of underwater images effectively. Meanwhile, existing UMDE datasets usually contain unreliable depth labels, leading to incorrect object-depth relationships between underwater images and their corresponding depth maps. To overcome these limitations, we develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. Specifically, we propose a tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity. The spatial topological features among the tree nodes are then flexibly aggregated through bottom-up and top-down traversals, enabling stronger multi-scale feature representation capabilities. Moreover, we construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels. This benchmark serves as a foundational dataset for training existing deep learning-based UMDE methods to learn accurate object-depth relationships. Extensive experiments demonstrate the superiority of the proposed Tree-Mamba over several leading methods in both qualitative results and quantitative evaluations with competitive computational efficiency. Code and dataset will be available at https://wyjgr.github.io/Tree-Mamba.html.
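The tree construction described in the abstract, a minimum spanning tree built from feature similarity, can be sketched with Prim's algorithm on a small feature map. The 4-connected grid, the scalar features, and the absolute difference used as edge weight are assumptions for illustration. The paper's graph and similarity measure may differ.

```python
import heapq

def minimum_spanning_tree(features):
    """Prim's algorithm on a 4-connected grid; edge weight = |feature difference|."""
    h, w = len(features), len(features[0])
    visited = {(0, 0)}
    heap, edges = [], []

    def push_neighbors(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in visited:
                weight = abs(features[r][c] - features[nr][nc])
                heapq.heappush(heap, (weight, (r, c), (nr, nc)))

    push_neighbors(0, 0)
    while heap and len(visited) < h * w:
        weight, src, dst = heapq.heappop(heap)
        if dst in visited:  # stale entry: dst was reached by a cheaper edge
            continue
        visited.add(dst)
        edges.append((src, dst, weight))
        push_neighbors(*dst)
    return edges

# Hypothetical feature map: two similar regions separated by a sharp boundary.
feature_map = [[0.0, 0.1, 5.0],
               [0.2, 0.1, 5.1]]
tree = minimum_spanning_tree(feature_map)
for src, dst, weight in tree:
    print(src, "->", dst, round(weight, 2))
```

The tree links similar pixels first and crosses the sharp boundary only once, so traversals along it tend to aggregate features within a region before mixing across regions.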
[67] Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring
Wei Shang,Dongwei Ren,Wanying Zhang,Pengfei Zhu,Qinghua Hu,Wangmeng Zuo
Main category: cs.CV
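TL;DR: This paper proposes M2AENet (Motion-Aware Adaptive Pixel Pruning), an efficient method for local motion deblurring. A trainable mask predictor and structural reparameterization concentrate computation on blurred regions, and an intra-frame motion analyzer adaptively guides their restoration.
Details
Motivation: Existing deblurring methods allocate computation inefficiently and handle spatially varying blur patterns inadequately, so they struggle with local motion blur.
Contribution: 1. Proposes a trainable mask predictor that identifies blurred regions;
2. Applies structural reparameterization to optimize inference computation;
3. Designs an intra-frame motion analyzer that adaptively restores blurred regions via motion trajectories.
Method: 1. End-to-end training with a combination of reconstruction loss, reblur loss, and mask loss;
2. Structural reparameterization converts 3×3 convolutions into 1×1 convolutions to reduce computation;
3. The intra-frame motion analyzer generates motion trajectories from relative pixel displacements.
Result: The method outperforms existing approaches on both local and global blur datasets while reducing FLOPs by 49% compared with SOTA models (e.g., LMD-ViT).
Insight: Dynamically identifying blurred regions and allocating computation to them can substantially improve efficiency while maintaining performance, which offers a new direction for real-time deblurring.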
Abstract: Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.
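Why 1×1 convolutions enable pixel-level pruning can be shown with a minimal sketch: a 1×1 convolution is a per-pixel matrix multiply, so it can be evaluated only at pixels flagged as blurred. The function name, the pass-through of sharp pixels, and the equal input/output channel count are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def pruned_pointwise_conv(x, weight, bias, blur_mask):
    """Apply a 1x1 convolution only at pixels marked as blurred.

    x:         (H, W, C) feature map
    weight:    (C, C) pointwise kernel (assumed square so that sharp
               pixels can be passed through unchanged)
    bias:      (C,)
    blur_mask: (H, W) boolean, True where the pixel is blurred
    """
    if weight.shape[0] != weight.shape[1]:
        raise ValueError("this sketch assumes equal input/output channels")
    out = x.copy()                      # sharp pixels are copied as-is
    selected = x[blur_mask]             # (N_blurred, C)
    out[blur_mask] = selected @ weight + bias
    return out
```

The cost of the matrix multiply scales with the number of blurred pixels rather than with H×W, which is the source of the savings when blur is local.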
[68] One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models
Jiale Zhao,Xinyang Jiang,Junyao Gao,Yuhao Xue,Cairong Zhao
Main category: cs.CV
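TL;DR: The paper introduces the CrossVLAD benchmark dataset and the CRAFT attack framework for evaluating the robustness of unified vision-language models (VLMs) against cross-task adversarial attacks.
Details
Motivation: The multi-task flexibility of unified VLMs creates distinctive security challenges: an adversarial input must remain effective across multiple task instructions.
Contribution: 1. Introduces the CrossVLAD benchmark dataset for systematically evaluating cross-task adversarial attacks; 2. Proposes CRAFT, an efficient region-centric attack method.
Method: CRAFT (Cross-task Region-based Attack Framework with Token-alignment) uses a region-centric, token-aligned attack strategy to raise the success rate of cross-task adversarial attacks.
Result: Experiments on Florence-2 and other unified VLMs show that CRAFT outperforms existing methods in both overall cross-task attack performance and targeted object-change success rate.
Insight: Cross-task adversarial attacks pose new challenges to the safety and robustness of unified VLMs, and defense mechanisms need further study.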
Abstract: Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective (consistently manipulating a target object's classification across four downstream tasks) and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
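The simultaneous-misclassification metric described above can be sketched as follows. This is a minimal illustration assuming each adversarial sample comes with one boolean per task that records whether the attack succeeded on that task; the function name and data layout are hypothetical.

```python
def cross_task_success_rate(per_sample_task_success):
    """Fraction of samples whose attack succeeds on every task at once.

    per_sample_task_success: list of lists of bools, one inner list per
    adversarial sample and one bool per downstream task.
    """
    if not per_sample_task_success:
        return 0.0
    hits = sum(1 for tasks in per_sample_task_success if all(tasks))
    return hits / len(per_sample_task_success)
```

A sample that fools three of four tasks counts as a failure under this definition, which is what makes the metric stricter than averaging per-task success rates.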
[69] Scaling RL to Long Videos
Yukang Chen,Wei Huang,Baifeng Shi,Qinghao Hu,Hanrong Ye,Ligeng Zhu,Zhijian Liu,Pavlo Molchanov,Jan Kautz,Xiaojuan Qi,Sifei Liu,Hongxu Yin,Yao Lu,Song Han
Main category: cs.CV
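TL;DR: The paper presents a full-stack framework that uses reinforcement learning to scale vision-language models (VLMs) to long-video reasoning, addressing the challenges specific to that setting.
Details
Motivation: Long-video reasoning tasks such as question answering challenge existing VLMs, which must process long sequences efficiently while retaining context.
Contribution: 1) Introduces the large-scale dataset LongVideo-Reason; 2) Designs a two-stage training pipeline (CoT-SFT and RL); 3) Develops MR-SP, an efficient training infrastructure.
Method: Chain-of-thought supervised fine-tuning (CoT-SFT) is combined with reinforcement learning (RL), and MR-SP is used to make long-video training efficient.
Result: LongVILA-R1-7B performs strongly on long-video QA benchmarks, outperforming Video-R1-7B and matching Gemini-1.5-Pro on LongVideo-Reason-eval. MR-SP achieves up to a 2.1x speedup on long-video RL training.
Insight: Scaling long-video reasoning requires efficient data processing and training frameworks; the parallel design of MR-SP offers a general-purpose approach to multimodal RL training.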
Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
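The benefit of cached video embeddings during rollout can be illustrated with a minimal sketch. The class, its method names, and the `encoder` callable are all hypothetical; the sketch only shows the memoization idea and says nothing about how MR-SP or its vLLM-based engine is actually implemented.

```python
class EmbeddingCache:
    """Memoizes per-video embeddings so repeated rollouts skip the encoder."""

    def __init__(self, encoder):
        self.encoder = encoder      # hypothetical callable: frames -> embeddings
        self._store = {}
        self.encoder_calls = 0      # counter kept only to make the saving visible

    def get(self, video_id, frames):
        if video_id not in self._store:
            self._store[video_id] = self.encoder(frames)
            self.encoder_calls += 1
        return self._store[video_id]


def rollout_embeddings(cache, video_id, frames, num_rollouts):
    """Return the embeddings consumed by each of num_rollouts rollouts."""
    return [cache.get(video_id, frames) for _ in range(num_rollouts)]
```

With several rollouts sampled per video, as is typical in RL training, the expensive visual encoding runs once and every rollout reuses the same embeddings.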
[70] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Haochen Wang,Xiangtai Li,Zilong Huang,Anran Wang,Jiacong Wang,Tao Zhang,Jiani Zheng,Sule Bai,Zijian Kang,Jiashi Feng,Zhuochen Wang,Zhaoxiang Zhang
Main category: cs.CV
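TL;DR: The paper proposes TreeBench, a diagnostic benchmark for holistically evaluating visual grounded reasoning, and TreeVGR, a method that jointly supervises localization and reasoning with reinforcement learning and markedly improves model performance.
Details
Motivation: Existing visual grounded reasoning models lack a comprehensive benchmark, so their perception and reasoning abilities in complex scenes cannot be measured holistically.
Contribution: 1. Proposes the TreeBench benchmark, built on three principles (focused visual perception, traceable evidence, and second-order reasoning); 2. Proposes TreeVGR, which jointly optimizes localization and reasoning with reinforcement learning.
Method: TreeVGR jointly supervises localization and reasoning with reinforcement learning. It is initialized from Qwen2.5-VL-7B, and its training targets accurate localization and explainable reasoning pathways.
Result: On TreeBench's challenging questions, even the most advanced models stay below 60% accuracy (e.g., OpenAI-o3 scores 54.87). TreeVGR yields clear gains on several benchmarks (V* Bench +16.8, MME-RealWorld +12.6, TreeBench +13.4).
Insight: Traceability is key to advancing visual grounded reasoning; jointly training localization and reasoning improves both explainability and performance.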
Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human “thinking with images”. However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
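Bounding-box evaluation of traceable evidence is commonly scored with intersection over union (IoU). The sketch below assumes IoU and an `(x1, y1, x2, y2)` box format; the abstract does not state which box metric TreeBench actually uses.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero for non-overlapping boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A predicted evidence box would then be compared against the annotated box for each question, alongside the correctness of the final answer.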
[71] Energy-Guided Decoding for Object Hallucination Mitigation
Xixi Liu,Ailin Deng,Christopher Zach
Main category: cs.CV
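TL;DR: The paper proposes an energy-based decoding method that mitigates object hallucination in vision-language models. The method is simple yet effective: it reduces the bias in the "Yes" ratio and improves performance.
Details
Motivation: Existing approaches to mitigating object hallucination are either restricted to specific decoding methods, require sophisticated modifications to visual inputs, or rely on knowledge from external models, so a simpler and more general solution is needed.
Contribution: Reveals the imbalance in the "Yes" ratio of vision-language models and proposes an energy-based decoding method that dynamically selects the hidden states from the layer with the lowest energy.
Method: Energy-based decoding that dynamically selects the hidden states from the layer with the minimal energy score.
Result: On three benchmarks (POPE, MME, and MMVP), the method improves accuracy and F1 score. Average accuracy improves by 4.82% over greedy decoding, and the average yes-ratio gap is reduced by 8.81%.
Insight: Dynamically selecting low-energy hidden states reduces model bias and improves performance without complex modifications or external dependencies.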
Abstract: Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods either are restricted to specific decoding methods, or demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal the phenomenon that VLMs exhibit significant imbalance in the "Yes" ratio (i.e., the fraction of "Yes" answers among the total number of questions) across three different visual question answering (VQA) datasets. Furthermore, we propose an energy-based decoding method, which dynamically selects the hidden states from the layer with minimal energy score. It is simple yet effective in reducing the bias for the yes ratio while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score on three VQA datasets across three commonly used VLMs over several baseline methods. The average accuracy improvement is 4.82% compared to greedy decoding. Moreover, the average yes-ratio gap reduction is 8.81%, meaning the proposed method is less biased as shown in Figure 1.
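Layer selection by minimal energy can be sketched as below. The abstract does not define the energy score; this sketch assumes the negative log-sum-exp of the logits (the free-energy score common in energy-based models) and a hypothetical unembedding matrix `unembed` shared across layers. Both are assumptions for illustration, not the authors' formulation.

```python
import numpy as np


def energy(logits):
    """Assumed energy score: negative log-sum-exp of the logits."""
    m = logits.max()
    return -(m + np.log(np.exp(logits - m).sum()))  # numerically stable


def select_layer(hidden_states, unembed):
    """Pick the layer whose projected logits have the lowest energy.

    hidden_states: (L, D) hidden state of the current token at each layer
    unembed:       (D, V) hypothetical projection from hidden size to vocabulary
    Returns (layer_index, hidden_state_of_that_layer).
    """
    energies = [energy(h @ unembed) for h in hidden_states]
    best = int(np.argmin(energies))
    return best, hidden_states[best]
```

The next token would then be decoded from the selected layer's hidden state instead of always using the final layer.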
[72] EEvAct: Early Event-Based Action Recognition with High-Rate Two-Stream Spiking Neural Networks
Michael Neumeier,Jules Lecomte,Nils Kazinski,Soubarna Banik,Bing Li,Axel von Arnim
Main category: cs.CV
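TL;DR: The paper proposes EEvAct, an early event-based action recognition approach. A high-rate two-stream spiking neural network (SNN) delivers high accuracy at low latency and improves final accuracy on the THU EACT-50 dataset by 2% over previous work.
Details
Motivation: Early action recognition matters for safety and responsiveness, and the high temporal resolution and low latency of event cameras suit this need. However, most existing approaches accumulate events into low-rate frames or space-time voxels, which limits early prediction, while SNN approaches can process events at a high rate but still fall short on accuracy.
Contribution: Proposes a high-rate two-stream SNN architecture that improves the accuracy of event-based early action recognition; designs a new early recognition framework and validates it on THU EACT-50; demonstrates the approach on a real-world sports motion capture task.
Method: A two-stream SNN architecture processes the event data, with one stream extracting temporal information and the other spatial information. Processing the event stream at a high rate yields low latency and high accuracy.
Result: On THU EACT-50, the method improves final accuracy by 2% over previous work and shows higher recognition performance at short observation times.
Insight: A two-stream SNN architecture that combines temporal and spatial information can effectively improve early action recognition on event data, offering a new direction for real-time applications.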
Abstract: Recognizing human activities early is crucial for the safety and responsiveness of human-robot and human-machine interfaces. Due to their high temporal resolution and low latency, event-based vision sensors are a perfect match for this early recognition demand. However, most existing processing approaches accumulate events to low-rate frames or space-time voxels which limits the early prediction capabilities. In contrast, spiking neural networks (SNNs) can process the events at a high-rate for early predictions, but most works still fall short on final accuracy. In this work, we introduce a high-rate two-stream SNN which closes this gap by outperforming previous work by 2% in final accuracy on the large-scale THU EACT-50 dataset. We benchmark the SNNs within a novel early event-based recognition framework by reporting Top-1 and Top-5 recognition scores for growing observation time. Finally, we exemplify the impact of these methods on a real-world task of early action triggering for human motion capture in sports.
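Reporting Top-1 and Top-5 scores for growing observation time can be sketched as follows, assuming the classifier emits a score vector per sample at each observation step. The array layout and function name are assumptions for illustration.

```python
import numpy as np


def topk_accuracy_over_time(scores, labels, k):
    """Top-k accuracy at each observation step.

    scores: (T, N, K) class scores after each of T observation steps
            for N samples and K classes
    labels: (N,) ground-truth class indices
    Returns an array of shape (T,).
    """
    # Indices of the k highest-scoring classes per step and sample.
    topk = np.argsort(-scores, axis=-1)[..., :k]          # (T, N, k)
    hits = (topk == labels[None, :, None]).any(axis=-1)   # (T, N)
    return hits.mean(axis=1)
```

Plotting the returned curve against observation time shows how early a model becomes reliable, rather than only its final accuracy.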
[73] Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
David Pujol-Perich,Sergio Escalera,Albert Clapés
Main category: cs.CV
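TL;DR: The paper proposes the Sparse-Dense Side-Tuner (SDST) for efficient video temporal grounding (VTG). By combining sparse and dense feature tuning, it improves performance while reducing the parameter count.
Details
Motivation: Most existing methods rely on final-layer features of frozen pre-trained backbones, which limits adaptation to new domains, and full fine-tuning is often impractical. Side-tuning (ST) is an alternative, but prior ST work overlooks the sparse nature of moment retrieval.
Contribution: 1. Proposes SDST, the first anchor-free ST architecture for VTG; 2. Introduces Reference-based Deformable Self-Attention; 3. Presents the first integration of the InternVideo2 backbone into an ST framework.
Method: A sparse-dense feature tuning strategy is combined with Reference-based Deformable Self-Attention, which improves context modeling, so that performance rises while the parameter count falls.
Result: The method achieves competitive or SOTA results on QVHighlights, TACoS, and Charades-STA with up to 73% fewer parameters than existing SOTA methods.
Insight: Sparse features are essential for temporal grounding, and combining them with dense tuning improves performance. The improved deformable attention mechanism further strengthens the model's understanding of context.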
Abstract: Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning – and particularly side-tuning (ST) – has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention – a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of InternVideo2 backbone into an ST framework, showing its profound implications in performance. Overall, our method significantly improves existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. the existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.
[74] Deep Learning based 3D Volume Correlation for Additive Manufacturing Using High-Resolution Industrial X-ray Computed Tomography
Keerthana Chand,Tobias Fritsch,Bardia Hejazi,Konstantin Poka,Giovanni Bruno
Main category: cs.CV
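TL;DR: The paper proposes a deep learning-based 3D volume registration method for quality inspection in additive manufacturing. Dynamic patch-based processing of high-resolution XCT data improves both registration accuracy and efficiency.
Details
Motivation: Geometric deformations in additive manufacturing can degrade component performance. Traditional Digital Volume Correlation (DVC) lacks a ground-truth deformation field for registration, and high-resolution XCT data are computationally demanding.
Contribution: 1. Proposes a deep learning-based method for voxel-wise deformation estimation; 2. Introduces a dynamic patch-based strategy for handling large volumes; 3. Uses a Binary Difference Map (BDM) to evaluate registration accuracy.
Method: A deep learning model estimates voxel-wise deformations between CAD and XCT volumes, dynamic patch-based processing handles the high-resolution data, and registration quality is evaluated with the BDM and the Dice Score.
Result: Compared with classic DVC methods, the Dice Score improves by 9.2% and the voxel match rate by 9.9%, while the interaction time drops from days to minutes.
Insight: Deep learning can provide efficient and reliable registration for additive manufacturing, and compensation meshes used in a closed loop could make the manufacturing process more reliable and efficient.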
Abstract: Quality control in additive manufacturing (AM) is vital for industrial applications in areas such as the automotive, medical and aerospace sectors. Geometric inaccuracies caused by shrinkage and deformations can compromise the life and performance of additively manufactured components. Such deviations can be quantified using Digital Volume Correlation (DVC), which compares the computer-aided design (CAD) model with the X-ray Computed Tomography (XCT) geometry of the components produced. However, accurate registration between the two modalities is challenging due to the absence of a ground truth or reference deformation field. In addition, the extremely large data size of high-resolution XCT volumes makes computation difficult. In this work, we present a deep learning-based approach for estimating voxel-wise deformations between CAD and XCT volumes. Our method uses a dynamic patch-based processing strategy to handle high-resolution volumes. In addition to the Dice Score, we introduce a Binary Difference Map (BDM) that quantifies voxel-wise mismatches between binarized CAD and XCT volumes to evaluate the accuracy of the registration. Our approach shows a 9.2% improvement in the Dice Score and a 9.9% improvement in the voxel match rate compared to classic DVC methods, while reducing the interaction time from days to minutes. This work sets the foundation for deep learning-based DVC methods to generate compensation meshes that can then be used in closed-loop correlations during the AM production process. Such a system would be of great interest to industries since the manufacturing process will become more reliable and efficient, saving time and material.
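The two evaluation quantities named above can be sketched on binarized volumes. The Dice Score follows its standard definition; treating the Binary Difference Map as a voxel-wise XOR, and the voxel match rate as the fraction of agreeing voxels, are assumptions based on the abstract's description ("voxel-wise mismatches between binarized CAD and XCT volumes").

```python
import numpy as np


def dice_score(a, b):
    """Standard Dice coefficient between two binary volumes."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both volumes empty: treated as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom


def binary_difference_map(a, b):
    """Assumed BDM: True wherever the two binarized volumes disagree."""
    return np.logical_xor(a.astype(bool), b.astype(bool))


def voxel_match_rate(a, b):
    """Assumed definition: fraction of voxels on which the volumes agree."""
    return 1.0 - binary_difference_map(a, b).mean()
```

Unlike the scalar Dice Score, the difference map keeps the spatial location of every mismatch, which is what makes it useful for inspecting where a registration fails.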
[75] SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
Dren Fazlija,Monty-Maximilian Zühlke,Johanna Schrader,Arkadij Orlov,Clara Stein,Iyiola E. Olatunji,Daniel Kudenko
Main category: cs.CV
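TL;DR: SCOOTER is an open-source, statistically powered framework for evaluating how authentic unrestricted adversarial examples look. A large-scale human evaluation and a comparison with models show that existing color-space and diffusion-based attacks fail to produce imperceptible images. The work also provides practical guidelines, tools, and a benchmark dataset.
Details
Motivation: Unrestricted adversarial attacks are not bound by traditional $\ell_p$-norm constraints, so there is no norm-based guarantee that their outputs are imperceptible to humans. No unified, statistically sound evaluation framework exists, and a standardized way to verify the authenticity of these attacks is needed.
Contribution: 1. Best-practice guidelines for evaluating the imperceptibility of unrestricted adversarial examples; 2. The first large-scale human vs. model comparison, showing that six attacks fail to produce imperceptible images; 3. Open-source tools and a benchmark dataset; 4. The finding that GPT-4o can serve as a preliminary test, but with limited effectiveness.
Method: The SCOOTER framework combines a crowd study (346 participants) with statistical analysis to assess the human imperceptibility of adversarial examples and compares the results with model behavior. GPT-4o is also evaluated as a preliminary test.
Result: The tested attacks fail to produce images that humans find imperceptible, and automated vision systems do not align with human perception. GPT-4o consistently detects adversarial examples for only four of the six tested attacks.
Insight: 1. The authenticity of unrestricted adversarial examples has to be verified by human evaluation; 2. Such evaluations need statistical support; 3. Human and machine perception diverge, so human judgment should be the reference.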
Abstract: Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object’s color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
[76] SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes
Jiaxin Huang,Ziwen Li,Hanlve Zhang,Runnan Chen,Xiao He,Yandong Guo,Wenping Wang,Tongliang Liu,Mingming Gong
Main category: cs.CV
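TL;DR: SURPRISE3D is a new dataset for evaluating language-guided spatial reasoning segmentation in complex 3D scenes. It targets the underexplored spatial reasoning capability in current 3D vision-language research.
Details
Motivation: Existing datasets mix semantic cues (such as object names) with spatial context, which leads models to rely on superficial shortcuts instead of genuinely understanding spatial relationships. SURPRISE3D counters this bias with queries that contain no object names.
Contribution: Introduces the SURPRISE3D dataset, with more than 200k vision-language pairs across 900+ complex indoor scenes and a focus on spatial reasoning, together with the 3D-SRS benchmark suite.
Method: The dataset is built on ScanNet++ v2 and contains 89k+ human-annotated spatial queries written without object names to avoid shortcut bias. The queries cover diverse spatial reasoning skills, including relative position, narrative perspective, parametric perspective, and absolute distance.
Result: Initial benchmarks show that current SOTA 3D visual grounding methods and 3D-LLMs perform poorly on spatial reasoning, underscoring the need for the dataset.
Insight: The dataset and benchmark suite advance spatially aware AI and provide an important tool for embodied interaction and robotic planning.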
Abstract: The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object names) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce SURPRISE3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. SURPRISE3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. SURPRISE3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found in https://github.com/liziwennba/SUPRISE.
[77] Robust and Generalizable Heart Rate Estimation via Deep Learning for Remote Photoplethysmography in Complex Scenarios
Kang Cen,Chang-Hong Fu,Hong Hong
Main category: cs.CV
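TL;DR: The paper proposes an end-to-end deep learning network for remote photoplethysmography (rPPG). A 3D convolutional neural network and a dynamic hybrid loss function improve the robustness and generalization of heart rate and BVP estimation in complex scenarios.
Details
Motivation: Non-contact rPPG measures heart rate from facial videos, but existing models still face challenges in accuracy, robustness, and generalization under complex scenarios.
Contribution: 1. Proposes an end-to-end rPPG extraction network that combines a 3D convolutional neural network with a differential frame fusion module; 2. Incorporates a Temporal Shift Module (TSM) with self-attention to enhance features; 3. Designs a dynamic hybrid loss function to mitigate overfitting.
Method: 1. A 3D convolutional neural network reconstructs rPPG signals from raw facial videos; 2. The differential frame fusion module captures BVP variations; 3. TSM and self-attention refine feature extraction.
Result: Evaluated on PURE, UBFC-rPPG, and the challenging MMPD dataset, the method outperforms existing approaches (MAE of 7.58 on the MMPD test set after training on PURE).
Insight: Differential frame fusion and the dynamic hybrid loss are essential for performance in complex scenarios, and the efficiency of TSM may carry over to other temporal tasks.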
Abstract: Non-contact remote photoplethysmography (rPPG) technology enables heart rate measurement from facial videos. However, existing network models still face challenges in accuracy, robustness, and generalization capability under complex scenarios. This paper proposes an end-to-end rPPG extraction network that employs 3D convolutional neural networks to reconstruct accurate rPPG signals from raw facial videos. We introduce a differential frame fusion module that integrates differential frames with original frames, enabling frame-level representations to capture blood volume pulse (BVP) variations. Additionally, we incorporate Temporal Shift Module (TSM) with self-attention mechanisms, which effectively enhance rPPG features with minimal computational overhead. Furthermore, we propose a novel dynamic hybrid loss function that provides stronger supervision for the network, effectively mitigating overfitting. Comprehensive experiments were conducted on not only the PURE and UBFC-rPPG datasets but also the challenging MMPD dataset under complex scenarios, involving both intra-dataset and cross-dataset evaluations, which demonstrate the superior robustness and generalization capability of our network. Specifically, after training on PURE, our model achieved a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming the state-of-the-art models.
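The idea of combining differential frames with original frames, and the MAE metric used for evaluation, can be sketched as below. Plain channel concatenation stands in for the paper's learned fusion module, which the abstract does not detail; the function names are hypothetical.

```python
import numpy as np


def fuse_differential_frames(video):
    """Concatenate each frame with its difference from the previous frame.

    video: (T, H, W, C) array.
    Returns (T - 1, H, W, 2 * C): original channels followed by the
    frame-to-frame difference channels (assumed fusion by concatenation).
    """
    diff = video[1:] - video[:-1]
    return np.concatenate([video[1:], diff], axis=-1)


def heart_rate_mae(predicted_bpm, reference_bpm):
    """Mean absolute error between predicted and reference heart rates."""
    predicted = np.asarray(predicted_bpm, dtype=float)
    reference = np.asarray(reference_bpm, dtype=float)
    return float(np.abs(predicted - reference).mean())
```

The difference channels emphasize the small frame-to-frame intensity changes that carry the pulse signal, while the original channels retain appearance context.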
[78] Visual Instance-aware Prompt Tuning
Xi Xiao,Yunbei Zhang,Xingjian Li,Tianyang Wang,Xiao Wang,Yuxiang Wei,Jihun Hamm,Min Xu
Main category: cs.CV
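TL;DR: The paper proposes Visual Instance-aware Prompt Tuning (ViaPT), which generates an instance-aware prompt for each input and fuses it with dataset-level prompts. This addresses the sub-optimal performance of conventional methods caused by high variance in downstream datasets.
Details
Motivation: Conventional Visual Prompt Tuning (VPT) uses dataset-level prompts that are identical for all input instances, which yields sub-optimal performance because downstream datasets have high variance.
Contribution: Proposes ViaPT, which combines instance-aware prompts with dataset-level prompts and uses PCA to retain important prompting information, reducing learnable parameters while improving performance.
Method: ViaPT generates instance-aware prompts, fuses them with dataset-level prompts, and uses PCA to balance dataset-level and instance-level knowledge when optimizing visual prompts.
Result: Experiments on 34 diverse datasets show that ViaPT consistently outperforms existing baselines.
Insight: VPT-Deep and VPT-Shallow are two corner cases; by combining the strengths of both, ViaPT achieves better performance and parameter efficiency.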
Abstract: Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two corner cases based on a conceptual understanding, in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.
[79] Synergistic Prompting for Robust Visual Recognition with Missing Modalities
Zhihui Zhang,Luanyuan Dai,Qika Lin,Yunfeng Diao,Guangyin Jin,Yufei Guo,Jing Zhang,Xiaoshuai Hao
Main category: cs.CV
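TL;DR: The paper proposes Synergistic Prompting (SyP), a framework that addresses the performance degradation caused by missing input modalities in multi-modal visual recognition. With a dynamic adapter and a synergistic prompting strategy, SyP markedly improves robustness and adaptability.
Details
Motivation: In real-world applications, missing or incomplete modality inputs often degrade performance. Existing prompt-based methods, limited by static prompts and basic prompt-tuning strategies, struggle to adapt to varying missing-data conditions and to stay reliable when critical modalities are missing.
Contribution: Proposes the SyP framework with two innovations, a dynamic adapter and a synergistic prompting strategy, enabling flexible adaptation to varying missing-data conditions and robust reasoning when key modalities are missing.
Method: 1. Dynamic Adapter: computes adaptive scaling factors to generate prompts dynamically, replacing static parameters. 2. Synergistic Prompting Strategy: combines static and dynamic prompts to balance information across modalities.
Result: On three widely used visual recognition datasets, SyP significantly outperforms existing methods and remains robust under diverse missing rates and conditions.
Insight: Combining dynamic and static prompts effectively improves the adaptability and robustness of multi-modal models, offering a new approach to handling missing modalities.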
Abstract: Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.
[80] Patient-specific vs Multi-Patient Vision Transformer for Markerless Tumor Motion Forecasting
Gauthier Rotsart de Hertaing,Dani Manjah,Benoit Macq
Main category: cs.CV
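TL;DR: The paper is the first to apply the Vision Transformer (ViT) architecture to markerless tumor motion forecasting, comparing patient-specific (PS) and multi-patient (MP) training strategies. PS models perform better when training data are plentiful, while MP models are more robust and generalize better under clinical time constraints.
Details
Motivation: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. Current markerless methods rely mostly on deep learning, yet transformer-based architectures remain unexplored in this domain.
Contribution: 1. First application of ViT to markerless tumor motion forecasting; 2. A comparison of patient-specific and multi-patient training strategies; 3. Evidence of the practical value of MP models when clinical time is limited.
Method: 1. The multi-patient model is trained on digitally reconstructed radiographs (DRRs) from 31 patients; 2. Patient-specific models are trained on the target patient's planning data only; 3. Performance is assessed with Average Displacement Error (ADE) and Final Displacement Error (FDE).
Result: 1. PS models perform better on planning data (T1); 2. MP models are more robust to inter-fractional anatomical variability and reach performance comparable to PS models on treatment data (T2) without retraining.
Insight: Although PS models are more precise when data are plentiful, MP models work out of the box and are robust, which makes them better suited to time-constrained clinical settings.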
Abstract: Background: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Purpose: This work introduces a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT). Two training strategies are evaluated under clinically realistic constraints: a patient-specific (PS) approach that learns individualized motion patterns, and a multi-patient (MP) model designed for generalization. The comparison explicitly accounts for the limited number of images that can be generated between planning and treatment sessions. Methods: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient’s planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Results: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusions: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings.
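The two trajectory metrics named above have standard definitions, sketched here for a single predicted trajectory. The `(T, D)` array layout and function name are assumptions for illustration.

```python
import numpy as np


def ade_fde(predicted, target):
    """Average and Final Displacement Error for one trajectory.

    predicted, target: (T, D) arrays of T future positions in D dimensions.
    ADE is the mean Euclidean error over all T steps;
    FDE is the Euclidean error at the last step.
    """
    distances = np.linalg.norm(predicted - target, axis=-1)  # (T,)
    return float(distances.mean()), float(distances[-1])
```

ADE summarizes accuracy over the whole forecast horizon, while FDE isolates the error at the end of the horizon, where forecasts are typically hardest.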
[81] Rethinking Query-based Transformer for Continual Image Segmentation
Yuchen Zhu,Cheng Shi,Dingyou Wang,Jiajin Tang,Zhengxuan Wei,Yu Wu,Guanbin Li,Sibei Yang
Main category: cs.CV
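TL;DR: The paper proposes SimCIS, a query-based transformer method that assigns queries by directly selecting image features, addressing the loss of plasticity and the dependence on data order in continual image segmentation.
Details
Motivation: Current continual image segmentation methods decouple mask generation from the continual learning process, but they suffer from loss of plasticity and heavy reliance on input data order. The paper aims to resolve both issues.
Contribution: Proposes SimCIS, which assigns queries by directly selecting image features to preserve objectness, and introduces cross-stage consistency in selection and a "visual query"-based replay mechanism to mitigate catastrophic forgetting.
Method: Highly aggregated image features are used to assign queries directly, achieving precise alignment; cross-stage consistency and the visual query replay mechanism are combined to improve continual learning.
Result: SimCIS outperforms existing methods across various segmentation tasks, settings, splits, and input data orders.
Insight: Highly aggregated image features give queries a shortcut for generating masks through simple feature alignment, and cross-stage selection and replay effectively mitigate forgetting.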
Abstract: Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring “perfect alignment” to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative “visual query”-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.
[82] Single-Step Latent Diffusion for Underwater Image Restoration
Jiayi Wu,Tianfu Wang,Md Abu Bakr Siddique,Md Jahidul Islam,Cornelia Fermuller,Yiannis Aloimonos,Christopher A. Metzler
Main category: cs.CV
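TL;DR: The paper proposes SLURPP, a single-step latent diffusion method for underwater image restoration that combines a pretrained latent diffusion model with explicit scene decomposition, substantially improving both restoration quality and computational efficiency.
Details
Motivation: Underwater image restoration matters in marine ecology, aquaculture, and underwater archaeology. Existing diffusion-based methods are effective but computationally expensive, and they tend to produce unrealistic artifacts in scenes with complex geometry and large depth variation.
Contribution: SLURPP combines a pretrained latent diffusion model with explicit scene decomposition for fast, high-quality underwater image restoration, and a physics-based synthetic data pipeline supplies diverse training data.
Method: SLURPP uses a latent diffusion model to capture priors on scene geometry and depth, and pairs it with an explicit scene decomposition that models light attenuation and backscattering. Training data comes from a physics-based synthesis pipeline, which increases diversity and annotation density.
Result: SLURPP performs well on both synthetic and real data: it is over 200 times faster than existing diffusion-based methods, improves PSNR by about 3 dB on synthetic benchmarks, and shows clear qualitative improvements on real data.
Insight: A pretrained latent diffusion model effectively captures scene-structure priors, and combining it with explicit scene decomposition improves both the realism and the efficiency of underwater image restoration. The diversity of physics-based synthetic data is important for generalization.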
Abstract: Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models – which encode strong priors on the geometry and depth of scenes – with an explicit scene decomposition – which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200 times faster than existing diffusion-based methods while offering roughly 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website: https://tianfwang.github.io/slurpp/.
[83] MIRA: A Novel Framework for Fusing Modalities in Medical RAG
Jinhong Wang,Tajamul Ashraf,Zongyan Han,Jorma Laaksonen,Rao Mohammad Anwer
Main category: cs.CV
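TL;DR: The MIRA framework dynamically adjusts the amount of retrieved content and integrates multimodal information, addressing the accuracy loss that medical MLLMs suffer from too little or too much retrieval, and substantially improving factual accuracy and performance.
Details
Motivation: MLLMs used for medical diagnosis often generate responses that conflict with medical knowledge. RAG can improve accuracy, but insufficient or excessive retrieval leads to factual errors.
Contribution: Proposes the MIRA framework, comprising a module that dynamically adjusts retrieved content and a multimodal RAG framework that integrates images with medical knowledge.
Method: 1. A calibrated Rethinking and Rearrangement module dynamically manages retrieved content; 2. a query-rewrite module works with image embeddings and a medical knowledge base.
Result: On public medical VQA and report generation benchmarks, MIRA substantially improves factual accuracy and performance, reaching state-of-the-art results.
Insight: Dynamic retrieval adjustment and multimodal fusion are key to improving the accuracy of medical MLLMs.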
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.
[84] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement
Xiao Yang,Yuxuan Fan,Can Liu,Houcheng Su,Weichen Guo,Jiyao Wang,Dengbo He
Main category: cs.CV
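TL;DR: The paper proposes CiCi, a test-time adaptation strategy for remote photoplethysmography (rPPG) built on spatio-temporal inconsistency; by combining consistency and inconsistency priors, it improves model adaptability at inference time.
Details
Motivation: Privacy and real-time constraints limit the use of existing domain adaptation and generalization methods in real-world settings, so a fully test-time adaptation (TTA) method is needed to improve adaptability in rPPG tasks.
Contribution: Proposes CiCi, an expert knowledge-based self-supervised framework that combines spatio-temporal consistency and inconsistency, and introduces a gradient dynamic control mechanism to stabilize adaptation.
Method: Drawing on physiological priors and empirical observations, designs a self-supervised framework that combines spatio-temporal consistency (in the frequency domain) with inconsistency (in the time domain), and uses a gradient dynamic control mechanism to avoid conflicts between the priors.
Result: Experiments on five datasets show that the method significantly outperforms existing techniques without access to source data, achieving state-of-the-art real-time self-supervised adaptation.
Insight: Exploiting inconsistency signals in the spatio-temporal domain can markedly improve test-time adaptation, especially in privacy-sensitive scenarios with strict real-time requirements.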
Abstract: Remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for monitoring physiological signals using a camera. Although various domain adaptation and generalization methods have been proposed to improve the adaptability of deep learning-based rPPG models in unseen deployment environments, privacy concerns and the need for real-time adaptation restrict their use in real-world deployment. We therefore propose a novel fully Test-Time Adaptation (TTA) strategy tailored to rPPG tasks. Specifically, based on prior knowledge in physiology and our own observations, we find that rPPG signals exhibit spatio-temporal consistency in the frequency domain but significant inconsistency in the time domain. Leveraging both the consistency and inconsistency priors, we introduce an expert knowledge-based, self-supervised Consistency-inConsistency-integration (CiCi) framework to enhance model adaptation during inference. Our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. In extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, achieving state-of-the-art performance in real-time self-supervised adaptation without access to source data. The code will be released later.
[85] Towards Continuous Home Cage Monitoring: An Evaluation of Tracking and Identification Strategies for Laboratory Mice
Juan Pablo Oberhauser,Daniel Grzenda
Main category: cs.CV
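TL;DR: The paper proposes a real-time identification algorithm for accurately tracking and identifying laboratory mice in digital home cages. It combines a tracker that uses appearance and motion cues, a Transformer-based ID classifier, and a tracklet associator, improving tracking efficiency and ID accuracy.
Details
Motivation: Continuous automated monitoring of laboratory mice improves data collection accuracy and animal welfare, but high housing density, similar appearance, and high mobility make individual identification difficult.
Contribution: Develops a real-time ID algorithm (tracker, Transformer classifier, and tracklet associator) that delivers ID tracking at a high frame rate with round-the-clock coverage.
Method: The pipeline has three parts: 1) MouseTracks, a multiple object tracker that combines appearance and motion cues; 2) Mouseformer, a Transformer-based ID classifier; 3) MouseMap, a tracklet associator.
Result: Assigns IDs at 30 FPS and, compared with existing methods, improves tracking efficiency and reduces ID switches.
Insight: Fusing multiple cues and optimizing the association step makes stable mouse identification and monitoring possible in complex environments.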
Abstract: Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program to assign final ID predictions to tracklets (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors compared to current mouse tracking methods.
[86] Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions
Longfei Li,Zhiwen Fan,Wenyan Cong,Xinhang Liu,Yuyang Yin,Matt Foutter,Panwang Pan,Chenyu You,Yue Wang,Zhangyang Wang,Yao Zhao,Marco Pavone,Yunchao Wei
Main category: cs.CV
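TL;DR: The paper proposes a complete solution for generating realistic Martian landscape videos that combines data reconstruction with video generation. Its two components, M3arsSynth and MarsGen, address the scarcity of Martian data and the domain gap to terrestrial imagery.
Details
Motivation: Synthesizing realistic Martian landscape videos is important for mission rehearsal and robotic simulation, but Martian data is scarce and differs markedly from Earth imagery.
Contribution: 1) Proposes M3arsSynth, a data reconstruction engine that rebuilds 3D Martian environments from NASA stereo images; 2) develops MarsGen, a Martian video generator that produces visually realistic and geometrically consistent videos.
Method: Combines 3D reconstruction with video generation: M3arsSynth reconstructs the 3D environment, then MarsGen generates video conditioned on an initial frame and, optionally, camera trajectories or text prompts.
Result: Experiments show that the approach outperforms video synthesis models trained on terrestrial datasets in both visual fidelity and 3D structural consistency.
Insight: Reconstructing physically accurate 3D environments is key to Martian video synthesis, and the cross-domain gap can be addressed with targeted data reconstruction and generative modeling.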
Abstract: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA’s Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
[87] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Haoyu Wu,Diankun Wu,Tianyu He,Junliang Guo,Yang Ye,Yueqi Duan,Jiang Bian
Main category: cs.CV
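TL;DR: The paper proposes Geometry Forcing, which aligns a video diffusion model with a pretrained geometric foundation model to improve the 3D consistency of generated videos.
Details
Motivation: Videos are 2D projections of a dynamic 3D world, but existing video diffusion models are trained only on raw video data and often fail to capture geometry-aware structure, so their outputs lack 3D consistency.
Contribution: Proposes Geometry Forcing, which uses two objectives, Angular Alignment and Scale Alignment, to align the intermediate representations of a video diffusion model with the features of a geometric foundation model, strengthening the model's geometric awareness.
Method: 1. Angular Alignment enforces directional consistency through cosine similarity; 2. Scale Alignment preserves scale information by regressing unnormalized geometric features.
Result: On camera view-conditioned and action-conditioned video generation tasks, Geometry Forcing substantially improves visual quality and 3D consistency.
Insight: Introducing geometry-aware intermediate representations into video generation helps the model capture the dynamics of the 3D world, improving the realism and consistency of generated content.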
Abstract: Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model’s intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
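Note: the two alignment objectives named in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of mean squared error for the regression, and the `head` callable (a stand-in for whatever learned predictor maps normalized diffusion features to geometric features) are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def angular_alignment_loss(diffusion_feats, geometry_feats):
    """Mean (1 - cosine similarity) over paired tokens; blind to magnitude."""
    sims = [cosine_similarity(d, g) for d, g in zip(diffusion_feats, geometry_feats)]
    return 1.0 - sum(sims) / len(sims)

def scale_alignment_loss(diffusion_feats, geometry_feats, head):
    """Mean squared error between head(normalized diffusion feature) and the
    unnormalized geometry feature (MSE is an assumption, not from the paper)."""
    total, count = 0.0, 0
    for d, g in zip(diffusion_feats, geometry_feats):
        pred = head(l2_normalize(d))
        total += sum((p - t) ** 2 for p, t in zip(pred, g))
        count += len(g)
    return total / count

# Toy example: same direction, different magnitude.
diffusion = [[3.0, 4.0]]
geometry = [[6.0, 8.0]]
double = lambda v: [2.0 * x for x in v]  # hypothetical stand-in for a learned head
angular = angular_alignment_loss(diffusion, geometry)
scale = scale_alignment_loss(diffusion, geometry, double)
```

In the toy example the angular term is zero because the two vectors point the same way, while the scale term is nonzero because the predicted magnitude is off. That is the sense in which the two objectives are complementary.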
[88] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
JingLi Lin,Chenming Zhu,Runsen Xu,Xiaohan Mao,Xihui Liu,Tai Wang,Jiangmiao Pang
Main category: cs.CV
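TL;DR: OST-Bench is a new benchmark for evaluating multimodal large language models (MLLMs) on online spatio-temporal scene understanding; it exposes the weaknesses of current models in complex spatio-temporal reasoning.
Details
Motivation: Existing benchmarks are mostly designed for offline settings and do not reflect the dynamic exploration and reasoning required in the real world, so a new evaluation tool for online spatio-temporal understanding is needed.
Contribution: Proposes OST-Bench, with 1.4k scenes and 10k question-answer pairs, to evaluate MLLMs on online spatio-temporal reasoning tasks, and analyzes the models' main error patterns.
Method: Builds the dataset from ScanNet, Matterport3D, and ARKitScenes, simulates online scene exploration, and designs tasks that test incremental observation processing and long-term memory integration.
Result: Leading MLLMs perform poorly on complex spatio-temporal reasoning tasks, and their accuracy declines as the exploration horizon and memory demands grow.
Insight: Long-term memory retrieval and complex spatial reasoning are the two core challenges for improving online reasoning.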
Abstract: Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
[89] CLIP Won’t Learn Object-Attribute Binding from Natural Data and Here is Why
Bijay Gurung,David T. Hoffmann,Thomas Brox
Main category: cs.CV
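TL;DR: CLIP cannot learn object-attribute binding from natural data. The cause lies in properties of natural data, namely low attribute density, incomplete captions, and saliency bias, rather than in batch size or a lack of hard negatives.
Details
Motivation: Contrastive vision-language models such as CLIP are widely used for zero-shot classification and in multimodal models, but their representations have limitations, including the inability to bind attributes to the correct objects. The authors approach the problem by analyzing data properties.
Contribution: Shows how properties of natural data (low attribute density, incomplete captions, saliency bias) affect CLIP's ability to learn object-attribute binding, and uses a synthetic dataset to verify that these properties are the decisive factors.
Method: Uses a synthetic dataset to systematically analyze how data properties affect CLIP's binding ability, comparing model performance under different data conditions.
Result: CLIP learns almost perfect object-attribute binding only when the data has specific properties, such as high attribute density and complete captions.
Insight: The properties of natural data, rather than the model architecture or training strategy, are the main reason CLIP fails to learn binding; improving the data design may be more effective than changing the model.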
Abstract: Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as a vision encoder for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of “a yellow submarine and a blue bus” or “a blue submarine and a yellow bus”. Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP’s ability to learn binding using a synthetic dataset. We find that common properties of natural data, such as low attribute density, incomplete captions, and saliency bias (a tendency of human captioners to describe the object that is “most salient” to them), have a detrimental effect on binding performance. Contrary to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
[90] Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun,Sukjun Hwang,Su Ho Han,Taeoh Kim,Inwoong Lee,Dongyoon Wee,Joon-Young Lee,Seon Joo Kim,Minho Shim
Main category: cs.CV
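TL;DR: The paper proposes STTM, a training-free acceleration method for video large language models that reduces computation through multi-granular spatio-temporal token merging while maintaining high accuracy.
Details
Motivation: Video LLMs process large numbers of spatio-temporal tokens, so their compute cost grows quadratically with token count, and existing methods do not fully exploit the local spatio-temporal redundancy in video data.
Contribution: Proposes the training-free STTM method, which uses multi-granular spatial tokens and directed temporal merging to cut computation substantially with little accuracy loss.
Method: Performs a coarse-to-fine search over a quadtree structure to obtain spatial tokens, then applies directed token merging along the temporal dimension to exploit spatio-temporal redundancy.
Result: Performs strongly on six video QA benchmarks; for example, with a 50% token budget it achieves a 2x speed-up with only a 0.5% accuracy drop.
Insight: The spatio-temporal redundancy of video data can be exploited for efficient token merging, improving compute efficiency without additional training.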
Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
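Note: the coarse-to-fine quadtree step mentioned in the abstract above can be pictured with the following toy sketch for a single frame. It is not STTM itself. It assumes scalar "token" values on a square grid whose side is a power of two, and it uses a max-deviation-from-mean threshold as the similarity test. Both are simplifications chosen for illustration, and the directed temporal merging stage is not shown.

```python
def quadtree_merge(grid, row, col, size, threshold):
    """Return (row, col, size, mean) tokens covering a size x size block.

    A block becomes one coarse token when every value lies within `threshold`
    of the block mean; otherwise it is split into four quadrants.
    Assumes `size` is a power of two (illustrative simplification).
    """
    values = [grid[r][c] for r in range(row, row + size) for c in range(col, col + size)]
    mean = sum(values) / len(values)
    if size == 1 or max(abs(v - mean) for v in values) <= threshold:
        return [(row, col, size, mean)]
    half = size // 2
    tokens = []
    for dr in (0, half):
        for dc in (0, half):
            tokens += quadtree_merge(grid, row + dr, col + dc, half, threshold)
    return tokens

# A 4x4 frame with three uniform quadrants and one detailed quadrant.
frame = [
    [1, 1, 5, 6],
    [1, 1, 7, 8],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
tokens = quadtree_merge(frame, 0, 0, 4, threshold=0.0)
```

Here the 16 original positions collapse to 7 tokens: the three uniform quadrants each become one coarse token, and only the detailed quadrant keeps its four fine tokens.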
[91] Multigranular Evaluation for Brain Visual Decoding
Weihao Xia,Cengiz Oztireli
Main category: cs.CV
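TL;DR: The paper proposes BASIC, a framework for multigranular evaluation of brain visual decoding methods. By quantifying structural fidelity, inferential alignment, and contextual coherence, it addresses the limitations of existing evaluation protocols.
Details
Motivation: Existing evaluation methods for brain visual decoding rely mainly on coarse metrics, lack a neuroscientific foundation, and cannot capture fine-grained visual differences, which limits meaningful comparison and improvement of methods.
Contribution: Proposes BASIC, a unified multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence.
Method: 1. Structural level: a hierarchical suite of segmentation-based metrics covering foreground, semantic, instance, and component masks; 2. semantic level: multimodal large language models extract structured scene representations of objects, attributes, and relationships.
Result: Benchmarks a range of visual decoding methods across multiple stimulus-neuroimaging datasets, yielding a more discriminative, interpretable, and comprehensive evaluation.
Insight: BASIC gives brain visual decoding a more principled and fine-grained evaluation tool, which should help future methods be improved and compared.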
Abstract: Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.
[92] Single-pass Adaptive Image Tokenization for Minimum Program Search
Shivam Duggal,Sanghyun Byun,William T. Freeman,Antonio Torralba,Phillip Isola
Main category: cs.CV
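TL;DR: Proposes KARL, a single-pass adaptive image tokenizer that approximates Kolmogorov complexity by predicting the appropriate number of tokens for an image, all within one forward pass.
Details
Motivation: Existing visual representation learning systems generally use fixed-length representations and ignore differences in data complexity or familiarity. KARL addresses this with adaptive tokenization while avoiding the test-time search over multiple encodings.
Contribution: 1. Proposes KARL, a single-pass adaptive image tokenizer; 2. predicts the token count based on Kolmogorov complexity principles; 3. shows that KARL performs on par with multi-pass methods; 4. presents scaling laws covering encoder/decoder size, continuous vs. discrete tokenization, and more.
Method: KARL is trained in a way that closely resembles the Upside-Down Reinforcement Learning paradigm: in a single forward pass it predicts when to halt token generation, approximating the image's KC.
Result: In a single pass, KARL matches the performance of multi-pass adaptive tokenizers, with the token count serving as a proxy for the minimum description length.
Insight: There is an analogy between adaptive image tokenization and Algorithmic Information Theory; the image complexity (KC) that KARL predicts aligns with human intuition, particularly along the axes of structure vs. noise and in- vs. out-of-distribution familiarity.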
Abstract: According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL’s training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity – revealing alignment with human intuition.
[93] Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models
Helen Qu,Sang Michael Xie
Main category: cs.CV
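TL;DR: The paper studies how word co-occurrence statistics in pretraining data affect compositional generalization in multimodal models such as CLIP and large multimodal models (LMMs). It finds that the pointwise mutual information (PMI) of word co-occurrence correlates strongly with model performance, showing that concept combinations matter for accuracy.
Details
Motivation: Multimodal models such as CLIP and LMMs perform well on common concepts, but how well they generalize to combinations of concepts is unclear. The paper examines how word co-occurrence statistics in the pretraining data affect performance on concept combinations.
Contribution: 1. Shows a strong correlation between co-occurrence PMI and CLIP zero-shot accuracy (r=0.97); 2. reproduces the effect on natural images (r=0.75); 3. demonstrates that this behavior in CLIP transfers to LMMs (evaluated on TextVQA and VQAv2).
Method: 1. Uses synthetic images to evaluate performance on different concept pairs; 2. quantifies word co-occurrence with PMI; 3. edits natural images so that they contain concept pairs with varying PMI.
Result: PMI in the CLIP pretraining data is highly correlated with model performance (a 14% zero-shot accuracy gap between images in the top and bottom 5% of PMI values), and the correlation remains substantial in LMMs (TextVQA r=0.70; VQAv2 r=0.62).
Insight: The compositional generalization of current multimodal models is limited by the distribution of concept combinations in the pretraining data. New algorithms or architectures are needed that improve it without scaling the training data combinatorially.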
Abstract: CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear – for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impacts CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
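Note: the PMI measure used in the abstract above is PMI(a, b) = log[ p(a, b) / (p(a) p(b)) ]. A minimal sketch of estimating it from captions follows. The caption-level co-occurrence window, whitespace tokenization, and base-2 logarithm are assumptions for illustration; the paper's exact preprocessing is not given in the abstract.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(captions):
    """PMI for each unordered word pair that co-occurs in at least one caption.

    Probabilities are estimated as the fraction of captions containing the
    word (or both words). Pairs that never co-occur are omitted.
    """
    n = len(captions)
    word_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        words = sorted(set(caption.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    table = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

# Toy corpus: "dog" is frequent, "cat" and "sofa" are rare but always together.
captions = ["dog ball", "dog ball", "dog park", "cat sofa"]
table = pmi_table(captions)
```

In this toy corpus the rare but exclusive pair ("cat", "sofa") gets a higher PMI than the more frequent pair ("ball", "dog"). This shows how PMI separates pairing strength from single-word frequency.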
cs.GR
[94] SD-GS: Structured Deformable 3D Gaussians for Efficient Dynamic Scene Reconstruction
Wei Yao,Shuzhao Xie,Letian Li,Weixiang Zhang,Zhixin Lai,Shiqi Dai,Ke Zhang,Zhi Wang
Main category: cs.GR
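TL;DR: SD-GS is an efficient dynamic scene reconstruction framework that uses a hierarchical deformable anchor grid and an adaptive densification strategy to shrink model size and improve computational efficiency while preserving visual quality.
Details
Motivation: Current 4D Gaussian frameworks deliver strong visual fidelity and rendering speed for dynamic scene reconstruction, but the trade-off between storage cost and the ability to model complex motion limits their practical use.
Contribution: 1. Introduces a hierarchical deformable anchor grid as an efficient scene representation; 2. proposes a deformation-aware densification strategy that dynamically adjusts the anchor distribution.
Method: SD-GS builds on a hierarchical deformable anchor grid in which each anchor derives 3D Gaussians for its local spatiotemporal region, and it refines the anchor distribution with an adaptive densification strategy.
Result: Model size falls by 60% on average and FPS improves by 100% on average, while visual quality matches or exceeds existing methods.
Insight: With hierarchical and adaptive strategies, SD-GS improves both storage cost and runtime performance in dynamic scene reconstruction.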
Abstract: Current 4D Gaussian frameworks for dynamic scene reconstruction deliver impressive visual fidelity and rendering speed. However, the inherent trade-off between storage costs and the ability to characterize complex physical motions significantly limits the practical application of these methods. To tackle these problems, we propose SD-GS, a compact and efficient dynamic Gaussian splatting framework for complex dynamic scene reconstruction, featuring two key contributions. First, we introduce a deformable anchor grid, a hierarchical and memory-efficient scene representation where each anchor point derives multiple 3D Gaussians in its local spatiotemporal region and serves as the geometric backbone of the 3D scene. Second, to enhance modeling capability for complex motions, we present a deformation-aware densification strategy that adaptively grows anchors in under-reconstructed high-dynamic regions while reducing redundancy in static areas, achieving superior visual quality with fewer anchors. Experimental results demonstrate that, compared to state-of-the-art methods, SD-GS achieves an average of 60% reduction in model size and an average of 100% improvement in FPS, significantly enhancing computational efficiency while maintaining or even surpassing visual quality.
[95] Capture Stage Environments: A Guide to Better Matting
Hannah Dröge,Janelle Pfeifer,Saskia Rabich,Markus Plack,Reinhard Klein,Matthias B. Hullin
Main category: cs.GR
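TL;DR: The paper examines the challenges of image matting in professional capture stage environments and offers guidelines for an improved workflow, along with an efficient way to adapt state-of-the-art methods.
Details
Motivation: Common matting algorithms perform poorly on capture stage content because they cannot cope with its peculiarities, so improvements targeted at this environment are needed.
Contribution: Summarizes the characteristics of capture stage content, proposes guidelines for an improved workflow, and demonstrates an efficient adaptation pipeline that does not require extensive annotations.
Method: Proposes a validation methodology based on a diffusion model and shows how existing matting techniques can be adapted efficiently to custom capture setups.
Result: The evaluation supports the effectiveness of the proposed approach and highlights its benefits in both offline and real-time settings.
Insight: The peculiarities of capture stage content pose new challenges for matting, and proactive intervention combined with efficient adaptation can noticeably improve results.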
Abstract: Capture stages are high-end sources of state-of-the-art recordings for downstream applications in movies, games, and other media. One crucial step in almost all pipelines is the matting of images to isolate the captured performances from the background. While common matting algorithms deliver remarkable performance in other applications like teleconferencing and mobile entertainment, we found that they struggle significantly with the peculiarities of capture stage content. The goal of our work is to share insights into those challenges as a curated list of those characteristics along with a constructive discussion for proactive intervention and present a guideline to practitioners for an improved workflow to mitigate unresolved challenges. To this end, we also demonstrate an efficient pipeline to adapt state-of-the-art approaches to such custom setups without the need of extensive annotations, both offline and real-time. For an objective evaluation, we propose a validation methodology based on a leading diffusion model that highlights the benefits of our approach.
[96] RTR-GS: 3D Gaussian Splatting for Inverse Rendering with Radiance Transfer and Reflection
Yongyang Zhou,Fang-Lue Zhang,Zichen Wang,Lei Zhang
Main category: cs.GR
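TL;DR: The paper proposes RTR-GS, a 3D Gaussian Splatting framework for inverse rendering that combines radiance transfer and reflection; it decomposes BRDF and lighting and delivers credible relighting results.
Details
Motivation: 3D Gaussian Splatting (3DGS) excels at novel view synthesis, but inverse rendering and relighting remain challenging, especially when rendering reflective objects.
Contribution: Proposes RTR-GS, a new inverse rendering framework that handles objects with arbitrary reflectance properties, decomposes BRDF and lighting, and produces high-quality relighting results.
Method: Uses a hybrid rendering model that combines forward rendering (for radiance transfer) with deferred rendering (for reflections) to separate high-frequency and low-frequency appearance, and refines the BRDF and lighting decomposition with a physically-based deferred rendering branch.
Result: Experiments show strong performance on novel view synthesis, normal estimation, decomposition, and relighting, with efficient training and inference.
Insight: Separating high-frequency and low-frequency information with a hybrid rendering model mitigates the floating artifacts caused by spherical harmonic overfitting.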
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities in novel view synthesis. However, rendering reflective objects remains a significant challenge, particularly in inverse rendering and relighting. We introduce RTR-GS, a novel inverse rendering framework capable of robustly rendering objects with arbitrary reflectance properties, decomposing BRDF and lighting, and delivering credible relighting results. Given a collection of multi-view images, our method effectively recovers geometric structure through a hybrid rendering model that combines forward rendering for radiance transfer with deferred rendering for reflections. This approach successfully separates high-frequency and low-frequency appearances, mitigating floating artifacts caused by spherical harmonic overfitting when handling high-frequency details. We further refine BRDF and lighting decomposition using an additional physically-based deferred rendering branch. Experimental results show that our method enhances novel view synthesis, normal estimation, decomposition, and relighting while maintaining an efficient training and inference process.
cs.SD
[97] Input Conditioned Layer Dropping in Speech Foundation Models
Abdul Hannan,Daniele Falavigna,Alessio Brutti
Main category: cs.SD
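TL;DR: The paper proposes input-driven layer dropping (LD), in which a lightweight selection network dynamically chooses the combination of layers to execute, making speech foundation models more adaptable in edge and IoT settings.
Details
Motivation: Compute resources on edge and IoT devices vary over time, so model architectures must adapt flexibly to different compute budgets. Existing layer dropping methods are limited in how they select layers, or they require substantial changes to the neural architecture.
Contribution: 1. Proposes input-driven layer dropping, which uses the input features and a lightweight selection network to determine the best combination of layers; 2. validates the method on four public speech and audio benchmarks.
Method: 1. Conditions on the input features; 2. introduces a lightweight selection network that decides which layers to drop; 3. requires no significant modification of the backbone network.
Result: The method clearly outperforms random dropping and performs on par with or better than early exit.
Insight: An input-driven dynamic adjustment mechanism can balance compute load against model performance, offering a practical solution for resource-constrained scenarios.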
Abstract: Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations either in the mode of selecting layers or by significantly modifying the neural architecture. To this end, we propose input-driven $\mathcal{LD}$ that employs the network’s input features and a lightweight layer selecting network to determine the optimum combination of processing layers. Extensive experimentation on 4 speech and audio public benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, thoroughly outperforming random dropping and producing results on par with (or better than) early exit.
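Note: the idea of input-conditioned layer dropping in the abstract above can be sketched in a few lines. Everything concrete here is hypothetical: the selector is reduced to one linear scoring row per layer, the compute budget is a fixed `keep` count, and the "layers" are toy functions on a number. The sketch only shows the control flow, not the paper's selector design or training.

```python
def select_layers(features, selector_weights, keep):
    """Score every backbone layer from the input features and return the
    indices of the `keep` highest-scoring layers, in execution order."""
    scores = [
        sum(w * f for w, f in zip(row, features))  # one linear score per layer
        for row in selector_weights
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

def run_backbone(x, layers, active):
    """Apply only the selected layers; dropped layers are skipped entirely."""
    for index in active:
        x = layers[index](x)
    return x

# Hypothetical 4-layer backbone and selector weights (4 layers x 2 features).
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 10]
selector_weights = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.0]]

active_a = select_layers([1.0, 0.0], selector_weights, keep=2)
active_b = select_layers([0.0, 1.0], selector_weights, keep=2)
output_a = run_backbone(1, layers, active_a)
```

The two toy inputs lead to different subsets of layers under the same budget, which is the difference from random dropping or a fixed early-exit depth.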
cs.LG
[98] Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
A. Bochkov
Main category: cs.LG
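TL;DR: The paper proposes a Transformer training approach based on modular composition and layer-wise expansion on top of frozen input embeddings, showing that models can be extended efficiently and flexibly.
Details
Motivation: Large language models (LLMs) are conventionally trained monolithically and end to end, which is resource-intensive and inflexible. The paper explores modular composition and layer-wise expansion on frozen embeddings as a more efficient and flexible way to develop models.
Contribution: 1. Shows that frozen embeddings can act as a universal "docking port" that lets specialist models be merged seamlessly into a Mixture-of-Experts (MoE) model; 2. proposes a layer-wise incremental training method that lets a model grow in depth as needed.
Method: 1. Merges specialist models by averaging their output logits; 2. trains incrementally one layer at a time, progressively increasing Transformer depth.
Result: The merged MoE model outperforms its individual experts on benchmarks such as MMLU; the layer-wise grown model converges stably, and its depth correlates with the emergence of complex reasoning abilities such as those required for SQuAD.
Insight: Frozen embeddings provide a foundation for modular and incremental training, suggesting that AI model development can be more biological or constructive in style, supporting efficient scaling and continual learning.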
Abstract: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal “docking port,” enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is “grown” by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
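The merge step described in the abstract, averaging the output logits of specialist models, can be sketched in a few lines of plain Python. The function names and the example logits are hypothetical; the sketch assumes every expert emits logits over the same vocabulary, which is what a shared frozen embedding would make possible.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def merge_logits(expert_logits):
    """Average next-token logits across experts (assumes a shared vocabulary)."""
    vocab = len(expert_logits[0])
    if any(len(row) != vocab for row in expert_logits):
        raise ValueError("all experts must produce logits over the same vocabulary")
    n = len(expert_logits)
    return [sum(row[j] for row in expert_logits) / n for j in range(vocab)]

# Hypothetical logits from two specialists for the same context.
expert_a = [2.0, 0.0, -1.0]
expert_b = [0.0, 1.0, 3.0]
merged = merge_logits([expert_a, expert_b])
probs = softmax(merged)
```

Averaging happens in logit space before the softmax, so no extra parameters or routing network are involved; this is only the arithmetic of the merge, not the paper's evaluation setup.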
[99] Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
Zhiwei Zhang,Hui Liu,Xiaomin Li,Zhenwei Dai,Jingying Zeng,Fali Wang,Minhua Lin,Ramraj Chandradevan,Zhen Li,Chen Luo,Xianfeng Tang,Qi He,Suhang Wang
Main category: cs.LG
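TL;DR: The paper proposes a unified reward modeling framework that jointly trains a single-objective (Bradley-Terry) reward function and a multi-objective regression reward function, improving the robustness and scoring performance of reward models on out-of-distribution data.
Details
Motivation: Existing RLHF methods perform poorly in out-of-distribution (OOD) settings, and multi-objective reward functions become a performance bottleneck because high-quality data is limited. The paper aims to address both problems. Contribution: Proposes a framework that jointly trains Bradley-Terry (BT) single-objective and multi-objective regression reward functions, analyzes their complementarity theoretically, and validates the framework experimentally.
Method: Jointly trains the BT single-objective and multi-objective regression reward functions on a shared embedding space to combine their strengths: the regression task improves OOD robustness, and BT training improves multi-objective scoring.
Result: Experiments show that the method significantly improves the robustness and scoring performance of reward models; a 7B model even outperforms a 70B baseline.
Insight: The complementarity of the BT loss and the regression objective offers a new angle on reward modeling; in OOD settings in particular, the interplay between multi-attribute scoring and single-objective training is key.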
Abstract: Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley–Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
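To make the joint objective concrete, here is a minimal sketch of a Bradley-Terry loss combined with a multi-attribute regression loss. The weighting factor `lam` and the plain weighted sum are assumptions for illustration; the abstract does not give the exact form of the combined objective, and the shared embedding that would feed both heads is omitted.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss, -log sigmoid(r_chosen - r_rejected), computed stably
    as softplus(-(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def mse_loss(pred, target):
    """Mean squared error over per-attribute scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(r_chosen, r_rejected, attr_pred, attr_target, lam=1.0):
    """Hypothetical joint objective: BT preference loss plus weighted regression
    loss. In a real model both heads would read the same shared embedding."""
    return bt_loss(r_chosen, r_rejected) + lam * mse_loss(attr_pred, attr_target)

# Example: the chosen response scores higher, and attribute predictions are close.
loss = joint_loss(
    r_chosen=1.2,
    r_rejected=0.3,
    attr_pred=[0.8, 0.4],     # e.g. predicted helpfulness and correctness scores
    attr_target=[1.0, 0.5],
    lam=0.5,
)
```

The BT term only sees the difference between the two scalar rewards, while the regression term constrains absolute attribute scores, which is one way to read the complementarity the abstract describes.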
[100] GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing
Peiyan Zhang,Haibo Jin,Liying Kang,Haohan Wang
Main category: cs.LG
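TL;DR: The paper introduces GuardVal, a dynamic jailbreak evaluation protocol for large language models (LLMs) that generates and refines jailbreak prompts on the fly to test model safety more comprehensively.
Details
Motivation: Existing jailbreak evaluation methods struggle to capture the evolving nature and complexity of LLMs, leaving gaps in vulnerability assessment. Contribution: Proposes the GuardVal protocol, which dynamically generates and refines jailbreak prompts; proposes a new optimization method that prevents stagnation during prompt refinement.
Method: Dynamically generates and refines jailbreak prompts, adjusting them according to the state of the defender model; the new optimization method keeps the prompts effective as refinement proceeds.
Result: Tested on a diverse set of models from Mistral-7b to GPT-4 across 10 safety domains, the protocol reveals distinct behavioral patterns among the models and gives a comprehensive view of their robustness.
Insight: Dynamic evaluation exposes model-specific weaknesses and points to directions for future research and safety improvements.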
Abstract: Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM’s state, providing a more accurate assessment of defender LLMs’ capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.
[101] Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation
Sua Lee,Kyubum Shin,Jung Ho Park
Main category: cs.LG
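TL;DR: This paper proposes DeMul, a method that distills knowledge from a large language model (LLM) directly into prompts and skips the description-extraction step, yielding semantically richer prompts and more efficient optimization. In the multi-prompt setting it also shows the value of prompt weighting. Experiments show superior performance across 11 recognition datasets.
Details
Motivation: Existing methods that rely on LLM-generated descriptions suffer from high variability and low reliability. To overcome these limitations, the authors propose a description-free prompt learning method that distills knowledge directly from the LLM. Contribution: 1. Proposes a description-free prompt learning method (DeMul) that distills knowledge directly from the LLM; 2. Explores the role of prompt weighting in the multi-prompt setting; 3. Validates the method on 11 datasets.
Method: DeMul skips the conventional step of extracting descriptions from an LLM and instead distills the LLM's knowledge directly into continuous prompt vectors, improving semantic expressiveness and training efficiency. It also introduces a weighting mechanism for the multi-prompt setting.
Result: Experiments show that DeMul outperforms existing methods on 11 recognition datasets, supporting its efficiency and robustness.
Insight: 1. Distilling directly from the LLM avoids the unreliability of description extraction; 2. Prompt weighting can reflect the importance of different prompts; 3. Continuous vector representations remove the dependence on discrete templates and improve flexibility.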
Abstract: Recent advances in pre-trained Vision Language Models (VLM) have shown promising potential for effectively adapting to downstream tasks through prompt learning, without the need for additional annotated paired datasets. To supplement the text information in VLM trained on correlations with vision data, new approaches leveraging Large Language Models (LLM) in prompts have been proposed, enhancing robustness to unseen and diverse data. Existing methods typically extract text-based responses (i.e., descriptions) from LLM to incorporate into prompts; however, this approach suffers from high variability and low reliability. In this work, we propose Description-free Multi-prompt Learning (DeMul), a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLM into prompts. By adopting a description-free approach, prompts can encapsulate richer semantics while still being represented as continuous vectors for optimization, thereby eliminating the need for discrete pre-defined templates. Additionally, in a multi-prompt setting, we empirically demonstrate the potential of prompt weighting in reflecting the importance of different prompts during training. Experimental results show that our approach achieves superior performance across 11 recognition datasets.
cs.AI [Back]
[102] Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery
Licong Xu,Milind Sarkar,Anto I. Lonappan,Íñigo Zubeldia,Pablo Villanueva-Domingo,Santiago Casas,Christian Fidler,Chetana Amancharla,Ujjwal Tiwari,Adrian Bayer,Chadi Ait Ekiou,Miles Cranmer,Adrian Dimitrov,James Fergusson,Kahaan Gandhi,Sven Krippendorf,Andrew Laverick,Julien Lesgourgues,Antony Lewis,Thomas Meier,Blake Sherwin,Kristen Surrao,Francisco Villaescusa-Navarro,Chi Wang,Xueqing Xu,Boris Bolliet
Main category: cs.AI
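TL;DR: The paper presents cmbagent, a multi-agent system of about 30 large language model (LLM) agents that automates scientific research workflows with no human in the loop, and applies it successfully to a PhD-level cosmology task.
Details
Motivation: Automating scientific research is a complex challenge that requires coordinating many tasks. The paper aims to achieve a fully automated research workflow with a multi-agent system, reducing human intervention. Contribution: Proposes cmbagent, an LLM-based multi-agent system that automatically carries out planning and control for research tasks and performs well on a cosmology task.
Method: The system comprises about 30 specialized LLM agents responsible for retrieval, code writing, result interpretation, and critique of other agents' output. A Planning & Control strategy orchestrates the workflow, and code is executed locally.
Result: cmbagent completes a PhD-level cosmology task and outperforms state-of-the-art LLMs on two benchmark sets; the source code and demonstration videos are publicly available.
Insight: Multi-agent systems can coordinate complex tasks effectively, and LLMs show promise for research automation, though task allocation and collaboration mechanisms need further study.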
Abstract: We present a multi-agent system for automation of scientific research tasks, cmbagent. The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
[103] ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
Yichen Lu,Wei Dai,Jiaen Liu,Ching Wing Kwok,Zongheng Wu,Xudong Xiao,Ao Sun,Sheng Fu,Jianyuan Zhan,Yian Wang,Takatomo Saito,Sicheng Lai
Main category: cs.AI
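TL;DR: ViDove is a translation agent system for multimodal input. It improves translation quality by combining visual and contextual background information with memory modules and domain knowledge, and it significantly outperforms existing methods.
Details
Motivation: Existing LLM-based translation agents are typically limited to text-only input and make no use of multimodal information. ViDove aims to fill this gap by imitating the workflow of human translators. Contribution: 1. Proposes ViDove, a translation agent system that supports multimodal input; 2. Introduces a multimodal memory system and long- and short-term memory modules enriched with domain knowledge; 3. Releases DoveBench, a new benchmark for long-form video subtitling and translation.
Method: Combines visual and contextual information, integrates a multimodal memory system with long- and short-term memory modules, and uses domain knowledge to improve the translation process.
Result: On subtitle generation and general translation tasks, ViDove improves BLEU by 28% and SubER by 15% over previous state-of-the-art baselines.
Insight: Multimodal background information and domain knowledge can substantially improve translation quality, particularly on complex tasks such as long-form video subtitling.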
Abstract: LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: https://github.com/pigeonai-org/ViDove
eess.IV [Back]
[104] Semi-supervised learning and integration of multi-sequence MR-images for carotid vessel wall and plaque segmentation
Marie-Christine Pali,Christina Schwaiger,Malik Galijasevic,Valentin K. Ladenhauf,Stephanie Mangesius,Elke R. Gizewski
Main category: eess.IV
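TL;DR: This paper proposes a semi-supervised deep learning approach for segmenting the carotid vessel wall and plaque in multi-sequence MRI data. It combines a coarse localization network, a fine segmentation network, and a data fusion strategy to address scarce labels and complex plaque morphology.
Details
Motivation: Carotid plaque analysis is crucial for assessing the risk of atherosclerosis and ischemic stroke, but plaque morphology is complex and labeled data is scarce, so an effective segmentation method is needed. Contribution: 1. Proposes a semi-supervised deep learning framework that combines coarse localization and fine segmentation networks; 2. Investigates fusion strategies for multi-sequence MRI data; 3. Proposes a multi-level multi-sequence U-Net architecture; 4. Addresses the scarcity of labeled data.
Method: 1. A coarse localization model identifies the region of interest; 2. A fine segmentation model delineates vessel walls and plaques; 3. Multi-sequence data is fused through the multi-level U-Net; 4. Semi-supervised training enforces consistency under input transformations.
Result: Evaluated on 52 patients with five MRI sequences each, the approach proves effective, and the choice of fusion point turns out to matter for U-Net-based architectures.
Insight: Multi-sequence data fusion and semi-supervised learning can markedly improve carotid artery segmentation, especially when labeled data is limited.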
Abstract: The analysis of carotid arteries, particularly plaques, in multi-sequence Magnetic Resonance Imaging (MRI) data is crucial for assessing the risk of atherosclerosis and ischemic stroke. Accurate segmentation is important for evaluating the metrics and radiomic features that quantify the state of atherosclerosis. However, the complex morphology of plaques and the scarcity of labeled data pose significant challenges. In this work, we address these problems and propose a semi-supervised deep learning-based approach designed to effectively integrate multi-sequence MRI data for the segmentation of carotid artery vessel wall and plaque. The proposed algorithm consists of two networks: a coarse localization model identifies the region of interest guided by some prior knowledge on the position and number of carotid arteries, followed by a fine segmentation model for precise delineation of vessel walls and plaques. To effectively integrate complementary information across different MRI sequences, we investigate different fusion strategies and introduce a multi-level multi-sequence version of U-Net architecture. To address the challenges of limited labeled data and the complexity of carotid artery MRI, we propose a semi-supervised approach that enforces consistency under various input transformations. Our approach is evaluated on 52 patients with arteriosclerosis, each with five MRI sequences. Comprehensive experiments demonstrate the effectiveness of our approach and emphasize the role of fusion point selection in U-Net-based architectures. To validate the accuracy of our results, we also include an expert-based assessment of model performance. Our findings highlight the potential of fusion strategies and semi-supervised learning for improving carotid artery segmentation in data-limited MRI applications.
[105] Compressive Imaging Reconstruction via Tensor Decomposed Multi-Resolution Grid Encoding
Zhenyu Jin,Yisi Luo,Xile Zhao,Deyu Meng
Main category: eess.IV
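TL;DR: This paper proposes GridTD, an unsupervised continuous representation framework for compressive imaging (CI) reconstruction based on tensor decomposition and multi-resolution grid encoding. It combines the hierarchical modeling ability of multi-resolution grid encoding with the compactness of tensor decomposition to balance efficiency and representation ability, and it supports its advantages with theoretical analysis and experiments.
Details
Motivation: CI reconstruction must recover high-dimensional images from low-dimensional compressed measurements. Existing unsupervised representations struggle to balance representation ability and efficiency, so a more efficient representation framework is needed. Contribution: Proposes the GridTD framework, which combines the hierarchy of multi-resolution grid encoding with the compactness of tensor decomposition to give an efficient and expressive continuous representation for CI reconstruction.
Method: Optimizes a lightweight neural network together with an input tensor decomposition model whose parameters are learned via multi-resolution hash grid encoding, enabling efficient reconstruction of high-dimensional images.
Result: On video SCI, spectral SCI, and compressive dynamic MRI reconstruction, GridTD outperforms existing methods, positioning it as a versatile, state-of-the-art CI reconstruction method.
Insight: The theoretical analysis of GridTD (Lipschitz property, generalization error bound, and fixed-point convergence) reveals its intrinsic advantages and suggests directions for further work on continuous representation models.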
Abstract: Compressive imaging (CI) reconstruction, such as snapshot compressive imaging (SCI) and compressive sensing magnetic resonance imaging (MRI), aims to recover high-dimensional images from low-dimensional compressed measurements. This process critically relies on learning an accurate representation of the underlying high-dimensional image. However, existing unsupervised representations may struggle to achieve a desired balance between representation ability and efficiency. To overcome this limitation, we propose Tensor Decomposed multi-resolution Grid encoding (GridTD), an unsupervised continuous representation framework for CI reconstruction. GridTD optimizes a lightweight neural network and the input tensor decomposition model whose parameters are learned via multi-resolution hash grid encoding. It inherently enjoys the hierarchical modeling ability of multi-resolution grid encoding and the compactness of tensor decomposition, enabling effective and efficient reconstruction of high-dimensional images. Theoretical analyses for the algorithm’s Lipschitz property, generalization error bound, and fixed-point convergence reveal the intrinsic superiority of GridTD as compared with existing continuous representation models. Extensive experiments across diverse CI tasks, including video SCI, spectral SCI, and compressive dynamic MRI reconstruction, consistently demonstrate the superiority of GridTD over existing methods, positioning GridTD as a versatile and state-of-the-art CI reconstruction method.
[106] MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC)
Hasaan Maqsood,Saif Ur Rehman Khan
Main category: eess.IV
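TL;DR: The paper proposes MeD-3D, a multimodal deep learning framework for precise prediction of recurrence in clear cell renal cell carcinoma (ccRCC) that integrates radiology, histopathology, clinical, and genomic data.
Details
Motivation: Existing single-modality prognostic models cannot fully capture the complexity of ccRCC, which limits predictive accuracy. Integrating multimodal data can improve prediction and support clinical decision-making. Contribution: 1. Proposes the multimodal deep learning framework MeD-3D; 2. Integrates CT, MRI, whole slide image (WSI), clinical, and genomic data; 3. Supports inference when some modalities are missing; 4. Extracts complementary information through a fusion strategy.
Method: 1. Uses CLAM for WSI data, MeD-3D for CT/MRI, and an MLP for clinical and genomic data; 2. Applies early and late fusion; 3. Supports inference on incomplete data.
Result: Experiments indicate that integrating multimodal data significantly improves the accuracy of ccRCC recurrence prediction.
Insight: Multimodal data fusion captures disease complexity better and supports personalized medicine.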
Abstract: Accurate prediction of recurrence in clear cell renal cell carcinoma (ccRCC) remains a major clinical challenge due to the disease's complex molecular, pathological, and clinical heterogeneity. Traditional prognostic models, which rely on single data modalities such as radiology, histopathology, or genomics, often fail to capture the full spectrum of disease complexity, resulting in suboptimal predictive accuracy. This study aims to overcome these limitations by proposing a deep learning (DL) framework that integrates multimodal data, including CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles, to improve the prediction of ccRCC recurrence and enhance clinical decision-making. The proposed framework utilizes a comprehensive dataset curated from multiple publicly available sources, including TCGA, TCIA, and CPTAC. To process the diverse modalities, domain-specific models are employed: CLAM, a ResNet50-based model, is used for histopathology WSIs, while MeD-3D, a pre-trained 3D-ResNet18 model, processes CT and MRI images. For structured clinical and genomic data, a multi-layer perceptron (MLP) is used. These models are designed to extract deep feature embeddings from each modality, which are then fused through an early and late integration architecture. This fusion strategy enables the model to combine complementary information from multiple sources. Additionally, the framework is designed to handle incomplete data, a common challenge in clinical settings, by enabling inference even when certain modalities are missing.
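The abstract states that inference remains possible when some modalities are missing but does not say how. One simple way to get that behavior, shown here purely as an assumed illustration, is late fusion that averages per-modality risk scores over whichever modalities are present. The modality names, the scores, and the averaging rule are all hypothetical and should not be read as the framework's actual fusion architecture.

```python
def late_fusion(scores):
    """Average per-modality recurrence scores, ignoring missing modalities.

    `scores` maps a modality name to a probability in [0, 1], or to None when
    that modality is unavailable for the patient (hypothetical convention).
    """
    available = [v for v in scores.values() if v is not None]
    if not available:
        raise ValueError("at least one modality is required")
    return sum(available) / len(available)

# Hypothetical patient with no structured clinical record.
patient = {"ct": 0.7, "wsi": 0.5, "clinical": None, "genomic": 0.6}
risk = late_fusion(patient)

# The same function still works when only one modality is present.
ct_only = late_fusion({"ct": 0.7, "wsi": None, "clinical": None, "genomic": None})
```

A learned fusion layer would normally replace the plain mean; the point of the sketch is only that the combining step must be defined over a variable set of inputs.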
[107] Label-Efficient Chest X-ray Diagnosis via Partial CLIP Adaptation
Heet Nitinkumar Dalsania
Main category: eess.IV
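TL;DR: The paper proposes a label-efficient strategy that partially fine-tunes the visual encoder of a CLIP model for chest X-ray diagnosis, substantially improving few-shot performance.
Details
Motivation: Medical imaging tasks usually depend on large labeled datasets, but in real hospital settings annotations are sparse and hard to obtain, so a label-efficient solution is needed. Contribution: Proposes a CLIP-based partial fine-tuning method that substantially improves few-shot chest X-ray diagnosis while simulating real hospital workflows.
Method: Uses a pre-trained CLIP ViT-B/32 model, partially fine-tunes its visual encoder, and evaluates with zero-shot and few-shot learning (1-16 labeled examples per class).
Result: In the few-shot setting, the method improves mean AUC by more than 20% over the zero-shot baseline.
Insight: Pre-trained vision-language features transfer effectively to few-shot medical imaging tasks, offering a practical and scalable option for real hospital settings.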
Abstract: Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP’s pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20% improvement in mean AUC score as compared to the zero-shot baseline. The key aspect of this work is to attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally, this research is intended for academic and experimental purposes only and has not been peer reviewed yet. All code is found at https://github.com/heet007-code/CLIP-disease-xray.
[108] Computationally Efficient Information-Driven Optical Design with Interchanging Optimization
Eric Markley,Henry Pinkard,Leyla Kabuli,Nalini Singh,Laura Waller
Main category: eess.IV
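TL;DR: The paper proposes IDEAL-IO, an improved information-driven optical design method that uses interchanging optimization to address the high memory usage, long runtimes, and mismatched objective function of IDEAL, and that applies to a range of imaging systems.
Details
Motivation: IDEAL enables application-agnostic optical design but suffers from high memory usage, long runtimes, and a potentially mismatched objective function. The paper proposes IDEAL-IO to address these problems. Contribution: Proposes IDEAL-IO, an information-driven optical design method based on interchanging optimization that significantly reduces memory usage and runtime while allowing more expressive density models to guide the design.
Method: IDEAL-IO decouples density estimation from optical parameter optimization and alternates between fitting models to the current measurements and updating the optical parameters, reducing the computational burden and improving optimization efficiency.
Result: On diffractive optics, lensless imaging, and snapshot 3D microscopy, IDEAL-IO reduces memory usage and runtime by up to 6x and reaches better designs.
Insight: By decoupling density estimation from parameter optimization, IDEAL-IO makes information-driven optical design more practical and scalable.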
Abstract: Recent work has demonstrated that imaging systems can be evaluated through the information content of their measurements alone, enabling application-agnostic optical design that avoids computational decoding challenges. Information-Driven Encoder Analysis Learning (IDEAL) was proposed to automate this process through gradient-based optimization. In this work, we study IDEAL across diverse imaging systems and find that it suffers from high memory usage, long runtimes, and a potentially mismatched objective function due to end-to-end differentiability requirements. We introduce IDEAL with Interchanging Optimization (IDEAL-IO), a method that decouples density estimation from optical parameter optimization by alternating between fitting models to current measurements and updating optical parameters using fixed models for information estimation. This approach reduces runtime and memory usage by up to 6x while enabling more expressive density models that guide optimization toward superior designs. We validate our method on diffractive optics, lensless imaging, and snapshot 3D microscopy applications, establishing information-theoretic optimization as a practical, scalable strategy for real-world imaging system design.
eess.SP [Back]
[109] mmFlux: Crowd Flow Analytics with Commodity mmWave MIMO Radar
Anurag Pallaprolu,Winston Hurst,Yasamin Mostofi
Main category: eess.SP
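TL;DR: The paper proposes mmFlux, a new framework that uses mmWave radar to analyze crowd flow patterns and semantics. It combines optical flow estimation with noise filtering to generate high-fidelity flow fields, then extracts key semantics through geometric graphs and Jacobian analysis. Experiments validate the approach.
Details
Motivation: Conventional crowd analysis methods such as cameras are limited by privacy concerns and occlusion. mmWave radar offers a non-intrusive alternative, but it has lacked a way to capture complex flow patterns and semantics precisely. Contribution: 1. Proposes a mmWave radar signal processing pipeline that combines optical flow with noise filtering; 2. Introduces geometric graphs to represent dominant flow patterns; 3. Extracts key crowd semantics through Jacobian analysis.
Method: Combines optical flow estimation with statistical and morphological noise filtering to generate flow fields, converts them into directed geometric graphs, and analyzes semantics using the curl and divergence derived from the local Jacobian.
Result: Experiments show that the framework reconstructs complex crowd flow structure with high fidelity and accurately infers semantics such as abrupt turns, boundaries, and gatherings.
Insight: Combining mmWave radar with signal processing and geometric analysis gives a privacy-friendly and robust approach to crowd analytics.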
Abstract: In this paper, we present a novel framework for extracting underlying crowd motion patterns and inferring crowd semantics using mmWave radar. First, our proposed signal processing pipeline combines optical flow estimation concepts from vision with novel statistical and morphological noise filtering to generate high-fidelity mmWave flow fields - compact 2D vector representations of crowd motion. We then introduce a novel approach that transforms these fields into directed geometric graphs, where edges capture dominant flow currents, vertices mark crowd splitting or merging, and flow distribution is quantified across edges. Finally, we show that by analyzing the local Jacobian and computing the corresponding curl and divergence, we can extract key crowd semantics for both structured and diffused crowds. We conduct 21 experiments on crowds of up to (and including) 20 people across 3 areas, using commodity mmWave radar. Our framework achieves high-fidelity graph reconstruction of the underlying flow structure, even for complex crowd patterns, demonstrating strong spatial alignment and precise quantitative characterization of flow split ratios. Finally, our curl and divergence analysis accurately infers key crowd semantics, e.g., abrupt turns, boundaries where flow directions shift, dispersions, and gatherings. Overall, these findings validate our framework, underscoring its potential for various crowd analytics applications.
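The curl and divergence analysis mentioned in the abstract can be illustrated on analytic 2D flow fields. This sketch estimates the local Jacobian by central differences; the field functions, the step size `h`, and the semantic readings in the comments are assumptions based on standard vector calculus, not the authors' radar pipeline.

```python
def jacobian(field, x, y, h=1e-5):
    """Central-difference Jacobian of a 2D vector field, where field(x, y)
    returns the flow components (u, v) at that point."""
    u_xp, v_xp = field(x + h, y)
    u_xm, v_xm = field(x - h, y)
    u_yp, v_yp = field(x, y + h)
    u_ym, v_ym = field(x, y - h)
    du_dx = (u_xp - u_xm) / (2 * h)
    du_dy = (u_yp - u_ym) / (2 * h)
    dv_dx = (v_xp - v_xm) / (2 * h)
    dv_dy = (v_yp - v_ym) / (2 * h)
    return [[du_dx, du_dy], [dv_dx, dv_dy]]

def divergence(J):
    """Trace of the Jacobian: positive where flow spreads out, negative where it converges."""
    return J[0][0] + J[1][1]

def curl(J):
    """Scalar 2D curl, dv/dx - du/dy: nonzero where the flow rotates or turns."""
    return J[1][0] - J[0][1]

# Hypothetical flow fields for illustration.
def rotation(x, y):
    return (-y, x)       # counter-clockwise turning motion

def dispersal(x, y):
    return (x, y)        # crowd spreading out from the origin

def gathering(x, y):
    return (-x, -y)      # crowd converging on the origin

J_rot = jacobian(rotation, 1.0, 2.0)
J_out = jacobian(dispersal, 1.0, 2.0)
J_in = jacobian(gathering, 1.0, 2.0)
```

On a measured flow field the same quantities would be computed per grid cell from neighboring vectors, so the noise filtering step described in the abstract matters a great deal: finite differences amplify noise.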
cs.RO [Back]
[110] LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation
Sonia Raychaudhuri,Enrico Cancelli,Tommaso Campari,Lamberto Ballan,Manolis Savva,Angel X. Chang
Main category: cs.RO
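TL;DR: The paper presents LangNavBench, a semantic navigation benchmark focused on natural language understanding, and introduces the Multi-Layered Feature Map (MLFM) method, which performs well on fine-grained language instructions.
Details
Motivation: Large vision-language models have advanced semantic navigation, but there has been no benchmark dataset or method dedicated to testing language understanding. LangNavBench fills this gap. Contribution: 1. Proposes the LangNav dataset and the LangNavBench benchmark, focused on fine-grained language understanding; 2. Proposes MLFM, which builds a multi-layered semantic map and performs well on tasks involving small objects and spatial relations.
Method: MLFM builds a queryable multi-layered semantic map to handle fine-grained language instructions such as attributes and spatial relations.
Result: MLFM outperforms existing mapping-based navigation baselines on the LangNav dataset.
Insight: Language understanding is central to semantic navigation, especially for fine-grained instructions, and multi-layered semantic representations can markedly improve performance.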
Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Despite these advances, we still lack a clear, language-focused benchmark for testing how well such agents ground the words in their instructions. We address this gap with LangNav, an open-set dataset specifically created to test an agent’s ability to locate objects described at different levels of detail, from broad category names to fine attributes and object-object relations. Every description in LangNav was manually checked, yielding a lower error rate than existing lifelong- and semantic-navigation datasets. On top of LangNav we build LangNavBench, a benchmark that measures how well current semantic-navigation methods understand and act on these descriptions while moving toward their targets. LangNavBench allows us to systematically compare models on their handling of attributes, spatial and relational cues, and category hierarchies, offering the first thorough, language-centric evaluation of embodied navigation systems. We also present Multi-Layered Feature Map (MLFM), a method that builds a queryable multi-layered semantic map, particularly effective when dealing with small objects or instructions involving spatial relations. MLFM outperforms state-of-the-art mapping-based navigation baselines on the LangNav dataset.
cs.CR [Back]
[111] May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks
Nishit V. Pandya,Andrey Labunets,Sicun Gao,Earlence Fernandes
Main category: cs.CR
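TL;DR: The paper targets fine-tuning-based prompt injection defenses with a new attention-based attack algorithm and breaks two recent defenses, reaching attack success rates of up to 70%.
Details
Motivation: Fine-tuning-based prompt injection defenses claim to separate instructions from data so that the LLM does not follow malicious instructions, but their actual security has not been thoroughly tested. The paper evaluates the robustness of this class of defenses. Contribution: 1. Proposes a novel attention-based attack algorithm; 2. Breaks two recent defenses (SecAlign and StruQ); 3. Exposes the limitations of existing defenses in the whitebox setting.
Method: Constructs optimization-based attacks that exploit the attention mechanism to target weaknesses in the defended models.
Result: Attack success rates reach up to 70% with only a modest increase in attacker budget, measured in tokens.
Insight: Fine-tuning-based defenses are readily broken in the whitebox setting, so more robust defense mechanisms need to be studied.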
Abstract: A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present with data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks
[112] Rainbow Artifacts from Electromagnetic Signal Injection Attacks on Image Sensors
Youqian Zhang,Xinyu Ji,Zhihao Wang,Qinhong Jiang
Main category: cs.CR
TL;DR: 该论文研究了一种针对图像传感器的电磁信号注入攻击,揭示了CMOS图像传感器在电磁干扰下会产生彩虹状伪影,并分析了这些攻击对目标检测模型的负面影响。
Details
Motivation: 图像传感器广泛应用于安全关键系统(如监控、自动驾驶等),其数据完整性对系统决策至关重要。然而,电磁信号注入攻击能够绕过传统数字完整性检查,直接干扰模拟域数据,引发潜在安全问题。Contribution: 1. 发现了一种新型电磁信号注入攻击现象,即CMOS图像传感器在特定干扰下会产生彩虹伪影。2. 证明了这些伪影会通过图像信号处理管道并影响目标检测模型的性能。3. 揭示了视觉感知系统中物理层攻击的未被充分探索的脆弱性。
Method: 通过精心调制的电磁干扰信号对CMOS图像传感器进行攻击实验,记录并分析了彩虹伪影的产生及其对图像数据的影响。进一步评估了这些干扰对目标检测模型的误导效果。
Result: 实验表明,电磁干扰导致的彩虹伪影能够显著降低目标检测模型的准确性,导致错误的预测结果。
Insight: 视觉感知系统的物理层(模拟域)攻击是一个重要的安全漏洞,传统数字防护措施无法完全覆盖,需要开发新的防御机制以应对此类威胁。
Abstract: Image sensors are integral to a wide range of safety- and security-critical systems, including surveillance infrastructure, autonomous vehicles, and industrial automation. These systems rely on the integrity of visual data to make decisions. In this work, we investigate a novel class of electromagnetic signal injection attacks that target the analog domain of image sensors, allowing adversaries to manipulate raw visual inputs without triggering conventional digital integrity checks. We uncover a previously undocumented attack phenomenon on CMOS image sensors: rainbow-like color artifacts induced in images captured by image sensors through carefully tuned electromagnetic interference. We further evaluate the impact of these attacks on state-of-the-art object detection models, showing that the injected artifacts propagate through the image signal processing pipeline and lead to significant mispredictions. Our findings highlight a critical and underexplored vulnerability in the visual perception stack, underscoring the need for more robust defenses against physical-layer attacks in such systems.