cs.CV

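[1] Multi-level Mixture of Experts for Multimodal Entity Linking cs.CV | cs.AI | cs.CL | cs.LG | cs.MM

Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan

TL;DR: The paper proposes a Multi-level Mixture of Experts (MMoE) model to address mention ambiguity and the dynamic selection of modal content in multimodal entity linking. By combining large language models with multimodal feature encoders, it substantially outperforms existing methods.

Details

Motivation: Multimodal entity linking (MEL) faces two challenges, mention ambiguity and the dynamic selection of modal content, that existing methods do not address effectively.

Result: Experiments show that MMoE significantly outperforms existing methods.

Insight: Combining large language models with multimodal feature encoders handles mention ambiguity and modality selection in MEL more effectively.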

Abstract: Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention’s textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention’s textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE’s code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
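To make the "switch mixture of experts" idea in the abstract above concrete, here is a minimal sketch of generic top-1 (switch) routing: a router scores the experts, and only the highest-scoring expert processes the input. The function names, the toy experts, and the router weights are all hypothetical and chosen for illustration; this is not the MMoE implementation, which is in the authors' repository.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_moe(x, router_weights, experts):
    """Route feature vector x to the single highest-scoring expert (top-1)."""
    # One router logit per expert: dot product of a weight row with x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)
    y = experts[k](x)
    # Scale the expert output by its gate probability, as switch routing
    # usually does so that the router stays trainable.
    return [probs[k] * v for v in y], k

# Hypothetical toy experts and router weights (assumptions, not from the paper).
experts = [
    lambda v: [2.0 * a for a in v],   # expert 0: scale
    lambda v: [a + 1.0 for a in v],   # expert 1: shift
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

out, chosen = switch_moe([0.2, 0.9], router_weights, experts)
print(chosen, out)
```

Because only one expert runs per input, compute stays roughly constant as experts are added, which is the usual motivation for switch-style routing.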


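[2] CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings cs.CV | eess.IV

Cristina Mata, Kanchana Ranasinghe, Michael S. Ryoo

TL;DR: The paper proposes CoPT, an unsupervised domain adaptation method that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder, achieving state-of-the-art results on four benchmarks.

Details

Motivation: Unsupervised domain adaptation (UDA) methods for semantic segmentation typically rely on labeled data, and annotation is costly. Despite progress in vision-language representation learning, existing methods have not fully exploited the domain-agnostic properties of text.

Result: On four benchmarks, CoPT achieves state-of-the-art performance for unsupervised domain adaptive semantic segmentation.

Insight: Exploiting the domain-agnostic nature of text embeddings can reduce the domain gap and improve the generalization of segmentation models.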

Abstract: Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves new state-of-the-art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.


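[3] Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning cs.CV | cs.CR | cs.LG

Renyang Liu, Guanlin Li, Tianwei Zhang, See-Kiong Ng

TL;DR: This paper proposes Recall, a novel multi-modal guided attack on unlearning in image generation models, exposing the vulnerability of current unlearning techniques to multi-modal adversarial inputs.

Details

Motivation: As image generation models such as Stable Diffusion grow more capable, their potential to generate harmful or infringing content raises ethical and legal concerns. Unlearning techniques attempt to address this, but their robustness has not been studied thoroughly, especially under multi-modal adversarial inputs.

Result: Experiments on ten state-of-the-art unlearning methods show that Recall outperforms baseline methods in adversarial effectiveness, computational efficiency, and semantic fidelity.

Insight: The results expose serious vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models.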

Abstract: Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at https://github.com/ryliu68/RECALL.


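[4] Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey cs.CV | cs.LG

Getamesay Haile Dagnaw, Yanming Zhu, Muhammad Hassan Maqsood, Wencheng Yang, Xingshuai Dong

TL;DR: A comprehensive survey of explainable artificial intelligence (XAI) in biomedical image analysis. It stresses the importance of XAI, proposes a modality-centered taxonomy, pays particular attention to multimodal and vision-language models, and summarizes current evaluation metrics and challenges.

Details

Motivation: Deep learning models in biomedical image analysis need greater transparency and trustworthiness to promote clinical adoption. Existing XAI surveys pay little attention to modality-specific needs and recent multimodal techniques.

Result: The paper provides a comprehensive survey of XAI methods, highlights the distinct challenges of different modalities, and shows the potential of multimodal techniques to improve interpretability.

Insight: Multimodal and vision-language models open new directions for interpretability in biomedical image analysis, but further research is needed to overcome issues of evaluation standards and practicality.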

Abstract: Explainable artificial intelligence (XAI) has become increasingly important in biomedical image analysis to promote transparency, trust, and clinical adoption of deep learning (DL) models. While several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. This survey addresses this gap through a comprehensive and structured synthesis of XAI methods tailored to biomedical image analysis. We systematically categorize XAI methods, analyzing their underlying principles, strengths, and limitations within biomedical contexts. A modality-centered taxonomy is proposed to align XAI methods with specific imaging types, highlighting the distinct interpretability challenges across modalities. We further examine the emerging role of multimodal learning and vision-language models in explainable biomedical AI, a topic largely underexplored in previous work. Our contributions also include a summary of widely used evaluation metrics and open-source frameworks, along with a critical discussion of persistent challenges and future directions. This survey offers a timely and in-depth foundation for advancing interpretable DL in biomedical image analysis.


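[5] Robust Multimodal Large Language Models Against Modality Conflict cs.CV | cs.AI | cs.CL

Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li

TL;DR: This paper studies hallucinations in multimodal large language models (MLLMs) caused by modality conflict, builds a dataset named MMMC to simulate the phenomenon, and mitigates it with three methods: prompt engineering, supervised fine-tuning, and reinforcement learning. Experiments show that reinforcement learning performs best at mitigating hallucinations caused by modality conflict.

Details

Motivation: MLLMs perform well on vision-language tasks but are prone to hallucinations in real-world scenarios. Existing work mostly focuses on conflicts between model responses and inputs, whereas this paper investigates the inherent conflicts between inputs from different modalities and their effect on the model.

Result: Reinforcement learning performs best at mitigating hallucinations caused by modality conflict, while supervised fine-tuning shows stable and promising performance.

Insight: Modality conflict is an overlooked cause of hallucinations in MLLMs, and reinforcement learning has a clear advantage in addressing it.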

Abstract: Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.


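[6] Aerial Maritime Vessel Detection and Identification cs.CV | cs.AI | cs.RO

Antonella Barisic Kulas, Frano Petric, Stjepan Bogdan

TL;DR: The paper presents a method for autonomous UAV-based maritime vessel detection and identification in GNSS-denied environments. It combines YOLOv8 object detection, feature matching, and hue histogram distance analysis to localize the target vessel, and was validated at the MBZIRC2023 competition.

Details

Motivation: In GNSS-denied environments, UAVs must rely on on-board vision to carry out large-area search tasks, and existing methods perform poorly when computing resources and visual cues are limited.

Result: The method was deployed successfully at the MBZIRC2023 competition, validating its effectiveness; the paper also analyzes the effect of perspective on detection accuracy and localization precision.

Insight: Combining visual features with geometric localization enables efficient target identification in GNSS-denied environments, though changes in perspective can affect detection accuracy.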

Abstract: Autonomous maritime surveillance and target vessel identification in environments where Global Navigation Satellite Systems (GNSS) are not available are critical for a number of applications such as search and rescue and threat detection. When the target vessel is only described by visual cues and its last known position is not available, unmanned aerial vehicles (UAVs) must rely solely on on-board vision to scan a large search area under strict computational constraints. To address this challenge, we leverage the YOLOv8 object detection model to detect all vessels in the field of view. We then apply feature matching and hue histogram distance analysis to determine whether any detected vessel corresponds to the target. When found, we localize the target using simple geometric principles. We demonstrate the proposed method in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation. We also evaluate the impact of perspective on detection accuracy and localization precision and compare it with the oracle approach.
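The hue histogram comparison mentioned in the abstract above can be illustrated with a small standard-library sketch. The abstract does not say which distance the authors use; the Hellinger distance, the 12-bin histogram, and the example pixel values below are assumptions made for illustration only.

```python
import colorsys
import math

def hue_histogram(pixels, bins=12):
    """Normalized histogram of hue values for (r, g, b) pixels in 0..255."""
    hist = [0.0] * bins
    for r, g, b in pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist

def hellinger_distance(p, q):
    """Distance in [0, 1]: 0 for identical histograms, 1 for disjoint ones."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))

# Hypothetical pixel samples: a red target hull and two detected vessels.
target = hue_histogram([(255, 0, 0)] * 50)
detection_a = hue_histogram([(250, 10, 5)] * 40)   # reddish, should match
detection_b = hue_histogram([(0, 0, 255)] * 40)    # blue, should not match

print(hellinger_distance(target, detection_a))
print(hellinger_distance(target, detection_b))
```

Hue is largely insensitive to brightness changes, which is one plausible reason to prefer it over raw RGB for aerial footage; a real system would crop each detection to its bounding box before building the histogram.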


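[7] A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality cs.CV

Mohamed Elmoghany, Ryan Rossi, Seunghyun Yoon, Subhojyoti Mukherjee, Eslam Bakr

TL;DR: The paper surveys the state of long-video storytelling generation, focusing on architecture design, character and scene consistency, and cinematic quality. It summarizes the key components and training strategies of 32 related papers and proposes a new taxonomy.

Details

Motivation: To address the weak character consistency, scene layout, and motion coherence of existing video generation models on long videos (over 16 seconds), as well as the frame redundancy and low temporal diversity of multi-character long-video generation.

Result: The survey identifies the key challenges in long-video generation and proposes future research directions, especially how to maintain consistency with multiple characters and complex narratives.

Insight: The core challenges of long-video generation are temporal consistency and narrative coherence; future work needs to combine stronger generative models with finer-grained control strategies.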

Abstract: Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled “long-form videos”. Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.


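[8] Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement cs.CV

Priyank Pathak, Yogesh S. Rawat

TL;DR: The paper proposes CSCI, a lightweight color-disentanglement method for clothes-changing ReID that needs no extra annotations or models and uses color information to separate appearance bias from identity features.

Details

Motivation: Existing clothes-changing ReID methods rely on additional models or annotations and are computationally expensive. The authors explore color as a lightweight proxy for separating clothing-related bias directly from images.

Result: Strong results on four CC-ReID datasets: Top-1 gains of 2.9% on LTCC and 5.0% on PRCC for image ReID, and 1.0% on CCVID and 2.5% on MeVID for video ReID.

Insight: Color is a lightweight proxy for clothing changes that separates appearance bias without additional annotations, offering a low-cost solution for clothes-changing ReID.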

Abstract: Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color, specifically foreground and background colors, as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias (‘Color See’) while disentangling it from identity-relevant ReID features (‘Color Ignore’). To achieve this, we introduce S2A self-attention, a novel self-attention mechanism that prevents information leakage between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. GitHub: https://github.com/ppriyank/ICCV-CSCI-Person-ReID.


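[9] Automated Video Segmentation Machine Learning Pipeline cs.CV

Johannes Merz, Lucien Fostier

TL;DR: This paper presents an automated machine learning pipeline for video segmentation that generates temporally consistent instance masks and substantially improves efficiency in visual effects (VFX) production.

Details

Motivation: In traditional VFX production, manual mask generation is slow and resource-intensive, so an automated solution is needed to improve efficiency.

Result: The pipeline significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, improving overall VFX production efficiency.

Insight: An automated pipeline built on machine learning can address mask generation in VFX production, and containerization speeds up practical deployment.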

Abstract: Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.


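[10] DisenQ: Disentangling Q-Former for Activity-Biometrics cs.CV

Shehreen Azad, Yogesh S Rawat

TL;DR: The paper proposes DisenQ, a framework that uses multimodal language guidance to resolve feature entanglement in activity-biometrics. It disentangles identity, motion, and non-biometric features and achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Traditional person identification struggles across diverse activities because identity cues are entangled with motion dynamics and appearance variations. Existing methods rely on additional visual data (such as pose or silhouette), whose inaccurate extraction limits performance. The paper therefore replaces such visual data with structured textual supervision to resolve feature entanglement.

Result: State-of-the-art performance on three activity-based video benchmarks, and strong generalization on a traditional video-based identification benchmark.

Insight: Language-guided disentanglement can replace the traditional reliance on additional visual data and improve the robustness and accuracy of biometric feature learning.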

Abstract: In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometrics feature learning more complex. While additional visual data like pose and/or silhouette help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce DisenQ (Disentangling Q-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.


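[11] LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation cs.CV | cs.AI | cs.CL

Ananya Raval, Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza

TL;DR: The paper introduces LinguaMark, a new benchmark for evaluating multimodal models on multilingual VQA. Closed-source models perform best overall, while the open-source model Qwen2.5 stands out in multilingual generalization.

Details

Motivation: Current large multimodal models (LMMs) have limited linguistic coverage, which can lead to biased and unfair outputs. The study aims to fill the gap in multimodal evaluation regarding multilingual capabilities.

Result: Closed-source models (such as GPT-4o and Gemini2.5) perform best overall; open-source models (such as Qwen2.5) generalize well across languages.

Insight: 1. Multimodal models have limitations on multilingual tasks; 2. Open-source models show potential in generalization; 3. More research is needed on model fairness and language coverage.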

Abstract: Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.


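[12] MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning cs.CV

Chengfei Wu, Ronald Seoh, Bingxuan Li, Liqiang Zhang, Fengrong Han

TL;DR: MagiC is a benchmark for evaluating multimodal cognition, focused on verifying whether vision-language models perform genuinely grounded visual reasoning rather than relying on dataset biases.

Details

Motivation: Large vision-language models perform impressively on visual question answering and multimodal reasoning, but it is unclear whether their reasoning is genuinely grounded in visual evidence.

Result: Tests on 15 vision-language models with 7B to 70B parameters reveal the limitations of current approaches to grounded visual reasoning.

Insight: MagiC reveals the shortcomings of vision-language models in fine-grained reasoning and visual grounding, pointing to directions for future improvement.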

Abstract: Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings to probe model robustness under adversarial visual cues and assess their capacity for introspective error correction. We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.


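[13] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation cs.CV

Sherry X. Chen, Yi Wei, Luowei Zhou, Suren Kumar

TL;DR: ADIEE proposes an automated dataset creation approach used to train a scoring model for evaluating instruction-guided image editing. The resulting scorer outperforms open-source VLMs and Gemini-Pro 1.5 across all benchmarks.

Details

Motivation: Automated evaluation of instruction-guided image editing remains challenging: open-source vision-language models (VLMs) are poorly aligned, proprietary models lack transparency and cost efficiency, and no public training datasets exist.

Result: The scorer performs strongly across benchmarks, improving score correlation with human ratings by 17.24% on AURORA-Bench and improving pair-wise comparison accuracy.

Insight: Beyond scoring, the ADIEE scorer can act as a reward model for automated best-edit selection and model fine-tuning, improving existing models.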

Abstract: Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model’s average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%).
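The abstract above reports each improvement as an absolute gain followed by a relative gain in parentheses. As a quick sanity check of that convention, the sketch below recomputes the one figure for which both endpoints are given (the ImagenHub score, 5.90 to 6.43). The baseline values in the second half are back-calculated here from the reported absolute and relative gains; they are inferences for illustration, not numbers stated in the abstract.

```python
def relative_gain(before, after):
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0

# Both endpoints are given in the abstract: 5.90 -> 6.43 on ImagenHub.
imagenhub_gain = relative_gain(5.90, 6.43)
print(round(imagenhub_gain, 2))  # 8.98, matching the reported +8.98%

def implied_baseline(absolute_gain, relative_gain_percent):
    """Baseline implied by an absolute gain and its relative gain (inferred)."""
    return absolute_gain / (relative_gain_percent / 100.0)

# Inferred, not reported: baselines implied by the stated gains.
corr_baseline = implied_baseline(0.0696, 17.24)    # score correlation, AURORA-Bench
genai_baseline = implied_baseline(4.03, 7.21)      # pair-wise accuracy (%), GenAI-Bench
aurora_baseline = implied_baseline(4.75, 9.35)     # pair-wise accuracy (%), AURORA-Bench
print(round(corr_baseline, 3), round(genai_baseline, 1), round(aurora_baseline, 1))
```

Reading the parenthesized values as relative gains keeps all four reported pairs internally consistent.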


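[14] Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory cs.CV | I.4.9

Hui Pang, Sunil Hadap, Violetta Shevchenko, Rahul Suresh, Amin Banitalebi-Dehkordi

TL;DR: The paper proposes a fast image synthesis method based on Kubelka-Munk theory for an efficient and realistic virtual try-on application for foundation makeup.

Details

Motivation: Augmented reality is increasingly used in the beauty industry, but existing virtual try-on techniques struggle to blend foundation with skin tone realistically, especially when realism must be maintained while scaling across many products.

Result: Validated on real-world makeup images, where the method outperforms other techniques.

Insight: Combining a theoretical approximation with engineering yields an efficient and realistic solution for virtual try-on.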

Abstract: Augmented reality is revolutionizing the beauty industry with virtual try-on (VTO) applications, which empower users to try a wide variety of products using their phones without the hassle of physically putting on real products. A critical technical challenge in foundation VTO applications is the accurate synthesis of foundation-skin tone color blending while maintaining the scalability of the method across diverse product ranges. In this work, we propose a novel method to approximate the well-established Kubelka-Munk (KM) theory for faster image synthesis while preserving foundation-skin tone color blending realism. Additionally, we build a scalable end-to-end framework for realistic foundation makeup VTO solely depending on the product information available on e-commerce sites. We validate our method using real-world makeup images, demonstrating that our framework outperforms other techniques.
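For readers unfamiliar with Kubelka-Munk theory, the sketch below shows the textbook relation for an opaque layer, K/S = (1 - R)^2 / (2R), its inverse, and additive mixing of K/S values. This is background only and not the paper's approximation, which the abstract does not spell out. Treating foundation over skin as a concentration-weighted mixture per color channel, and the example reflectance values, are simplifying assumptions made here.

```python
import math

def reflectance_to_ks(r):
    """Kubelka-Munk K/S ratio for an opaque layer with reflectance r in (0, 1]."""
    return (1.0 - r) ** 2 / (2.0 * r)

def ks_to_reflectance(ks):
    """Inverse of reflectance_to_ks for ks >= 0."""
    return 1.0 + ks - math.sqrt(ks * ks + 2.0 * ks)

def km_blend(skin, foundation, coverage):
    """Blend per-channel reflectances by mixing K/S linearly (assumed model).

    coverage is a hypothetical mixing weight in [0, 1]:
    0 returns the skin color, 1 returns the foundation color.
    """
    blended = []
    for rs, rf in zip(skin, foundation):
        ks = (1.0 - coverage) * reflectance_to_ks(rs) + coverage * reflectance_to_ks(rf)
        blended.append(ks_to_reflectance(ks))
    return blended

# Hypothetical linear-RGB reflectances for skin and a foundation shade.
skin = [0.55, 0.40, 0.32]
foundation = [0.70, 0.55, 0.45]
print(km_blend(skin, foundation, 0.5))
```

Mixing in K/S space instead of averaging RGB values is what lets pigment-like blends darken or shift hue nonlinearly, which is the realism that plain alpha blending lacks.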


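[15] Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning cs.CV | I.2; I.4; I.5; I.7

Daniel A. P. Oliveira, David Martins de Matos

TL;DR: The paper proposes a contrastive reinforcement learning approach that uses synthetic negative examples and a dual-component reward function to improve entity re-identification in visual storytelling systems, markedly improving entity grounding and coherence.

Details

Motivation: Existing visual storytelling systems (such as large vision-language models) struggle to recognize entities across frames, which leads to inconsistent references and hallucinations. The cause is a lack of explicit training on cross-frame entity connections.

Result: Grounding mAP improves from 0.27 to 0.31 (+14.8%) and F1 from 0.35 to 0.41 (+17.1%); cross-frame entity persistence increases, with entities appearing in 5 or more frames rising from 29.3% to 33.3% (+13.7%); the share of well-structured stories rises from 79.1% to 97.5% (+23.3%).

Insight: Explicitly training entity-connection behavior through contrastive reinforcement learning addresses entity re-identification in visual storytelling while improving semantic coherence and narrative quality.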

Abstract: Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%) and in F1 from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except “its”, and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
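Since the abstract above relies on Direct Preference Optimization, a minimal sketch of the standard per-pair DPO loss may help. It takes summed log-probabilities of a preferred and a rejected response under the policy and under a frozen reference model. How the paper's dual-component reward decides which response is preferred is not shown; the function name, the value of beta, and the example log-probabilities are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy favors the chosen response over the
    # rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))  # about 0.6931

# Hypothetical case: policy raised the chosen story and lowered the rejected one.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls below log(2) only when the policy prefers the chosen response more strongly than the reference does, so training pushes probability toward coherent stories without a separately trained reward model.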


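[16] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency cs.CV

Haotian Wang, Aoran Xiao, Xiaoqin Zhang, Meng Yang, Shijian Lu

TL;DR: PacGDC is a label-efficient technique that exploits ambiguity and consistency in 2D-to-3D projection to synthesize large amounts of pseudo-geometry data, reducing reliance on large-scale labeled data and improving the generalization of depth completion.

Details

Motivation: Depth completion models typically require large-scale labeled data, which is expensive to annotate. PacGDC aims to reduce this dependence by synthesizing diverse pseudo-geometry data from projection ambiguity and consistency.

Result: Experiments show that PacGDC performs strongly on multiple benchmarks, adapting to diverse scene semantics, scales, and depth sparsity/patterns, particularly in zero-shot and few-shot settings.

Insight: Synthesizing pseudo-geometry data through projection ambiguity and consistency is a label-efficient way to improve generalization.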

Abstract: Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.


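[17] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos cs.CV

Hao Xu, Arbind Agrahari Baniya, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazeley

TL;DR: This paper proposes a Multi-Scale Attention Gate Shift Module (MSAGSM) to improve fine-grained event spotting in videos and validates it on a new table tennis dataset (TTA).

Details

Motivation: Existing event spotting models are limited in temporal receptive field and spatial adaptability, and cannot effectively capture short- and long-term dependencies or salient regions.

Result: On five event spotting benchmarks, MSAGSM improves performance significantly with little computational overhead.

Insight: Multi-scale temporal modeling and spatial attention improve the accuracy of fine-grained event spotting.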

Abstract: Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.


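[18] KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos cs.CV | cs.AI

Jinseong Kim, Junghoon Song, Gyeongseon Baek, Byeongjoon Noh

TL;DR: KeyRe-ID is a keypoint-guided video-based person re-identification framework whose global and local branches use human keypoints to improve spatiotemporal representation learning.

Details

Motivation: Existing person re-identification methods often overlook human keypoint information, which limits fine-grained feature learning. KeyRe-ID uses keypoints to strengthen both global and local feature extraction.

Result: 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID.

Insight: Introducing keypoint information markedly improves video-based person re-identification, especially fine-grained feature learning.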

Abstract: We propose KeyRe-ID, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.


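[19] Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer cs.CV | cs.AI

Zhimin Zhang, Bi’an Du, Caoyuan Ma, Zheng Wang, Wei Hu

TL;DR: This paper proposes a new cross-category animal motion transfer framework that preserves species-specific behavioral habits, addressing a gap in existing methods, and integrates a large language model (LLM) to support unseen species.

Details

Motivation: Existing motion transfer methods mainly target human motion and emphasize skeletal alignment or stylistic consistency while neglecting the distinctive habitual behaviors of animals. This paper aims to fill that gap.

Result: Experiments are conducted on the DeformingThings4D-skl dataset, and quantitative analysis shows that the model outperforms existing methods.

Insight: Preserving behavioral habits is essential for animal motion transfer, and an LLM can improve generalization to new species.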

Abstract: Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with a category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.


[20] Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections cs.CV

Yongtang Bao, Chengjie Tang, Yuze Wang, Haojie Li

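TL;DR: Seg-Wild proposes an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, using multi-dimensional feature embeddings and the Spiky 3D Gaussian Cutter (SGC) to handle inconsistent lighting and occlusions.

Details

Motivation: Unconstrained image collections are easy to obtain, but their inconsistent lighting and transient occlusions make segmentation challenging, and existing methods struggle with these problems.

Result: Experiments show that Seg-Wild outperforms existing methods in segmentation and reconstruction quality.

Insight: Combining 3D Gaussian Splatting with interactive feature embeddings handles segmentation of unconstrained images effectively, and SGC offers a new approach to handling abnormal Gaussians.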

Abstract: Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene’s lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.
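The similarity step described in the abstract (comparing each Gaussian's feature embedding with the segmentation target) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of cosine similarity, the fixed threshold, and the names `cosine_similarity` and `select_gaussians` are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, treat as dissimilar
    return dot / (norm_a * norm_b)

def select_gaussians(embeddings, target, threshold=0.8):
    """Return indices of Gaussians whose embedding matches the target.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    return [
        index
        for index, embedding in enumerate(embeddings)
        if cosine_similarity(embedding, target) >= threshold
    ]

# Toy per-Gaussian embeddings (one row per 3D Gaussian).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
target = [1.0, 0.0, 0.0]  # embedding of the user-selected segmentation target
selected = select_gaussians(embeddings, target)
```

In an interactive setting the target embedding would come from the user's click or prompt, and the selected indices would define the segmented subset of the scene.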


[21] EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction cs.CV

Xinan Zhang, Muhammad Zubair Irshad, Anthony Yezzi, Yi-Chang Tsai, Zsolt Kira

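TL;DR: EscherNet++ is a diffusion model based on masked fine-tuning that performs zero-shot novel view synthesis and amodal completion simultaneously; its single-stage design and enhanced feed-forward 3D reconstruction substantially improve efficiency and performance.

Details

Motivation: Existing methods typically use complex multi-stage pipelines for amodal completion and novel view synthesis, which ignore cross-view dependencies and incur redundant storage and computation. EscherNet++ aims to solve these problems through end-to-end masked fine-tuning.

Result: On occluded tasks, PSNR improves by 3.9 and Volume IoU by 0.28 (10-input setting), and the model performs well on real-world occluded reconstruction. Reconstruction time is reduced by 95%.

Insight: 1. Masked fine-tuning effectively combines the amodal completion and novel view synthesis tasks. 2. The single-stage approach significantly reduces computational redundancy. 3. The model’s scalability supports fast 3D reconstruction.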

Abstract: We propose EscherNet++, a masked fine-tuned diffusion model that can synthesize novel views of objects in a zero-shot manner with amodal completion ability. Existing approaches utilize multiple stages and complex pipelines to first hallucinate missing parts of the image and then perform novel view synthesis, which fail to consider cross-view dependencies and require redundant storage and computing for separate stages. Instead, we apply masked fine-tuning including input-level and feature-level masking to enable an end-to-end model with the improved ability to synthesize novel views and conduct amodal completion. In addition, we empirically integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results with reconstruction time decreased by 95%, thanks to its ability to synthesize arbitrary query views. Our method’s scalable nature further enhances fast 3D reconstruction. Despite fine-tuning on a smaller dataset and batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in 10-input settings, while also generalizing to real-world occluded reconstruction.


[22] EPIC: Efficient Prompt Interaction for Text-Image Classification cs.CV

Xinyao Yu, Hao Sun, Zeyu Ling, Ziwei Niu, Zhenjia Bai

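TL;DR: The paper proposes EPIC, an efficient prompt interaction method for text-image classification that uses temporal prompts on intermediate layers and similarity-based modality interaction to significantly reduce computational cost and the number of trainable parameters.

Details

Motivation: Large-scale pre-trained multimodal models (LMMs) are computationally expensive to fine-tune, which makes it necessary to study efficient prompt interaction strategies for aligning modalities.

Result: EPIC performs strongly on the UPMC-Food101 and SNLI-VE datasets and reaches comparable performance on MM-IMDB, while saving computational resources.

Insight: Prompt interaction strategies can maintain or even improve model performance while reducing computational cost, offering an efficient solution for multimodal tasks.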

Abstract: In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks, such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies are studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. Utilizing this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.


[23] Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning cs.CV

Jingjing Jiang, Chao Ma, Xurui Song, Hanwang Zhang, Jun Luo

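TL;DR: This paper presents Corvid, an enhanced multimodal large language model (MLLM) whose improved chain-of-thought (CoT) reasoning yields strong results on complex tasks such as mathematical reasoning and science problem-solving.

Details

Motivation: Existing open-source MLLMs fall notably short on complex, structured reasoning tasks. Corvid aims to close this gap and improve multimodal reasoning.

Result: Corvid outperforms comparable models in mathematical reasoning and science problem-solving.

Insight: High-quality datasets and targeted training strategies are key to improving the reasoning ability of MLLMs.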

Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid’s CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.github.io/corvid.


[24] Towards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects cs.CV

Yuqi Cheng, Yihan Sun, Hui Zhang, Weiming Shen, Yunkang Cao

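TL;DR: This paper proposes a high-resolution 3D anomaly detection approach comprising the MiniShift dataset (the first high-resolution 3D anomaly detection dataset) and the Simple3D framework (an efficient real-time detection framework), advancing industrial defect detection.

Details

Motivation: Existing 3D anomaly detection benchmarks are mostly low-resolution, which makes subtle industrial defects hard to detect; high-resolution data and efficient methods are urgently needed.

Result: Simple3D surpasses existing methods on MiniShift and other benchmarks, with real-time inference above 20 fps.

Insight: High-resolution data and effective feature aggregation are essential for improving the practicality and accuracy of 3D anomaly detection.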

Abstract: In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.


[25] Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation cs.CV

Ling Zhou, Runtian Yuan, Yi Liu, Yuejie Zhang, Rui Feng

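TL;DR: DSANet introduces an Adjacent-Frame Semantic-Aware (AFSA) module and a Local-and-Global Semantic-Aware (LGSA) module, significantly improving noise robustness in ultrasound video segmentation and outperforming existing methods on multiple benchmark datasets.

Details

Motivation: The inherent noise of ultrasound images makes automated segmentation challenging, especially in video sequences. This paper aims to improve the model’s noise resistance by strengthening the semantic association between local and global features.

Result: On four benchmark datasets, DSANet outperforms existing methods in segmentation accuracy, runs at higher inference FPS than video-based methods, and even surpasses some image-based models in speed.

Insight: By avoiding pixel-level dependencies and strengthening semantic associations, the model is more stable under noise and also more computationally efficient.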

Abstract: Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi-level semantic representation, significantly improving the model’s resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code is available at https://github.com/ZhouL2001/DSANet.
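To make the idea of a channel-wise similarity matrix concrete, here is a small sketch of how such a matrix between two adjacent frames might guide fusion. It is an assumption-laden illustration rather than the AFSA module itself: the cosine similarity, the row-wise softmax, the residual fusion rule, and all function names are hypothetical choices. The point it shows is that the matrix has size C x C (channels by channels), so no pixel-to-pixel correspondence is computed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flattened channel maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def softmax(row):
    """Numerically stable softmax over one row of similarities."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_adjacent(curr, prev):
    """Fuse the previous frame into the current one, channel by channel.

    `curr` and `prev` are lists of C channels, each a flattened list of
    H*W values. The C x C weights come from channel-level similarity only.
    """
    weights = [softmax([cosine(c, p) for p in prev]) for c in curr]
    fused = []
    for c_idx, channel in enumerate(curr):
        # Weighted mix of previous-frame channels (hypothetical fusion rule).
        mixed = [
            sum(weights[c_idx][j] * prev[j][k] for j in range(len(prev)))
            for k in range(len(channel))
        ]
        fused.append([x + y for x, y in zip(channel, mixed)])
    return fused, weights

curr = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # frame t, 2 channels
prev = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # frame t-1, 2 channels
fused, weights = fuse_adjacent(curr, prev)
```

Because the weights depend on whole-channel statistics, a single noisy pixel has little influence on which previous-frame channels are mixed in.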


[26] Bluish Veil Detection and Lesion Classification using Custom Deep Learnable Layers with Explainable Artificial Intelligence (XAI) cs.CV | cs.AI

M. A. Rasel, Sameem Abdul Kareem, Zhenli Kwan, Shin Shen Yong, Unaizah Obaidellah

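TL;DR: The paper proposes a method for blue-white veil (BWV) detection and lesion classification based on custom deep learnable layers and explainable artificial intelligence (XAI); using an improved imaging algorithm and multiple datasets, it significantly improves BWV detection accuracy.

Details

Motivation: Melanoma is one of the deadliest skin cancers, and BWV is a key feature for diagnosing it. Research on BWV detection is currently limited, so a more efficient and accurate method is needed to support early diagnosis.

Result: Test accuracy on each dataset: PH2 (85.71%), ISIC (95.00%), PH2+ISIC (95.05%), and Derm7pt (90.00%), all better than existing methods.

Insight: Combining custom layers with XAI not only improves BWV detection accuracy but also lays a foundation for the model’s trustworthiness and clinical applicability.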

Abstract: Melanoma, one of the deadliest types of skin cancer, accounts for thousands of fatalities globally. The bluish, blue-whitish, or blue-white veil (BWV) is a critical feature for diagnosing melanoma, yet research into detecting BWV in dermatological images is limited. This study utilizes a non-annotated skin lesion dataset, which is converted into an annotated dataset using a proposed imaging algorithm based on color threshold techniques on lesion patches and color palettes. A Deep Convolutional Neural Network (DCNN) is designed and trained separately on three individual and combined dermoscopic datasets, using custom layers instead of standard activation function layers. The model is developed to categorize skin lesions based on the presence of BWV. The proposed DCNN demonstrates superior performance compared to conventional BWV detection models across different datasets. The model achieves a testing accuracy of 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. An explainable artificial intelligence (XAI) algorithm is subsequently applied to interpret the DCNN’s decision-making process regarding BWV detection. The proposed approach, coupled with XAI, significantly improves the detection of BWV in skin lesions, outperforming existing models and providing a robust tool for early melanoma diagnosis.


[27] Objectomaly: Objectness-Aware Refinement for OoD Segmentation with Structural Consistency and Boundary Precision cs.CV | cs.AI

Jeonghoon Song, Sunghun Kim, Jaegyun Im, Byeongjoon Noh

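TL;DR: Objectomaly proposes an objectness-aware refinement framework that incorporates object-level priors to improve boundary precision and structural consistency in OoD segmentation, reaching SOTA performance on multiple benchmarks.

Details

Motivation: Existing mask-based OoD segmentation methods suffer from imprecise boundaries, inconsistent anomaly scores within objects, and false positives caused by background noise.

Result: On benchmarks such as SMIYC and RoadAnomaly, AuPRC reaches up to 96.99, FPR95 drops to 0.07, and the F1 score reaches 83.44, outperforming existing methods.

Insight: Combining object-level priors with image processing techniques significantly improves boundary precision and structural consistency in OoD segmentation, making the approach suitable for safety-sensitive scenarios such as autonomous driving.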

Abstract: Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
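The second stage, object-level score normalization with instance masks, can be illustrated with a short sketch. The abstract does not say which statistic is used, so replacing every pixel score inside a mask with the mask's mean is an assumption here, as are the function name `calibrate_scores` and the boolean-grid mask format.

```python
def calibrate_scores(scores, masks):
    """Make anomaly scores consistent within each instance mask.

    `scores` is an H x W grid of per-pixel anomaly scores; `masks` is a
    list of H x W boolean grids, one per instance. Each masked region is
    overwritten with its mean score (an assumed normalization rule).
    If masks overlap, later masks take precedence.
    """
    calibrated = [row[:] for row in scores]  # copy so the input is untouched
    for mask in masks:
        pixels = [
            (r, c)
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        ]
        if not pixels:
            continue  # an empty mask contributes nothing
        mean = sum(scores[r][c] for r, c in pixels) / len(pixels)
        for r, c in pixels:
            calibrated[r][c] = mean
    return calibrated

scores = [[0.9, 0.1],
          [0.8, 0.2]]
object_mask = [[True, False],
               [True, False]]  # one instance covering the left column
calibrated = calibrate_scores(scores, [object_mask])
```

After calibration the two pixels of the object share one score, which is the kind of within-object consistency the abstract describes; pixels outside any mask keep their coarse scores.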


[28] Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking cs.CV | cs.CR

Qiangqiang Wu, Yi Yu, Chenqi Kong, Ziquan Liu, Jia Wan

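TL;DR: This paper proposes Temporal Unlearnable Examples (TUEs), which generate unlearnable noise to protect personal video data from exploitation by unauthorized visual object tracking models.

Details

Motivation: With the rise of social media, large numbers of private user-uploaded videos have been collected without authorization and used to train visual object tracking models, exposing data-privacy problems. Existing methods mainly target image tasks and work poorly when applied directly to video.

Result: Experiments show that TUEs achieve SOTA performance in video data-privacy protection, with strong transferability.

Insight: By corrupting the learning of the temporal matching task, TUEs effectively protect video data privacy and offer a new direction for video data security.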

Abstract: With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification); directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable to large-scale video datasets. Trackers trained with TUEs rely heavily on unlearnable noise for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.


[29] Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles cs.CV

Jiaxu Wan, Xu Wang, Mengwei Xie, Xinyuan Chang, Xinran Liu

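TL;DR: The paper proposes a hybrid navigation framework and benchmark for autonomous vehicles, focusing on the association between global SD maps and online HD maps.

Details

Motivation: Autonomous vehicles need to combine global SD maps with online HD maps for hybrid navigation, but existing research often overlooks the association between the two, which limits the real-world use of online HD maps.

Result: The OMA benchmark contains 480k roads and 260k lane paths and provides metrics for evaluating model performance. The framework code and dataset are publicly available.

Insight: The ability to associate global and local maps is essential for navigation planning in autonomous vehicles, and attention mechanisms can effectively improve association performance.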

Abstract: Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, which makes it challenging to utilize online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, OMA contains 480k roads and 260k lane paths and provides the corresponding metrics to evaluate the performance of the model. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.


[30] Divergence Minimization Preference Optimization for Diffusion Model Alignment cs.CV | cs.LG

Binxu Li, Minkai Xu, Meihua Dang, Stefano Ermon

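TL;DR: This paper proposes DMPO, which aligns diffusion models by minimizing reverse KL divergence, addressing the tendency of existing methods to get trapped in suboptimal mean-seeking optimization; experiments show that it outperforms existing baselines.

Details

Motivation: Diffusion models have been remarkably successful in text-to-image generation, but how to further align them with human preferences remains an important problem. Existing methods suffer from suboptimal mean-seeking behavior, so a better-founded alignment method is needed.

Result: Experiments show that DMPO outperforms existing baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating its superiority in aligning generative behavior with desired outputs.

Insight: DMPO offers a robust and elegant path to preference alignment for diffusion models, tying principled theory closely to practical performance.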

Abstract: Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically outperforming all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating the method’s superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
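The contrast between mean-seeking and reverse-KL optimization can be seen in a toy discrete example. The sketch below is not DMPO's training objective; it only illustrates the general property that forward KL, KL(p || q), favors a model q that covers all modes of the target p, while reverse KL, KL(q || p), favors a q concentrated on one mode. The distributions and names are made up for the illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Bimodal target: most mass sits in the first and last bins.
target = [0.45, 0.04, 0.02, 0.04, 0.45]

# Two candidate models of the target.
q_cover = [0.2, 0.2, 0.2, 0.2, 0.2]      # spreads mass over everything
q_mode = [0.9, 0.04, 0.02, 0.02, 0.02]   # commits to a single mode

forward_cover = kl_divergence(target, q_cover)  # KL(p || q_cover)
forward_mode = kl_divergence(target, q_mode)    # KL(p || q_mode)
reverse_cover = kl_divergence(q_cover, target)  # KL(q_cover || p)
reverse_mode = kl_divergence(q_mode, target)    # KL(q_mode || p)

# Which candidate each direction of the divergence prefers (lower is better).
forward_prefers = "q_cover" if forward_cover < forward_mode else "q_mode"
reverse_prefers = "q_cover" if reverse_cover < reverse_mode else "q_mode"
```

Here the forward direction prefers the spread-out candidate and the reverse direction prefers the single-mode candidate, which is the behavior the abstract's mean-seeking versus reverse-KL framing refers to.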


[31] GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction cs.CV

Shuaijin Wan, Huaijiang Sun

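TL;DR: GGMotion proposes a dynamics-kinematics model based on a group graph network; grouped joint interactions and a new radial field design improve the physical plausibility and performance of human motion prediction.

Details

Motivation: Existing methods typically represent the human pose as an abstract graph structure and ignore the physical dependencies between joints, which makes learning harder and tends to produce unrealistic motions. GGMotion aims to solve this through grouped dynamics-kinematics modeling.

Result: GGMotion performs strongly on the Human3.6M, CMU-Mocap, and 3DPW benchmarks, and significantly outperforms existing methods in short-term motion prediction in particular.

Insight: Grouped modeling and the explicit introduction of physical constraints effectively improve the plausibility and accuracy of motion prediction, and geometric equivariance is essential for tasks in 3D space.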

Abstract: Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLP), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving a significant performance margin in short-term motion prediction. The code is available at https://github.com/inkcat520/GGMotion.git.


[32] MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation cs.CV

Bangning Wei, Joshua Maraval, Meriem Outtas, Kidiyo Kpalma, Nicolas Ramin

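TL;DR: The paper presents MUVOD, a new benchmark dataset for 4D object segmentation in multi-view video that fills a data gap in dynamic scene segmentation and advances the field by providing rich annotations and a baseline method.

Details

Motivation: Methods based on NeRF and 3D Gaussian Splatting perform well for 3D object segmentation in static scenes, but 4D object segmentation of dynamic scenes remains underexplored because high-quality multi-view video datasets are lacking. MUVOD is proposed to address this.

Result: The MUVOD dataset covers diverse dynamic scenes and provides reliable data support for 4D and 3D object segmentation.

Insight: 4D segmentation of dynamic scenes needs more high-quality datasets and standardized evaluation benchmarks, and MUVOD provides an important resource for this.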

Abstract: The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods. Our proposed MUVOD dataset is available at https://volumetric-repository.labs.b-com.com/#/muvod.


[33] MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models cs.CV

Joelle Hanna, Linus Scheibenreif, Damian Borth

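TL;DR: MAPEX is a remote sensing foundation model based on a mixture of modality experts; its modality-aware expert pruning enables efficient task-specific fine-tuning and deployment.

Details

Motivation: Existing remote sensing foundation models are usually pre-trained on specific modalities, which does not match the multi-modal needs of real applications, and they are large and costly to fine-tune.

Result: Validated on multiple remote sensing datasets, MAPEX performs strongly against fully supervised training and other state-of-the-art remote sensing foundation models.

Insight: Modality-aware expert pruning offers an efficient and flexible solution for multi-modal tasks.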

Abstract: Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at https://github.com/HSG-AIML/MAPEX.


[34] Beyond the Linear Separability Ceiling cs.CV

Enrico Vompa, Tanel Tammet, Mohit Vaishnav

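TL;DR: The paper shows that the performance bottleneck of vision-language models (VLMs) on abstract reasoning tasks is tied to the linear separability of their visual embeddings, measured by the Linear Separability Ceiling (LSC), and proposes task-dependent interventions (such as activating existing pathways or adapting weights) to overcome it.

Details

Motivation: The performance of existing VLMs on abstract reasoning tasks is limited by the linear separability of their visual embeddings (the LSC). This bottleneck stems not from poor perception but from failures in the language model’s reasoning pathways. The paper sets out to investigate this phenomenon and propose solutions.

Result: The study finds that task-dependent interventions can significantly improve model performance, but complex tasks require deeper weight adaptation. In addition, explicitly improving representation quality alone can cause failures on new prompt formats.

Insight: Robust reasoning depends on targeted, task-aligned intervention rather than on improved representation learning, which offers a new perspective for analyzing and optimizing VLMs.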

Abstract: Most state-of-the-art Visual-Language Models (VLMs) are seemingly limited by the linear separability of their visual embeddings on abstract reasoning tasks. This work investigates this “linear reasoning bottleneck” by introducing the Linear Separability Ceiling (LSC), the performance of a simple linear classifier on a VLM’s visual embeddings. We find this bottleneck is widespread and stems not from poor perception, but from failures in the language model’s reasoning pathways. We demonstrate this is a solvable alignment issue. The required intervention, however, is task-dependent: activating existing pathways suffices for semantic concepts, while complex relational reasoning requires adapting core model weights. Using postfix tuning as a methodological control, we find strong evidence for powerful, dormant reasoning pathways within VLMs. However, for complex relational tasks requiring deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats despite its embeddings remaining well separated. Ultimately, this work provides a new lens for VLM analysis, showing that robust reasoning is a matter of targeted alignment, not simply improved representation learning.
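Since the LSC is defined as the performance of a simple linear classifier on a VLM's visual embeddings, a linear probe is the natural way to picture it. The sketch below trains a logistic-regression probe with plain stochastic gradient descent on toy two-dimensional "embeddings"; the data, hyperparameters, and function names are assumptions for illustration, and a real measurement would use embeddings extracted from the model and a held-out split.

```python
import math

def train_linear_probe(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with per-sample gradient steps."""
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            error = pred - y                   # gradient of the log loss w.r.t. z
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def probe_accuracy(weights, bias, embeddings, labels):
    """Fraction of examples on the correct side of the linear boundary."""
    correct = 0
    for x, y in zip(embeddings, labels):
        z = sum(w * v for w, v in zip(weights, x)) + bias
        correct += int((z > 0.0) == (y == 1))
    return correct / len(labels)

# Toy stand-ins for visual embeddings of two concept classes.
embeddings = [
    [2.0, 1.0], [1.5, 2.0], [2.5, 1.5],        # class 1
    [-1.0, -2.0], [-2.0, -1.0], [-1.5, -1.5],  # class 0
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_probe(embeddings, labels)
ceiling = probe_accuracy(weights, bias, embeddings, labels)
```

Comparing such a probe's accuracy with the full model's end-to-end accuracy on the same task indicates whether the model's reasoning goes beyond what is already linearly decodable from its visual embeddings.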


[35] Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation cs.CV

Chunyan Wang, Dong Zhang, Jinhui Tang

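TL;DR: The paper proposes DGKD-WLSS, a new framework that combines diffusion-guided knowledge distillation with depth-guided feature fusion to address image quality degradation and semantic ambiguity in weakly-supervised low-light semantic segmentation, achieving state-of-the-art performance.

Details

Motivation: Weakly-supervised semantic segmentation degrades significantly in low-light environments, mainly because of image quality degradation (such as low contrast, noise, and color distortion) and the inherent limits of weak supervision. These problems lead to unreliable class activation maps and semantically ambiguous pseudo-labels, which impair the model’s ability to learn discriminative feature representations.

Result: Experiments show that DGKD-WLSS achieves state-of-the-art performance on weakly-supervised low-light semantic segmentation.

Insight: The paper identifies the core challenges of weakly-supervised semantic segmentation in low light and proposes an approach that combines diffusion and depth information, significantly improving model performance.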

Abstract: Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model’s ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at: https://github.com/ChunyanWang1/DGKD-WLSS.


[36] NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning cs.CV | cs.AI

Tianwei Mu, Feiyu Duan, Bo Zhou, Dan Xue, Manhong Huang

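TL;DR: This paper proposes NexViTAD, a few-shot cross-domain anomaly detection framework based on vision foundation models; its shared subspace projection and multi-task learning modules address domain shift in industrial anomaly detection and achieve state-of-the-art performance.

Details

Motivation: In industrial anomaly detection, differences in data distribution across domains (domain shift) limit model generalization, and traditional methods require large amounts of labeled data. NexViTAD addresses this by combining pre-trained vision foundation models with few-shot learning.

Result: On the MVTec AD dataset, NexViTAD reaches an AUC of 97.5%, an AP of 70.4%, and a PRO of 95.2%, surpassing existing models.

Insight: Fusing pre-trained models with domain adaptation techniques can significantly improve cross-domain anomaly detection under few-shot conditions.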

Abstract: This paper presents a novel few-shot cross-domain anomaly detection framework, Nexus Vision Transformer for Anomaly Detection (NexViTAD), based on vision foundation models, which effectively addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms and a multi-task learning (MTL) module. The main innovations include: (1) a hierarchical adapter module that adaptively fuses complementary features from Hiera and DINO-v2 pre-trained models, constructing more robust feature representations; (2) a shared subspace projection strategy that enables effective cross-domain knowledge transfer through bottleneck dimension constraints and skip connection mechanisms; (3) an MTL decoder architecture that supports simultaneous processing of multiple source domains, significantly enhancing model generalization capabilities; (4) an anomaly score inference method based on Sinkhorn-K-means clustering, combined with Gaussian filtering and adaptive threshold processing for precise pixel level. Evaluated on the MVTec AD dataset, NexViTAD delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains, surpassing other recent models, marking a transformative advance in cross-domain defect detection.


[37] HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking cs.CV

Ruixiang Chen, Guolei Sun, Yawei Li, Jie Qin, Luca Benini

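TL;DR: The paper proposes HiM2SAM, which adds hierarchical motion estimation and memory optimization to the SAM2 framework, significantly improving its performance on long-term video object tracking, particularly under occlusion and target reappearance.

Details

Motivation: In video object tracking, complex scenes (occlusion, background clutter, target reappearance) challenge tracking algorithms. SAM2 has limitations in these cases, so a low-overhead improvement that needs no additional training is required.

Result: On the LaSOT and LaSOText datasets, the large model’s AUC improves by 9.6% and 7.2% in relative terms, and the gains are even larger for smaller models.

Insight: Training-free improvements can still significantly boost tracking performance; optimizing motion estimation and memory management in particular helps handle long-term occlusion and appearance changes.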

Abstract: This paper presents enhancements to the SAM2 framework for the video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our training-free, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
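
The "lightweight linear prediction" step can be pictured with a constant-velocity extrapolation. The sketch below is only an illustration under that assumption; the function names and the nearest-center selection rule are hypothetical and are not taken from the HiM2SAM code.

```python
import math


def predict_center(prev_center, curr_center):
    """Constant-velocity extrapolation: next = curr + (curr - prev)."""
    return (2 * curr_center[0] - prev_center[0],
            2 * curr_center[1] - prev_center[1])


def pick_candidate(candidate_centers, predicted_center):
    """Return the candidate whose center lies closest to the prediction.

    Hypothetical selection rule used only for illustration.
    """
    return min(candidate_centers, key=lambda c: math.dist(c, predicted_center))


# Target moved from (0, 0) to (2, 1), so the next center is expected near (4, 2).
predicted = predict_center((0.0, 0.0), (2.0, 1.0))
chosen = pick_candidate([(10.0, 10.0), (4.5, 2.0)], predicted)
```

A real tracker would compare mask candidates and not bare centers, and would fall back to a non-linear refinement when the linear guess is unreliable.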


[38] LOSC: LiDAR Open-voc Segmentation Consolidator cs.CVPDF

Nermin Samet, Gilles Puy, Renaud Marlet

TL;DR: LOSC is a method based on Vision-Language Models (VLMs) that refines sparse and noisy 3D point labels, achieving state-of-the-art open-vocabulary segmentation in driving scenes.

Details

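Motivation: When image semantics are back-projected onto 3D point clouds in the classical way, the resulting labels tend to be noisy and sparse and fall short of practical needs, so a method for refining them is needed.

Result: On nuScenes and SemanticKITTI, LOSC significantly outperforms existing methods on zero-shot open-vocabulary semantic and panoptic segmentation.

Insight: Refining sparse and noisy labels markedly improves 3D open-vocabulary segmentation, underscoring how much label quality matters for training.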

Abstract: We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
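
To see why back-projected labels come out sparse, consider a minimal pinhole-camera sketch. It assumes points already expressed in the camera frame and a per-pixel label map; the function name and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are illustrative assumptions, not part of LOSC.

```python
def backproject_labels(points_cam, label_map, fx, fy, cx, cy):
    """Assign each 3D point the label of the pixel it projects to.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    label_map:  2D list indexed as label_map[row][col].
    Points behind the camera or outside the image get None.
    """
    height, width = len(label_map), len(label_map[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            labels.append(None)  # behind the camera: no pixel to read
            continue
        u = int(round(fx * x / z + cx))  # column
        v = int(round(fy * y / z + cy))  # row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_map[v][u])
        else:
            labels.append(None)  # outside the field of view
    return labels


label_map = [["road", "car"], ["road", "road"]]
points = [(0, 0, 1), (1, 0, 1), (0, 0, -1), (5, 0, 1)]
labels = backproject_labels(points, label_map, fx=1, fy=1, cx=0, cy=0)
```

Every `None` is a point without supervision, and occlusions or calibration errors make some of the remaining labels wrong. These are the gaps that the consolidation step described in the abstract is meant to close.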


[39] SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs cs.CV | cs.CL | cs.HCPDF

Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei

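TL;DR: The authors present SpatialViz-Bench, an automatically generated benchmark of spatial visualization reasoning tasks for evaluating multimodal large language models, and find that current models have significant shortcomings.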

Details

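Motivation: Existing evaluations of multimodal large language models (MLLMs) usually rely on IQ tests or math competitions, which may overlap with training data and do not specifically assess spatial visualization. A benchmark focused on spatial visualization is therefore needed.

Result: The evaluation reveals wide performance variation across models and behaviors that run against human intuition, such as sharp 2D-to-3D performance drops and an over-reliance on formula derivation instead of spatial imagination.

Insight: Current MLLMs still fall well short on spatial visualization tasks and further research is needed. SpatialViz-Bench fills this evaluation gap.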

Abstract: Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.


[40] ViLU: Learning Vision-Language Uncertainties for Failure Prediction cs.CVPDF

Marc Lafon, Yannis Karmim, Julio Silva-Rodriguez, Paul Couairon, Clément Rambour

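TL;DR: ViLU is a new vision-language uncertainty quantification framework that builds an uncertainty-aware multimodal representation for failure prediction by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation.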

Details

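Motivation: Reliable uncertainty quantification (UQ) and failure prediction for vision-language models (VLMs) remain open problems; ViLU aims to address them through a multimodal representation.

Result: Experiments on several datasets (ImageNet-1k, CC12M, LAION-400M) show that ViLU significantly outperforms existing failure prediction methods. Ablation studies confirm the key role of its architecture and training strategy.

Insight: ViLU's novelty lies in combining uncertainty prediction with a multimodal representation. It is particularly suited to post-hoc settings and offers a new approach to VLM reliability.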

Abstract: Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
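
A weighted binary cross-entropy of the kind mentioned in the abstract can be sketched as follows. The labeling convention (1 marks an incorrect prediction) and the single `pos_weight` parameter are assumptions made for illustration; the abstract does not specify how ViLU sets its weights.

```python
import math


def weighted_bce(probs, is_failure, pos_weight):
    """Mean weighted binary cross-entropy.

    probs:      predicted probability that the VLM prediction is wrong.
    is_failure: 1 if the VLM prediction was wrong, else 0 (assumed convention).
    pos_weight: extra weight on the failure class (hypothetical parameter).
    """
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, is_failure):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Missing a failure costs more when the failure class is up-weighted.
loss_plain = weighted_bce([0.1], [1], pos_weight=1.0)
loss_weighted = weighted_bce([0.1], [1], pos_weight=3.0)
```

Because the target is only "correct versus incorrect", a loss of this form does not depend on the loss the underlying model was trained with, which is what the abstract means by loss-agnostic.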


[41] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates cs.CV | cs.MMPDF

Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xiaopeng Fan

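TL;DR: T-GVC is a trajectory-guided generative video coding framework that achieves high-quality video reconstruction at ultra-low bitrates through semantic-aware sparse motion sampling and trajectory-aligned loss constraints.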

Details

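Motivation: At ultra-low bitrates (ULB), existing generative video coding methods depend on domain specificity or high-level text guidance, which loses motion detail and yields unrealistic reconstructions.

Result: Experiments show that T-GVC outperforms existing methods under ULB conditions and achieves more precise motion control than text-guided methods.

Insight: Geometric motion modeling opens a new direction for generative video coding, bridging low-level motion tracking and high-level semantic understanding.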

Abstract: Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, aiming to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline to effectively bridge low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points based on their semantic importance, not only significantly reducing the bitrate but also preserving critical temporal semantic information. In addition, by incorporating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free latent space guidance mechanism to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.


[42] Bridging the gap in FER: addressing age bias in deep learning cs.CVPDF

F. Xavier Gaya-Morey, Julia Sanchez-Perez, Cristina Manresa-Yee, Jose M. Buades-Rubio

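TL;DR: The paper studies age bias in deep learning-based facial expression recognition (FER), especially its effect on elderly people, and proposes three mitigation strategies that significantly improve recognition accuracy for the elderly group.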

Details

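Motivation: Existing deep FER models exhibit age bias, particularly against elderly people, which compromises fairness and reliability and therefore needs to be studied and addressed.

Result: Recognition accuracy for elderly individuals improves markedly, especially for the most error-prone expressions, and models trained with age-aware strategies attend to more relevant facial regions for each age group.

Insight: Age-related bias can be mitigated with simple training modifications, and even approximate demographic labels help improve fairness in large-scale affective computing systems.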

Abstract: Facial Expression Recognition (FER) systems based on deep learning have achieved impressive performance in recent years. However, these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. In this work, we present a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly population. We first investigate whether recognition performance varies across age groups, which expressions are most affected, and whether model attention differs depending on age. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns, especially for “neutral”, “sadness”, and “anger” in elderly individuals. Based on these findings, we propose and evaluate three bias mitigation strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss. Our models are trained on a large-scale dataset, AffectNet, with automatically estimated age labels and validated on balanced benchmark datasets that include underrepresented age groups. Results show consistent improvements in recognition accuracy for elderly individuals, particularly for the most error-prone expressions. Saliency heatmap analysis reveals that models trained with age-aware strategies attend to more relevant facial regions for each age group, helping to explain the observed improvements. These findings suggest that age-related bias in FER can be effectively mitigated using simple training modifications, and that even approximate demographic labels can be valuable for promoting fairness in large-scale affective computing systems.


[43] MolCLIP: A Molecular-Auxiliary CLIP Framework for Identifying Drug Mechanism of Action Based on Time-Lapsed Mitochondrial Images cs.CVPDF

Fengqian Pang, Chunyue Lei, Hongfei Zhao, Chenghao Liu, Zhiqiang Xing

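TL;DR: MolCLIP is a visual language model that combines cell video and molecule modalities to identify drug mechanism of action. It is the first to use a molecule-auxiliary CLIP framework to guide video feature learning, and it yields significant gains in experiments.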

Details

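Motivation: Existing methods identify drug mechanism of action (MoA) mainly from spatial features of cell images, overlooking temporal dynamics and the complementary molecule modality. MolCLIP combines time-lapse cell videos with molecular information for a fuller understanding of MoA.

Result: On the MitoDataset, MolCLIP improves mAP by 51.2% on drug identification and by 20.5% on MoA recognition.

Insight: Combining time-lapse cell video with the molecule modality captures the dynamics of drug action more effectively, and molecule-auxiliary learning strengthens the model's understanding of MoA.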

Abstract: Drug Mechanism of Action (MoA) mainly investigates how drug molecules interact with cells, which is crucial for drug discovery and clinical application. Recently, deep learning models have been used to recognize MoA by relying on high-content and fluorescence images of cells exposed to various drugs. However, these methods focus on spatial characteristics while overlooking the temporal dynamics of live cells. Time-lapse imaging is more suitable for observing the cell response to drugs. Additionally, drug molecules can trigger cellular dynamic variations related to specific MoA. This indicates that the drug molecule modality may complement the image counterpart. This paper proposes MolCLIP, the first visual language model to combine the microscopic cell video and molecule modalities. MolCLIP designs a molecule-auxiliary CLIP framework to guide video features in learning the distribution of the molecular latent space. Furthermore, we integrate a metric learning strategy with MolCLIP to optimize the aggregation of video features. Experimental results on the MitoDataset demonstrate that MolCLIP achieves improvements of 51.2% and 20.5% in mAP for drug identification and MoA recognition, respectively.


[44] Attend-and-Refine: Interactive keypoint estimation and quantitative cervical vertebrae analysis for bone age assessment cs.CVPDF

Jinhee Kim, Taesung Kim, Taewoo Kim, Dong-Wook Kim, Byungduk Ahn

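TL;DR: The paper proposes the Attend-and-Refine Network (ARNet), an interactive deep learning model that streamlines keypoint annotation on cervical vertebrae, enabling accurate assessment of growth potential in pediatric orthodontics.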

Details

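Motivation: In pediatric orthodontics, accurately assessing growth potential from lateral cephalometric radiographs is essential for planning effective treatment, but conventional keypoint annotation is time-consuming and labor-intensive.

Result: Extensive validation on multiple datasets shows that ARNet performs well and applies broadly.

Insight: The work provides an efficient AI-assisted diagnostic tool for pediatric orthodontics, marking a notable step forward for the field.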

Abstract: In pediatric orthodontics, accurate estimation of growth potential is essential for developing effective treatment strategies. Our research aims to predict this potential by identifying the growth peak and analyzing cervical vertebra morphology solely through lateral cephalometric radiographs. We accomplish this by comprehensively analyzing cervical vertebral maturation (CVM) features from these radiographs. This methodology provides clinicians with a reliable and efficient tool to determine the optimal timings for orthodontic interventions, ultimately enhancing patient outcomes. A crucial aspect of this approach is the meticulous annotation of keypoints on the cervical vertebrae, a task often challenged by its labor-intensive nature. To mitigate this, we introduce Attend-and-Refine Network (ARNet), a user-interactive, deep learning-based model designed to streamline the annotation process. ARNet features an interaction-guided recalibration network, which adaptively recalibrates image features in response to user feedback, coupled with a morphology-aware loss function that preserves the structural consistency of keypoints. This novel approach substantially reduces manual effort in keypoint identification, thereby enhancing the efficiency and accuracy of the process. Extensively validated across various datasets, ARNet demonstrates remarkable performance and exhibits wide-ranging applicability in medical imaging. In conclusion, our research offers an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, marking a significant advancement in the field.


[45] Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought cs.CV | cs.AI | cs.LGPDF

Shin’ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa

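TL;DR: This paper proposes RED, a decoding strategy that addresses models ignoring their generated rationales in multimodal chain-of-thought (CoT) reasoning; by jointly using visual and rationale information it significantly improves reasoning.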

Details

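Motivation: Existing large vision-language models (LVLMs) often ignore the generated rationales in multimodal CoT reasoning, which hurts faithfulness and accuracy. The paper sets out to fix this.

Result: Experiments show that RED significantly outperforms standard CoT and other decoding methods across multiple benchmarks and LVLMs, improving both faithfulness and accuracy.

Insight: RED shows that jointly leveraging visual and rationale information makes better use of the generated multimodal chain of thought, pointing toward more reliable reasoning systems.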

Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.
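
The core operation described here, multiplying an image-conditional and a rationale-conditional next-token distribution, can be sketched in a few lines. The function name and the exponent `lam` that controls the strength of the rationale term are assumptions added for illustration; the abstract states only that the two distributions are multiplied.

```python
def combine_next_token(p_image, p_rationale, lam=1.0):
    """Normalized product p_image * p_rationale**lam over the vocabulary.

    lam is a hypothetical strength parameter; lam=0 ignores the rationale.
    """
    scores = [pi * (pr ** lam) for pi, pr in zip(p_image, p_rationale)]
    total = sum(scores)
    if total == 0:
        raise ValueError("the two distributions share no supported token")
    return [s / total for s in scores]


# The image alone is undecided, while the rationale strongly favors token 0.
combined = combine_next_token([0.5, 0.5], [0.9, 0.1])
```

A token receives high probability only if both conditionals support it, so the rationale can no longer be ignored at decoding time. In practice the product would be computed in log space over the full vocabulary for numerical stability.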


[46] Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation cs.CVPDF

Peixian Zhuang, Yijian Wang, Zhenqi Fu, Hongliang Zhang, Sam Kwong

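TL;DR: The paper proposes Tree-Mamba for underwater monocular depth estimation (UMDE), combining a tree-aware scanning strategy with BlueDepth, a reliable benchmark dataset, to significantly improve performance.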

Details

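Motivation: Existing Mamba methods perform poorly on underwater depth estimation because their scanning strategies are rigid, and reliable datasets are lacking, so the structural features of underwater images are not modeled effectively.

Result: Experiments show that Tree-Mamba outperforms existing methods in both quantitative and qualitative evaluations while remaining computationally efficient.

Insight: Tree structures capture the topological features of underwater images effectively, and high-quality datasets are essential for accurate depth estimation.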

Abstract: Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images caused by light absorption and scattering effects in marine environments. Recently, Mamba-based methods have achieved promising performance across various vision tasks; however, they struggle with the UMDE task because their inflexible state scanning strategies fail to model the structural features of underwater images effectively. Meanwhile, existing UMDE datasets usually contain unreliable depth labels, leading to incorrect object-depth relationships between underwater images and their corresponding depth maps. To overcome these limitations, we develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. Specifically, we propose a tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity. The spatial topological features among the tree nodes are then flexibly aggregated through bottom-up and top-down traversals, enabling stronger multi-scale feature representation capabilities. Moreover, we construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels. This benchmark serves as a foundational dataset for training existing deep learning-based UMDE methods to learn accurate object-depth relationships. Extensive experiments demonstrate the superiority of the proposed Tree-Mamba over several leading methods in both qualitative results and quantitative evaluations with competitive computational efficiency. Code and dataset will be available at https://wyjgr.github.io/Tree-Mamba.html.
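
As a rough illustration of building a minimum spanning tree from feature similarity, the sketch below runs Kruskal's algorithm on a 4-connected pixel grid, using the absolute difference of scalar features as edge weight. The scalar features, the grid connectivity, and the choice of Kruskal's algorithm are simplifying assumptions; the paper's actual construction may differ.

```python
def grid_mst(features):
    """Minimum spanning tree of a 4-connected grid (Kruskal's algorithm).

    features: 2D list of scalar features, one per pixel (an assumption; real
              feature maps hold a vector per location).
    Returns a list of edges ((r1, c1), (r2, c2)).
    """
    rows, cols = len(features), len(features[0])
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # edge to the right neighbor
                edges.append((abs(features[r][c] - features[r][c + 1]), (r, c), (r, c + 1)))
            if r + 1 < rows:  # edge to the neighbor below
                edges.append((abs(features[r][c] - features[r + 1][c]), (r, c), (r + 1, c)))
    edges.sort()  # most similar neighbors (smallest difference) first

    parent = {(r, c): (r, c) for r in range(rows) for c in range(cols)}

    def find(node):
        # Union-find root lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for _, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the edge only if it joins two components
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# The two rows have very different features, so the tree links each row first.
tree = grid_mst([[0, 1], [10, 11]])
```

The tree connects similar pixels before dissimilar ones, so bottom-up and top-down traversals of the kind the abstract describes pass information along structurally related regions instead of a fixed raster order.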


[47] Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring cs.CV | I.4.3PDF

Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu

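TL;DR: This paper proposes Motion-Aware Adaptive Pixel Pruning (M2AENet), an efficient method for local motion deblurring that uses a trainable mask predictor and structural reparameterization to allocate computation where it is needed, together with an intra-frame motion analyzer that adaptively restores blurred regions.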

Details

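Motivation: Existing deblurring methods allocate computation inefficiently and handle spatially varying blur patterns poorly, so they do not solve local motion blur effectively.

Result: The method outperforms existing methods on both local and global blur datasets while reducing FLOPs by 49%.

Insight: Dynamically identifying blurred regions and focusing computation on them significantly improves efficiency without sacrificing performance, which is promising for real-time deblurring.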

Abstract: Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.


[48] One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models cs.CVPDF

Jiale Zhao, Xinyang Jiang, Junyao Gao, Yuhao Xue, Cairong Zhao

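TL;DR: The paper introduces the CrossVLAD benchmark dataset and the CRAFT attack framework for evaluating the robustness of unified vision-language models (VLMs) to cross-task adversarial attacks.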

Details

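Motivation: The multi-task flexibility of unified VLMs creates unique security challenges, since adversarial inputs must remain effective under multiple task instructions.

Result: Experiments show that CRAFT outperforms existing methods on Florence-2 and other unified VLMs, significantly raising cross-task attack success rates.

Insight: Cross-task adversarial attacks pose new challenges for the security and robustness of unified VLMs, and defenses need further study.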

Abstract: Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective (consistently manipulating a target object’s classification across four downstream tasks) and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
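
A success rate that counts an attack only when it fools every task at once can be sketched as below. The task names are placeholders, since the abstract does not list the four downstream tasks, and the function is an illustrative reading of the metric, not the benchmark's reference implementation.

```python
def cross_task_success_rate(results):
    """Fraction of samples on which the attack succeeds for all tasks.

    results: list of dicts mapping task name -> True if the target object
             was misclassified as intended on that task.
    """
    if not results:
        return 0.0
    hits = sum(1 for per_task in results if all(per_task.values()))
    return hits / len(results)


# Hypothetical task names, for illustration only.
results = [
    {"task_a": True, "task_b": True, "task_c": True, "task_d": True},
    {"task_a": True, "task_b": True, "task_c": False, "task_d": True},
]
rate = cross_task_success_rate(results)
```

The second sample fools three of four tasks and still counts as a failure, which is what makes this metric stricter than averaging per-task success rates.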


[49] Scaling RL to Long Videos cs.CV | cs.AI | cs.CLPDF

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye

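TL;DR: The paper presents a full-stack framework that uses reinforcement learning to scale vision-language models (VLMs) to long-video reasoning, addressing the unique challenges of that setting.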

Details

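Motivation: Long-video reasoning tasks such as question answering challenge existing VLMs, which must process long sequences efficiently while retaining context.

Result: LongVILA-R1-7B performs strongly on long-video QA benchmarks, outperforming Video-R1-7B and matching Gemini-1.5-Pro. MR-SP achieves up to a 2.1x speedup.

Insight: Scaling long-video reasoning requires efficient data processing and training infrastructure; MR-SP's parallel design offers a general solution for multimodal RL training.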

Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).


[50] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology cs.CV | cs.AI | cs.CLPDF

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang

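TL;DR: The paper proposes TreeBench, a diagnostic benchmark for holistically evaluating visual grounded reasoning, and TreeVGR, a method that jointly supervises localization and reasoning with reinforcement learning and significantly improves model performance.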

Details

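Motivation: Existing visual grounded reasoning models lack a comprehensive benchmark, so their perception and reasoning in complex scenes cannot be fully measured.

Result: On TreeBench's challenging questions, even state-of-the-art models stay below 60% accuracy (e.g., OpenAI-o3 scores 54.87%). TreeVGR brings clear gains on several benchmarks (V* Bench +16.8, MME-RealWorld +12.6, TreeBench +13.4).

Insight: Traceability is key to advancing visual grounded reasoning; training localization and reasoning jointly improves both interpretability and performance.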

Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human “thinking with images”. However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy (e.g., OpenAI-o3 scores only 54.87). Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
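
The abstract says evidence is checked via bounding box evaluation but does not name the metric. A common choice for comparing a predicted box with an annotated one is intersection over union (IoU), sketched here as an assumption, not as TreeBench's actual scoring rule.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A score of this kind lets an evaluator check not only whether the final answer is right but also whether the model looked at the right region while reasoning.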


[51] Energy-Guided Decoding for Object Hallucination Mitigation cs.CVPDF

Xixi Liu, Ailin Deng, Christopher Zach

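TL;DR: The paper proposes an energy-based decoding method to mitigate object hallucination in vision-language models. The method is simple yet effective, substantially reducing the imbalance in the “Yes” ratio while improving performance.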

Details

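Motivation: Existing methods for mitigating object hallucination are either restricted to specific decoding methods, require sophisticated modifications to visual inputs, or rely on knowledge from external models, so a simpler and more general solution is needed.

Result: On three benchmarks (POPE, MME, and MMVP), the method improves accuracy and F1 score, with an average accuracy gain of 4.82% and an average reduction of 8.81% in the yes-ratio gap.

Insight: Dynamically selecting low-energy hidden states effectively reduces model bias and improves performance without complex modifications or external dependencies.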

Abstract: Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods either are restricted to specific decoding methods, or demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal the phenomenon that VLMs exhibit significant imbalance in the “Yes” ratio (i.e., the fraction of “Yes” answers among the total number of questions) across three different visual question answering (VQA) datasets. Furthermore, we propose an energy-based decoding method, which dynamically selects the hidden states from the layer with minimal energy score. It is simple yet effective in reducing the bias for the yes ratio while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score on three VQA datasets across three commonly used VLMs over several baseline methods. The average accuracy improvement is 4.82% compared to greedy decoding. Moreover, the average yes-ratio gap reduction is 8.81%, meaning the proposed method is less biased as shown in Figure 1.
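
Selecting the layer with minimal energy can be illustrated with the standard energy score from energy-based models, E = -log Σ exp(logit). Whether the paper uses exactly this definition, and how per-layer logits are obtained, is not stated in the abstract, so both are assumptions in this sketch.

```python
import math


def energy(logits):
    """Energy score -logsumexp(logits), computed stably (assumed definition)."""
    peak = max(logits)
    return -(peak + math.log(sum(math.exp(v - peak) for v in logits)))


def select_min_energy_layer(per_layer_logits):
    """Return (index of the layer with the lowest energy, all energies)."""
    energies = [energy(layer) for layer in per_layer_logits]
    best = min(range(len(energies)), key=energies.__getitem__)
    return best, energies


# Hypothetical logits from three layers projected onto a two-token vocabulary.
best_layer, energies = select_min_energy_layer([[0.0, 0.0], [5.0, 0.0], [1.0, 1.0]])
```

Under this definition a low energy corresponds to large, confident logits, so the selection favors the layer whose representation is most decisive, not necessarily the final layer.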


[52] EEvAct: Early Event-Based Action Recognition with High-Rate Two-Stream Spiking Neural Networks cs.CV | cs.NEPDF

Michael Neumeier, Jules Lecomte, Nils Kazinski, Soubarna Banik, Bing Li

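TL;DR: The paper proposes EEvAct, an early event-based action recognition method that uses a high-rate two-stream spiking neural network (SNN) to achieve high accuracy and low latency, improving accuracy on the THU EACT-50 dataset by 2% over previous work.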

Details

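Motivation: Early action recognition has demanding safety and real-time requirements, and the high temporal resolution and low latency of event cameras suit it well. However, existing methods usually accumulate events into low-rate frames or space-time voxels, which limits early prediction, while SNN methods can process events at a high rate but still lag in accuracy.

Result: On THU EACT-50, the method improves final accuracy by 2% over previous work and shows better recognition at early observation times.

Insight: A two-stream SNN architecture that combines temporal and spatial information can effectively improve early action recognition from event data, which is promising for real-time applications.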

Abstract: Recognizing human activities early is crucial for the safety and responsiveness of human-robot and human-machine interfaces. Due to their high temporal resolution and low latency, event-based vision sensors are a perfect match for this early recognition demand. However, most existing processing approaches accumulate events into low-rate frames or space-time voxels, which limits the early prediction capabilities. In contrast, spiking neural networks (SNNs) can process the events at a high rate for early predictions, but most works still fall short on final accuracy. In this work, we introduce a high-rate two-stream SNN which closes this gap by outperforming previous work by 2% in final accuracy on the large-scale THU EACT-50 dataset. We benchmark the SNNs within a novel early event-based recognition framework by reporting Top-1 and Top-5 recognition scores for growing observation time. Finally, we exemplify the impact of these methods on a real-world task of early action triggering for human motion capture in sports.


[53] Sparse-Dense Side-Tuner for efficient Video Temporal Grounding cs.CVPDF

David Pujol-Perich, Sergio Escalera, Albert Clapés

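TL;DR: The paper proposes the Sparse-Dense Side-Tuner (SDST) for efficient video temporal grounding (VTG), combining sparse and dense feature tuning to significantly improve performance while reducing the parameter count.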

Details

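Motivation: Existing methods rely mainly on final-layer features of pre-trained models and adapt poorly to new domains, while full fine-tuning is impractical. Side-tuning (ST) is an alternative, but prior work overlooks the sparsity inherent in temporal grounding.

Result: SDST achieves competitive or SOTA results on QVHighlights, TACoS, and Charades-STA while reducing the parameter count by up to 73%.

Insight: Sparse features are crucial for temporal grounding, and combining them with dense tuning significantly improves performance. The improved deformable attention mechanism further strengthens context modeling.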

Abstract: Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning, and particularly side-tuning (ST), has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention, a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound implications in performance. Overall, our method significantly improves existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. the existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.


[54] Deep Learning based 3D Volume Correlation for Additive Manufacturing Using High-Resolution Industrial X-ray Computed Tomography cs.CV | eess.IVPDF

Keerthana Chand, Tobias Fritsch, Bardia Hejazi, Konstantin Poka, Giovanni Bruno

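TL;DR: The paper proposes a deep learning-based 3D volume registration method for quality inspection in additive manufacturing; dynamic patch-based processing of high-resolution XCT data significantly improves registration accuracy and efficiency.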

Details

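Motivation: Geometric deformations in additive manufacturing degrade component performance. Conventional Digital Volume Correlation (DVC) lacks a ground-truth deformation field for registration, and high-resolution XCT data is computationally demanding.

Result: Compared with classic DVC methods, the Dice Score improves by 9.2% and the voxel match rate by 9.9%, while registration time drops from days to minutes.

Insight: Deep learning can provide efficient and reliable registration for additive manufacturing, and closed-loop compensation meshes promise to make the manufacturing process more reliable and efficient.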

Abstract: Quality control in additive manufacturing (AM) is vital for industrial applications in areas such as the automotive, medical and aerospace sectors. Geometric inaccuracies caused by shrinkage and deformations can compromise the life and performance of additively manufactured components. Such deviations can be quantified using Digital Volume Correlation (DVC), which compares the computer-aided design (CAD) model with the X-ray Computed Tomography (XCT) geometry of the components produced. However, accurate registration between the two modalities is challenging due to the absence of a ground truth or reference deformation field. In addition, the extremely large data size of high-resolution XCT volumes makes computation difficult. In this work, we present a deep learning-based approach for estimating voxel-wise deformations between CAD and XCT volumes. Our method uses a dynamic patch-based processing strategy to handle high-resolution volumes. In addition to the Dice Score, we introduce a Binary Difference Map (BDM) that quantifies voxel-wise mismatches between binarized CAD and XCT volumes to evaluate the accuracy of the registration. Our approach shows a 9.2% improvement in the Dice Score and a 9.9% improvement in the voxel match rate compared to classic DVC methods, while reducing the interaction time from days to minutes. This work sets the foundation for deep learning-based DVC methods to generate compensation meshes that can then be used in closed-loop correlations during the AM production process. Such a system would be of great interest to industries since the manufacturing process will become more reliable and efficient, saving time and material.
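
The two evaluation quantities mentioned here can be sketched on flattened binary volumes. The Dice score follows its standard definition; the Binary Difference Map is taken to be a voxel-wise XOR and the voxel match rate the fraction of agreeing voxels, both of which are assumptions, since the abstract does not give formulas.

```python
def dice_score(vol_a, vol_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened 0/1 volumes."""
    overlap = sum(a & b for a, b in zip(vol_a, vol_b))
    total = sum(vol_a) + sum(vol_b)
    return 2 * overlap / total if total else 1.0


def binary_difference_map(vol_a, vol_b):
    """1 where the two binarized volumes disagree, 0 where they agree."""
    return [a ^ b for a, b in zip(vol_a, vol_b)]


def voxel_match_rate(vol_a, vol_b):
    """Fraction of voxels on which both volumes agree (assumed definition)."""
    diff = binary_difference_map(vol_a, vol_b)
    return 1 - sum(diff) / len(diff)


cad = [1, 1, 0, 0]  # binarized CAD voxels (toy data)
xct = [1, 0, 0, 0]  # binarized, registered XCT voxels (toy data)
dice = dice_score(cad, xct)
match = voxel_match_rate(cad, xct)
```

Dice ignores voxels that are empty in both volumes, while the match rate counts them as agreement, so the two numbers can diverge for sparse parts.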


[55] SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples cs.CVPDF

Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein

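TL;DR: SCOOTER is an open-source, statistically powered framework for evaluating how authentic unrestricted adversarial examples look. Through a large-scale human evaluation and a comparison against models, it finds that existing color-space and diffusion-based attacks fail to produce imperceptible images, and it provides practical guidelines, tools, and a benchmark dataset.

Details

Motivation: Unrestricted adversarial attacks are not bound by traditional $\ell_p$-norm constraints and aim to produce adversarial examples that humans struggle to notice. No unified, statistically meaningful evaluation framework exists, so a standardized method is needed to verify how authentic these attacks look.

Result: Existing attacks fail to produce images that are imperceptible to humans, and human and machine vision systems differ in what they perceive. GPT-4o behaves consistently on only some of the attacks.

Insight: 1. The authenticity of unrestricted adversarial examples needs human evaluation; 2. existing evaluation methods need statistical support; 3. human and machine perception are misaligned, so humans should serve as the reference.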

Abstract: Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object’s color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.


[56] SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes cs.CV | cs.ROPDF

Jiaxin Huang, Ziwen Li, Hanlve Zhang, Runnan Chen, Xiao He

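TL;DR: SURPRISE3D is a new dataset for evaluating language-guided spatial reasoning segmentation in complex 3D scenes, aiming to close the spatial reasoning gap in current 3D vision-language research.

Details

Motivation: Existing datasets mix semantic cues (e.g., object names) with spatial context, so models rely on superficial shortcuts instead of genuinely understanding spatial relationships. SURPRISE3D addresses this bias with queries written without object names.

Result: Initial benchmarks show that current state-of-the-art 3D visual grounding methods and 3D-LLMs perform poorly on spatial reasoning tasks, underscoring the need for the dataset.

Insight: The dataset and benchmark suite advance spatially aware AI and provide an important tool for embodied interaction and robotic planning.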

Abstract: The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object names) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. Surprise3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found at https://github.com/liziwennba/SUPRISE.


[57] Robust and Generalizable Heart Rate Estimation via Deep Learning for Remote Photoplethysmography in Complex Scenarios cs.CV | F.2.2PDF

Kang Cen, Chang-Hong Fu, Hong Hong

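TL;DR: This paper proposes an end-to-end deep learning network for remote photoplethysmography (rPPG). Using a 3D convolutional neural network and a dynamic hybrid loss function, it improves the robustness and generalization of heart rate and BVP estimation in complex scenarios.

Details

Motivation: Non-contact rPPG measures heart rate from facial videos, but existing models still struggle with accuracy, robustness, and generalization in complex scenarios.

Result: Evaluated on PURE, UBFC-rPPG, and the challenging MMPD dataset, the method outperforms existing approaches (MAE of 7.58 on the MMPD test set).

Insight: Differential frame fusion and the dynamic hybrid loss function are key to performance in complex scenarios, and the efficiency of TSM could carry over to other temporal tasks.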

Abstract: Non-contact remote photoplethysmography (rPPG) technology enables heart rate measurement from facial videos. However, existing network models still face challenges in accuracy, robustness, and generalization capability under complex scenarios. This paper proposes an end-to-end rPPG extraction network that employs 3D convolutional neural networks to reconstruct accurate rPPG signals from raw facial videos. We introduce a differential frame fusion module that integrates differential frames with original frames, enabling frame-level representations to capture blood volume pulse (BVP) variations. Additionally, we incorporate Temporal Shift Module (TSM) with self-attention mechanisms, which effectively enhance rPPG features with minimal computational overhead. Furthermore, we propose a novel dynamic hybrid loss function that provides stronger supervision for the network, effectively mitigating overfitting. Comprehensive experiments were conducted on not only the PURE and UBFC-rPPG datasets but also the challenging MMPD dataset under complex scenarios, involving both intra-dataset and cross-dataset evaluations, which demonstrate the superior robustness and generalization capability of our network. Specifically, after training on PURE, our model achieved a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming the state-of-the-art models.
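A rough sketch of what differential frame fusion and the MAE metric could look like on toy data. The abstract does not say how the fusion is performed, so pairing each frame with its forward difference as an extra channel is an assumption, and all function names are hypothetical.

```python
def differential_frames(frames):
    """Pixel-wise difference between each frame and the one that follows it."""
    return [[nxt - cur for cur, nxt in zip(a, b)]
            for a, b in zip(frames, frames[1:])]


def fuse_frames(frames):
    """Assumed fusion: pair every pixel with its forward difference.

    The last frame has no successor, so it is dropped.
    """
    diffs = differential_frames(frames)
    return [list(zip(frame, diff)) for frame, diff in zip(frames, diffs)]


def mean_absolute_error(predicted, reference):
    """MAE between predicted and reference heart rates."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Three tiny two-pixel "frames" (hypothetical intensities).
frames = [[10, 20], [12, 19], [15, 19]]
fused = fuse_frames(frames)

# Hypothetical heart-rate estimates and references in beats per minute.
mae = mean_absolute_error([72, 80, 66], [70, 84, 66])
```

The difference channel emphasizes the small frame-to-frame intensity changes that carry the pulse signal, while the original channel preserves appearance.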


[58] Visual Instance-aware Prompt Tuning cs.CV | cs.AIPDF

Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang

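TL;DR: The paper proposes Visual Instance-aware Prompt Tuning (ViaPT), which generates an instance-aware prompt for each input and fuses it with dataset-level prompts, addressing the sub-optimal performance that conventional methods show on high-variance downstream datasets.

Details

Motivation: Conventional Visual Prompt Tuning (VPT) uses dataset-level prompts that are identical for all input instances, which yields sub-optimal performance because downstream datasets have high variance.

Result: Experiments on 34 diverse datasets show that ViaPT consistently outperforms existing baselines.

Insight: VPT-Deep and VPT-Shallow are two extreme cases; ViaPT combines the strengths of both to achieve better performance and parameter efficiency.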

Abstract: Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two corner cases based on a conceptual understanding, in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.


[59] Synergistic Prompting for Robust Visual Recognition with Missing Modalities cs.CVPDF

Zhihui Zhang, Luanyuan Dai, Qika Lin, Yunfeng Diao, Guangyin Jin

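TL;DR: This paper proposes a new framework, Synergistic Prompting (SyP), to address the performance degradation caused by missing input modalities in multi-modal visual recognition. With a dynamic adapter and a synergistic prompting strategy, SyP markedly improves model robustness and adaptability.

Details

Motivation: In real-world applications, missing or incomplete multi-modal inputs often degrade performance. Existing prompt-based methods, limited by static prompts and basic tuning strategies, struggle to adapt to varying missing-data conditions or to stay reliable when critical modalities are missing.

Result: On three widely used visual recognition datasets, SyP significantly outperforms existing methods and stays robust across diverse missing rates and conditions.

Insight: Combining dynamic and static prompts effectively improves the adaptability and robustness of multi-modal models, offering a new approach to handling missing modalities.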

Abstract: Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.


[60] Patient-specific vs Multi-Patient Vision Transformer for Markerless Tumor Motion Forecasting cs.CVPDF

Gauthier Rotsart de Hertaing, Dani Manjah, Benoit Macq

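TL;DR: This paper is the first to apply a Vision Transformer (ViT) architecture to markerless tumor motion forecasting, comparing patient-specific (PS) and multi-patient (MP) training strategies. PS models perform better when training data are plentiful, whereas MP models are more robust and generalize better when clinical time is limited.

Details

Motivation: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. Current markerless methods rely mainly on deep learning, yet transformer-based architectures remain unexplored in this domain.

Result: 1. PS models perform better on planning data (T1); 2. MP models are more robust to inter-fractional anatomical variability and match PS performance on treatment data (T2) without retraining.

Insight: Although PS models are more precise when data are plentiful, MP models suit time-constrained clinical settings better thanks to their out-of-the-box usability and robustness.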

Abstract: Background: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Purpose: This work introduces a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT). Two training strategies are evaluated under clinically realistic constraints: a patient-specific (PS) approach that learns individualized motion patterns, and a multi-patient (MP) model designed for generalization. The comparison explicitly accounts for the limited number of images that can be generated between planning and treatment sessions. Methods: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient’s planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Results: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusions: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings.
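The two metrics in the abstract have standard definitions in trajectory forecasting: ADE averages the Euclidean error over every predicted time step, and FDE takes the error at the final step only. A minimal sketch follows, with hypothetical 2D positions and a hypothetical function name; the paper's units and coordinate conventions are not given in the abstract.

```python
import math


def displacement_errors(predicted, truth):
    """Return (ADE, FDE) for two equally long sequences of points."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, t) for p, t in zip(predicted, truth)]
    ade = sum(distances) / len(distances)  # mean error over the horizon
    fde = distances[-1]                    # error at the last forecast step
    return ade, fde


# Hypothetical tumor centroid positions over a three-step horizon.
truth = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
predicted = [(1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]

ade, fde = displacement_errors(predicted, truth)
print(ade, fde)
```

A forecast that drifts over time, as in this toy case, shows an FDE well above its ADE, which is why the two are usually reported together.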


[61] Rethinking Query-based Transformer for Continual Image Segmentation cs.CVPDF

Yuchen Zhu, Cheng Shi, Dingyou Wang, Jiajin Tang, Zhengxuan Wei

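TL;DR: The paper proposes SimCIS, a query-based Transformer method that assigns queries by directly selecting image features, addressing the loss of plasticity and the dependence on data order in continual image segmentation.

Details

Motivation: Current continual image segmentation methods decouple mask generation from the continual learning process, but they suffer from loss of plasticity and heavy reliance on input data order. The paper aims to resolve both issues.

Result: SimCIS outperforms existing methods across various segmentation tasks, settings, and data orders.

Insight: Highly aggregated image features give queries a shortcut to generate masks through simple feature alignment; cross-stage consistent selection and a replay mechanism effectively mitigate forgetting.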

Abstract: Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring “perfect alignment” to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative “visual query”-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.


[62] Single-Step Latent Diffusion for Underwater Image Restoration cs.CVPDF

Jiayi Wu, Tianfu Wang, Md Abu Bakr Siddique, Md Jahidul Islam, Cornelia Fermuller

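TL;DR: This paper proposes SLURPP, a single-step latent diffusion method for underwater image restoration that combines a pretrained latent diffusion model with explicit scene decomposition, markedly improving restoration quality and computational efficiency.

Details

Motivation: Underwater image restoration is critical in fields such as marine ecology, aquaculture, and underwater archaeology. Existing diffusion-based methods are effective but computationally intensive, and they tend to produce unrealistic artifacts on scenes with complex geometry and depth variation.

Result: SLURPP performs strongly on both synthetic and real data: it is over 200 times faster than existing diffusion-based methods with a PSNR gain of about 3 dB, and it shows clear qualitative improvements on real-world data.

Insight: Pretrained latent diffusion models capture strong priors on scene structure, and combining them with explicit scene decomposition markedly improves the realism and efficiency of underwater restoration. Highly diverse physics-based synthetic data is crucial for generalization.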

Abstract: Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models – which encode strong priors on the geometry and depth of scenes – with an explicit scene decomposition – which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~ 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website https://tianfwang.github.io/slurpp/.
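The attenuation and backscattering effects mentioned above are often described with a simplified per-channel image formation model, $I_c = J_c e^{-\beta_c d} + B_c (1 - e^{-\beta_c d})$, where $J$ is the clear scene, $d$ the distance to the camera, $\beta_c$ an attenuation coefficient, and $B_c$ the veiling light. The sketch below applies that model to one pixel. It is an assumption that this simplified form resembles the paper's synthesis pipeline, and every coefficient value here is hypothetical.

```python
import math


def degrade_pixel(rgb, depth_m, beta, veiling_light):
    """Apply the simplified underwater model to one RGB pixel in [0, 1]."""
    degraded = []
    for clear, b, veil in zip(rgb, beta, veiling_light):
        transmission = math.exp(-b * depth_m)   # fraction of direct light kept
        degraded.append(clear * transmission + veil * (1.0 - transmission))
    return tuple(degraded)


clear = (0.8, 0.6, 0.4)      # hypothetical in-air pixel
beta = (0.60, 0.10, 0.05)    # hypothetical: red attenuates fastest
veil = (0.05, 0.30, 0.40)    # hypothetical blue-green veiling light

near = degrade_pixel(clear, 0.0, beta, veil)
far = degrade_pixel(clear, 5.0, beta, veil)
deep = degrade_pixel(clear, 1000.0, beta, veil)
```

At zero distance the pixel is unchanged, and at very large distance it converges to the veiling light, which matches the familiar blue-green cast of distant underwater objects.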


[63] MIRA: A Novel Framework for Fusing Modalities in Medical RAG cs.CVPDF

Jinhong Wang, Tajamul Ashraf, Zongyan Han, Jorma Laaksonen, Rao Mohammad Anwer

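TL;DR: The MIRA framework dynamically adjusts the number of retrieved contexts and integrates multimodal information, addressing the accuracy loss that insufficient or excessive retrieval causes in medical MLLMs and markedly improving factual accuracy and performance.

Details

Motivation: MLLMs used for medical diagnosis often generate responses that are inconsistent with medical knowledge. RAG can improve accuracy, but insufficient or excessive retrieval leads to factual errors.

Result: On public medical VQA and report generation benchmarks, MIRA substantially improves factual accuracy and overall performance, reaching state-of-the-art results.

Insight: Dynamically adjusting retrieval and fusing modalities are key to improving the accuracy of medical MLLMs.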

Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.


[64] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement cs.CVPDF

Xiao Yang, Yuxuan Fan, Can Liu, Houcheng Su, Weichen Guo

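TL;DR: This paper proposes a test-time adaptation strategy based on spatio-temporal inconsistency (the CiCi framework) for remote photoplethysmography (rPPG). By combining consistency and inconsistency priors, it improves model adaptability at inference time.

Details

Motivation: Privacy and real-time constraints limit the real-world use of existing domain adaptation and generalization methods, so a fully Test-Time Adaptation (TTA) method is needed for rPPG tasks.

Result: Experiments on five datasets show that the method significantly outperforms existing techniques without access to source data, achieving state-of-the-art real-time self-supervised adaptation.

Insight: Exploiting inconsistency signals in the spatio-temporal domain can markedly improve test-time adaptation, especially in privacy-sensitive and real-time scenarios.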

Abstract: Remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for monitoring physiological signals using the camera. Although various domain adaptation and generalization methods were proposed to promote the adaptability of deep-based rPPG models in unseen deployment environments, considerations in aspects like privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, we aim to propose a novel fully Test-Time Adaptation (TTA) strategy tailored for rPPG tasks in this work. Specifically, based on prior knowledge in physiology and our observations, we noticed not only that there is spatio-temporal consistency in the frequency domain of rPPG signals, but also that inconsistency in the time domain is significant. Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised Consistency-inConsistency-integration (CiCi) framework to enhance model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.


[65] Towards Continuous Home Cage Monitoring: An Evaluation of Tracking and Identification Strategies for Laboratory Mice cs.CV | cs.AI | cs.LGPDF

Juan Pablo Oberhauser, Daniel Grzenda

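TL;DR: This paper presents a real-time identification algorithm for accurately tracking and identifying laboratory mice in digital home cages. By combining a tracker that uses appearance and motion cues, a transformer-based ID classifier, and a tracklet associator, it markedly improves tracking efficiency and ID accuracy.

Details

Motivation: Continuous automated monitoring of laboratory mice improves data collection accuracy and animal welfare, but identifying individuals is hard because of high housing density, similar appearance, and high mobility.

Result: The pipeline assigns IDs with high accuracy at 30 FPS, improving tracking efficiency and reducing ID switches compared with existing methods.

Insight: Fusing multiple cues and optimizing the algorithms enables stable mouse identification and monitoring in complex environments.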

Abstract: Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program to assign final ID predictions to tracklets (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors compared to current mouse tracking methods.
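The final step of the pipeline assigns one ID per tracklet. The abstract says this is done with a linear program; the sketch below swaps in a brute-force search over permutations, which finds the same kind of one-to-one assignment for a handful of animals but does not scale. It assumes the tracklets overlap in time, so no two may share an ID, and the score matrix and function name are hypothetical.

```python
from itertools import permutations


def assign_ids(scores, ids):
    """Pick distinct IDs for tracklets so the summed confidence is maximal.

    scores[t][k] is the classifier confidence that tracklet t carries ids[k].
    """
    n_tracklets = len(scores)
    if n_tracklets > len(ids):
        raise ValueError("more tracklets than available IDs")
    best_total, best_choice = float("-inf"), None
    for choice in permutations(range(len(ids)), n_tracklets):
        total = sum(scores[t][k] for t, k in enumerate(choice))
        if total > best_total:
            best_total, best_choice = total, choice
    return {t: ids[k] for t, k in enumerate(best_choice)}


# Hypothetical confidences: tracklets 0 and 1 both look most like mouse "A".
scores = [
    [0.6, 0.5, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.9],
]
assignment = assign_ids(scores, ["A", "B", "C"])
print(assignment)
```

Taking the best ID for each tracklet independently would label both tracklets 0 and 1 as "A"; solving the assignment jointly resolves the conflict by giving tracklet 0 its second choice.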


[66] Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions cs.CVPDF

Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin

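TL;DR: The paper proposes a complete solution for generating realistic Martian landscape videos that combines data reconstruction with video generation. Its two components, M3arsSynth and MarsGen, address the scarcity of Martian data and the domain gap.

Details

Motivation: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation, but Martian data are scarce and differ substantially from terrestrial imagery.

Result: Experiments show that the approach outperforms video synthesis models trained on terrestrial datasets in both visual fidelity and 3D structural consistency.

Insight: Reconstructing physically accurate 3D environments is key to Martian video synthesis, and the cross-domain gap can be tackled with targeted data reconstruction and generative models.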

Abstract: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA’s Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.


[67] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling cs.CV | cs.AIPDF

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye

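TL;DR: The paper proposes Geometry Forcing, which aligns a video diffusion model with a pretrained geometric foundation model to improve the 3D consistency of generated videos.

Details

Motivation: Videos are 2D projections of a dynamic 3D world, but video diffusion models trained only on raw video data often fail to capture geometry-aware structure, so the generated content lacks 3D consistency.

Result: On camera view-conditioned and action-conditioned video generation tasks, Geometry Forcing substantially improves the visual quality and 3D consistency of generated videos.

Insight: Introducing geometry-aware intermediate representations into video generation helps model the dynamics of the 3D world, improving the realism and consistency of generated content.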

Abstract: Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model’s intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
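A small sketch of the Angular Alignment idea, written as one minus cosine similarity averaged over feature pairs. The exact loss form, feature shapes, and any projection layers are not given in the abstract, so this is an assumed formulation with hypothetical names. It also shows why a separate scale objective is needed: cosine similarity ignores vector magnitude.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def angular_alignment_loss(diffusion_feats, geometry_feats):
    """Assumed form: mean of (1 - cosine similarity) over paired features."""
    pairs = list(zip(diffusion_feats, geometry_feats))
    return sum(1.0 - cosine_similarity(d, g) for d, g in pairs) / len(pairs)


# Hypothetical features: the first pair is aligned, the second is orthogonal.
diffusion_feats = [[1.0, 0.0], [1.0, 0.0]]
geometry_feats = [[2.0, 0.0], [0.0, 3.0]]

loss = angular_alignment_loss(diffusion_feats, geometry_feats)
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
```

Because `[3, 4]` and `[6, 8]` score a similarity of one despite differing in length, a direction-only loss cannot recover scale, which is the gap the abstract's Scale Alignment term is described as filling.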


[68] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding cs.CVPDF

JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu

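TL;DR: OST-Bench is a new benchmark that evaluates multimodal large language models (MLLMs) on online spatio-temporal scene understanding, revealing the shortcomings of existing models in complex spatio-temporal reasoning.

Details

Motivation: Most existing benchmarks are designed for offline settings and do not reflect the dynamic exploration and reasoning that real-world settings demand, so a new evaluation tool for online spatio-temporal understanding is needed.

Result: Experiments show that leading MLLMs perform poorly on complex spatio-temporal reasoning tasks, with accuracy dropping as the exploration horizon and memory demands grow.

Insight: Long-term memory retrieval and complex spatial reasoning are identified as the two core challenges for improving online reasoning.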

Abstract: Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/


[69] CLIP Won’t Learn Object-Attribute Binding from Natural Data and Here is Why cs.CVPDF

Bijay Gurung, David T. Hoffmann, Thomas Brox

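TL;DR: CLIP cannot learn object-attribute binding from natural data. The cause lies in properties of that data, namely low attribute density, incomplete captions, and saliency bias, not in batch size or a lack of hard negatives.

Details

Motivation: Contrastive vision-language models such as CLIP are widely used for zero-shot classification and inside multi-modal models, but their representations have limitations, such as failing to distinguish object-attribute bindings. The authors approach the problem by analyzing data properties.

Result: CLIP learns almost perfect object-attribute binding only when the data satisfies specific properties, such as high attribute density and complete captions.

Insight: Properties of natural data, rather than model architecture or training strategy, are the main reason CLIP fails to learn binding, so improving data design may be more effective than adjusting the model.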

Abstract: Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoder for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of “a yellow submarine and a blue bus” or “a blue submarine and a yellow bus”. Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP’s ability to learn binding using a synthetic dataset. We find that common properties of natural data, such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is “most salient” to them), have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
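The bag-of-words failure can be illustrated on the text side with the abstract's own two captions. This toy example only counts words and adjacent word pairs; it is not how CLIP encodes text, and the helper names are hypothetical.

```python
from collections import Counter


def bag_of_words(caption):
    """Order-free word counts: attribute-object pairing is lost."""
    return Counter(caption.lower().split())


def adjacent_pairs(caption):
    """Counts of neighboring word pairs, which keep local binding."""
    words = caption.lower().split()
    return Counter(zip(words, words[1:]))


first = "a yellow submarine and a blue bus"
second = "a blue submarine and a yellow bus"

same_bag = bag_of_words(first) == bag_of_words(second)
same_pairs = adjacent_pairs(first) == adjacent_pairs(second)
print(same_bag, same_pairs)
```

The two captions contain exactly the same words, so any representation that discards order cannot tell them apart, whereas a representation that keeps which attribute sits next to which object can.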


[70] Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs cs.CV | cs.AIPDF

Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee

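TL;DR: The paper proposes STTM, a training-free acceleration method for video large language models that cuts computation through multi-granular spatio-temporal token merging while maintaining high accuracy.

Details

Motivation: Video LLMs incur quadratic computational scaling from their many spatio-temporal tokens, and existing methods do not fully exploit the local spatio-temporal redundancy of video data.

Result: STTM performs strongly on six video QA benchmarks, for example a 2x speed-up with only a 0.5% accuracy drop under a 50% token budget.

Insight: The spatio-temporal redundancy of video data enables efficient token merging, markedly improving computational efficiency without additional training.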

Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
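The spatial half of the method, a coarse-to-fine search over a quadtree, can be sketched as follows. A block is kept as one merged token when its values are all close to the block mean, and is otherwise split into four quadrants. The similarity criterion, the scalar features, and the threshold are assumptions for illustration (the abstract does not specify them), the temporal merging step is omitted, and the grid is assumed to be square with a power-of-two side.

```python
def quadtree_tokens(grid, threshold, top=0, left=0, size=None):
    """Return merged tokens as (row, col, block_size, mean_value) tuples."""
    if size is None:
        size = len(grid)
    block = [grid[r][c]
             for r in range(top, top + size)
             for c in range(left, left + size)]
    mean = sum(block) / len(block)
    # Coarse level first: keep the block whole if it is homogeneous enough.
    if size == 1 or max(abs(v - mean) for v in block) <= threshold:
        return [(top, left, size, mean)]
    # Otherwise descend one level and examine the four quadrants.
    half = size // 2
    tokens = []
    for row_offset in (0, half):
        for col_offset in (0, half):
            tokens += quadtree_tokens(grid, threshold,
                                      top + row_offset, left + col_offset, half)
    return tokens


# Hypothetical 4x4 frame of scalar token features: three flat quadrants
# and one detailed quadrant in the bottom-right corner.
frame = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 2.0, 9.0],
    [1.0, 1.0, 4.0, 7.0],
]
tokens = quadtree_tokens(frame, threshold=0.5)
print(len(tokens))
```

Flat regions collapse to a single coarse token while detailed regions keep fine tokens, so the token budget is spent where the frame actually varies.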


[71] Multigranular Evaluation for Brain Visual Decoding cs.CV | cs.AI | eess.IV | q-bio.NCPDF

Weihao Xia, Cengiz Oztireli

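TL;DR: This paper proposes the BASIC framework for multigranular evaluation of brain visual decoding methods. By quantifying structural fidelity, inferential alignment, and contextual coherence, it addresses the limitations of existing evaluation methods.

Details

Motivation: Existing evaluation methods for brain visual decoding rely mainly on coarse metrics, lack neuroscientific grounding, and cannot capture fine-grained visual distinctions, which limits meaningful comparison and improvement of methods.

Result: A range of visual decoding methods is benchmarked on multiple stimulus-neuroimaging datasets, yielding a more discriminative, interpretable, and comprehensive evaluation.

Insight: BASIC offers a more principled and fine-grained evaluation tool for brain visual decoding, supporting future improvement and comparison of methods.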

Abstract: Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.


[72] Single-pass Adaptive Image Tokenization for Minimum Program Search cs.CV | cs.AI | cs.LG

Shivam Duggal, Sanghyun Byun, William T. Freeman, Antonio Torralba, Phillip Isola

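TL;DR: Proposes KARL, a single-pass adaptive image tokenizer that approximates Kolmogorov Complexity by predicting the appropriate number of tokens for an image in a single forward pass.

Details

Motivation: Most visual representation learning systems use fixed-length representations, ignoring variations in data complexity or familiarity. KARL addresses this with adaptive tokenization while avoiding test-time search over multiple encodings.

Result: KARL matches the performance of multi-pass adaptive tokenizers in a single pass, with its token count serving as a proxy for the minimum description length.

Insight: Adaptive image tokenization has an analogy in Algorithmic Information Theory. The image complexity (KC) predicted by KARL aligns with human intuition, particularly along the structure vs. noise and in- vs. out-of-distribution familiarity axes.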

Abstract: According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL’s training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization, and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity, revealing alignment with human intuition.


[73] Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models cs.CV | cs.LG

Helen Qu, Sang Michael Xie

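TL;DR: This paper studies how word co-occurrence statistics in pretraining data affect compositional generalization in multimodal models such as CLIP and large multimodal models (LMMs). It finds that the pointwise mutual information (PMI) of word co-occurrence correlates strongly with model accuracy, showing that concept combinations matter for model performance.

Details

Motivation: Multimodal models such as CLIP and LMMs perform well on common concepts, but how well they generalize to combinations of concepts remains unclear. The paper investigates how word co-occurrence statistics in the pretraining data affect performance on concept combinations.

Result: PMI in the CLIP pretraining data correlates strongly with zero-shot accuracy (r=0.97, with a 14% accuracy gap between images in the top and bottom 5% of PMI values), and the effect carries over to LMMs (r=0.70 for TextVQA, r=0.62 for VQAv2).

Insight: Compositional generalization in current multimodal models is limited by the distribution of concept combinations in the pretraining data, which calls for new algorithms or architectures that improve it without scaling the training data combinatorially.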

Abstract: CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear – for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of their co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
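The PMI measure mentioned in the abstract, log of p(a, b) / (p(a) p(b)), can be sketched on a toy caption set. This is only an illustration of the standard definition: the caption-level counting, the whitespace tokenization, and the name `pmi_table` are assumptions and may differ from how the authors compute PMI on LAION-400M.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(captions):
    """PMI for every unordered word pair that co-occurs in at least one caption.

    Probabilities are estimated at the caption level: p(w) is the share of
    captions containing w, and p(a, b) the share containing both a and b.
    """
    n = len(captions)
    word_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        words = sorted(set(caption.lower().split()))  # count each word once per caption
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))    # pairs come out alphabetically ordered
    table = {}
    for (a, b), joint in pair_counts.items():
        p_joint = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[(a, b)] = math.log(p_joint / (p_a * p_b))
    return table

# Hypothetical toy corpus: "dog" is frequent, but it pairs unevenly with other words.
captions = ["dog grass", "dog grass", "dog sofa", "cat sofa"]
table = pmi_table(captions)
for pair, value in sorted(table.items()):
    print(pair, round(value, 3))
```

Here "dog" appears in three of four captions, yet its PMI with "grass" is positive and its PMI with "sofa" is negative, which shows how PMI separates pair frequency from single-word frequency.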


cs.CL [Back]

[74] Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses cs.CL | cs.AI | cs.CY | J.4

Jens Rupprecht, Georg Ahnert, Markus Strohmaier

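TL;DR: This paper studies response biases of large language models (LLMs) in social science surveys. All tested models show a clear recency bias and are sensitive to semantic variations. The findings indicate that prompt design and robustness testing need attention when LLMs are used to generate synthetic survey data.

Details

Motivation: LLMs are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This study examines the biases and robustness of LLM survey responses.

Result: All models exhibit recency bias to varying degrees and are sensitive to semantic variations (such as paraphrasing) and to combined perturbations. Larger models are generally more robust but remain sensitive.

Insight: LLMs partially reproduce survey response biases observed in humans. This points to limits on their use in social science and to the need for careful prompt design and robustness testing.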

Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts – we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
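One way to picture an answer-option perturbation and a recency-bias check is the sketch below. It is a hypothetical illustration, not the paper's protocol: the reversed-order perturbation, the prompt layout, the `ask` callable, and the two stand-in respondents are all assumptions introduced here, and no real model API is called.

```python
def build_prompt(question, options):
    """Render a survey question followed by numbered answer options."""
    lines = [question] + [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    return "\n".join(lines)

def last_option_rate(ask, question, orderings):
    """Fraction of option orderings in which the respondent picks the last-listed option.

    `ask(prompt, options)` must return one of the option strings.
    """
    hits = 0
    for order in orderings:
        chosen = ask(build_prompt(question, order), order)
        hits += chosen == order[-1]
    return hits / len(orderings)

def always_last(prompt, options):
    """Hypothetical stand-in for a respondent with extreme recency bias."""
    return options[-1]

def content_based(prompt, options):
    """Hypothetical stand-in for a respondent whose answer ignores option order."""
    return "Agree"

question = "Most people can be trusted."
options = ["Agree", "Neutral", "Disagree"]
orderings = [options, options[::-1]]  # original order plus the reversed-order perturbation

biased_rate = last_option_rate(always_last, question, orderings)
stable_rate = last_option_rate(content_based, question, orderings)
print(biased_rate, stable_rate)
```

With an original and a reversed ordering, an order-insensitive respondent lands on the last option in exactly half of the orderings, so rates well above 0.5 would point toward recency bias.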


[75] Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings cs.CL

Minseon Kim, Jean-Philippe Corbeil, Alessandro Sordoni, Francois Beaulieu, Paul Vozila

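TL;DR: This paper proposes a safety evaluation protocol for language models in the medical domain that focuses on patient and clinician perspectives, filling a gap in prior work on the safety evaluation of medical LLMs.

Details

Motivation: As LLMs are adopted more widely in medicine, safety concerns grow, especially because model outputs can directly affect human health. Prior safety evaluations have focused mostly on general-domain benchmarks and lack considerations specific to medical settings.

Result: The study builds PatientSafetyBench (466 samples over 5 critical categories) and applies its red-teaming protocols to the MediPhi model collection as a case study, laying a foundation for safer deployment in medical domains.

Insight: Safety evaluation of medical LLMs must account for the diversity of user roles. Patient and clinician perspectives are essential for uncovering potential risks.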

Abstract: As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g. patients and clinicians, and the potential for model outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician perspectives, alongside general safety assessments, and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building PatientSafetyBench, which contains 466 samples over 5 critical categories, to measure safety from the perspective of the patient. We apply our red-teaming protocols to the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view (patient, clinician, and general user), establishing a foundation for safer deployment in medical domains.


[76] Towards Interpretable Time Series Foundation Models cs.CL | cs.AI

Matthieu Boileau, Philippe Helluy, Jeremy Pawlus, Svitlana Vyetrenko

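TL;DR: The paper studies how to distill time series reasoning capabilities into small, instruction-tuned language models as a step toward interpretable time series foundation models. Compact Qwen models are fine-tuned on a synthetic dataset with natural language annotations generated by a large model, and new evaluation metrics assess the quality of their reasoning.

Details

Motivation: To develop lightweight, interpretable time series foundation models that suit on-device or privacy-sensitive deployment and can explain temporal patterns in natural language.

Result: The post-trained models acquire meaningful interpretive capabilities, showing that time series understanding can feasibly be compressed into lightweight models.

Insight: Lightweight language models can perform interpretable time series analysis once time series reasoning is distilled into them, which suits privacy-sensitive or on-device applications.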

Abstract: In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.


[77] SAND: Boosting LLM Agents with Self-Taught Action Deliberation cs.CL

Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim

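TL;DR: The paper proposes SAND, a framework that improves LLM agents through self-taught action deliberation, addressing the tendency of existing methods to commit to suboptimal actions when the action space is under-explored.

Details

Motivation: LLM agents are usually tuned by imitating expert behavior or by preference optimization, but limited exploration of the action space can lead these methods to choose actions that look plausible yet are suboptimal.

Result: On two interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and outperforms state-of-the-art agent tuning approaches.

Insight: With action deliberation and iterative self-improvement, LLM agents can explore a large action space more effectively and select better actions.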

Abstract: Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning about and comparing alternative actions, LLM agents finetuned with these methods may over-commit to seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given a large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.


[78] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning cs.CL

Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang

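TL;DR: RLEP is a reinforcement learning method with experience replay for improving the reasoning of large language models. Replaying verified, high-quality trajectories during training yields faster convergence and stronger final performance.

Details

Motivation: Reinforcement learning for LLMs is often unstable and computationally expensive, and the policy tends to drift away from its pretrained weights. The authors replay verified successful trajectories to avoid fruitless exploration and focus on promising reasoning paths.

Result: On the Qwen2.5-Math-7B base model, accuracy improves on several math benchmarks: AIME-2024 from 38.2% to 39.9%, AIME-2025 from 19.8% to 22.3%, and AMC-2023 from 77.0% to 82.2%.

Insight: Replaying high-quality trajectories stabilizes training, avoids fruitless exploration, and speeds up convergence, which makes it valuable for improving LLM reasoning.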

Abstract: Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
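The two phases described in the abstract, collecting verified trajectories and then blending them into each mini-batch, can be sketched as follows. This is a schematic under stated assumptions rather than the released RLEP code: the dictionary layout of a trajectory, the exact-match verifier, the number of replayed items, and the function names are all hypothetical.

```python
import random

def collect_successes(trajectories, verifier):
    """Phase 1: keep only trajectories that pass verification."""
    return [t for t in trajectories if verifier(t)]

def mixed_batch(fresh_rollouts, replay_pool, n_replay, rng):
    """Phase 2: blend newly generated rollouts with replayed verified successes."""
    k = min(n_replay, len(replay_pool))
    replayed = rng.sample(replay_pool, k)   # sample without replacement
    batch = list(fresh_rollouts) + replayed
    rng.shuffle(batch)                      # avoid a fixed fresh/replayed ordering
    return batch

def exact_match(trajectory):
    """Hypothetical verifier: the final answer must equal the reference answer."""
    return trajectory["answer"] == trajectory["target"]

# Hypothetical earlier rollouts, some correct and some not.
earlier = [
    {"id": 0, "answer": "42", "target": "42"},
    {"id": 1, "answer": "7", "target": "9"},
    {"id": 2, "answer": "3", "target": "3"},
]
replay_pool = collect_successes(earlier, exact_match)

# Fresh rollouts from the current policy, not yet verified.
fresh = [
    {"id": 10, "answer": "5", "target": "6"},
    {"id": 11, "answer": "8", "target": "8"},
]
rng = random.Random(0)
batch = mixed_batch(fresh, replay_pool, n_replay=2, rng=rng)
print([t["id"] for t in batch])
```

The ratio of fresh to replayed items is left as a free parameter here because the abstract does not state it.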


[79] Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models cs.CL | cs.AI | cs.LG

Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths

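TL;DR: This paper proposes “machine bullshit” as a conceptual framework for characterizing LLMs’ disregard for truth, quantifies it with a new metric, the Bullshit Index, and introduces a taxonomy of four forms of bullshit. It finds that RLHF fine-tuning exacerbates bullshit, that chain-of-thought (CoT) prompting amplifies specific forms, and that machine bullshit is prevalent in political contexts.

Details

Motivation: Prior work has explored LLM hallucination and sycophancy but lacks a unifying framework for characterizing models’ broader disregard for truth. This paper fills that gap and sheds light on the underlying mechanisms.

Result: RLHF fine-tuning significantly exacerbates bullshit; CoT prompting amplifies empty rhetoric and paltering; in political contexts, weasel words are the dominant strategy.

Insight: The study highlights systematic challenges in AI alignment and offers new insights toward more truthful LLM behavior.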

Abstract: Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and that inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.


[80] PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving cs.CL | cs.AI

Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling

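TL;DR: PLAN-TUNING distills task-decomposition trajectories from large models and fine-tunes smaller models with supervised and reinforcement-learning objectives, improving the smaller models’ performance on complex reasoning tasks.

Details

Motivation: Existing work focuses mainly on the task-decomposition abilities of large language models (LLMs); how to transfer this ability to smaller models through post-training remains underexplored.

Result: On GSM8k and MATH, plan-tuned models outperform strong baselines by about 7% on average. On out-of-domain datasets, they improve by about 10% on OlympiadBench and about 12% on AIME 2024.

Insight: Planning trajectories improve the complex reasoning capabilities of smaller models, indicating that PLAN-TUNING is an effective strategy for improving their performance.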

Abstract: Recently, decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average of $\sim 7\%$. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average $\sim 10\%$ and $\sim 12\%$ performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.


[81] Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code cs.CL | cs.LG

Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu

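TL;DR: The paper proposes TeaR, which uses careful data curation and reinforcement learning to improve the reasoning of large language models while avoiding over-reliance on complex code structures.

Details

Motivation: Existing methods improve LLM reasoning by having models simulate code execution, but they over-rely on complex data structures and algorithms and tend to overfit to algorithmic patterns. TeaR aims to improve reasoning through data curation and reinforcement learning.

Result: TeaR yields significant improvements across 17 benchmarks, including a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.

Insight: Data curation combined with reinforcement learning is an effective way to improve LLM reasoning and avoids the limitations of relying on complex code alone.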

Abstract: Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.


[82] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs cs.CL

Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu

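TL;DR: This paper examines the limits of combining long chain-of-thought supervised fine-tuning (long-CoT SFT) and reinforcement learning (RL) in vision-language models (VLMs), a “synergy dilemma”, and shows where the two methods complement and conflict with each other in improving reasoning.

Details

Motivation: Long-CoT SFT and RL exhibit synergy in language-only models, but their joint effectiveness in VLMs remains unclear. The paper systematically investigates the distinct roles and interplay of these two post-training techniques on multimodal reasoning tasks.

Result: Long-CoT SFT improves reasoning on difficult questions but introduces verbosity and degrades performance on simpler ones. RL promotes generalization and brevity, with less prominent gains on the hardest questions. Combining the two fails to produce additive benefits and instead leads to trade-offs.

Insight: More seamless and adaptive approaches to combining post-training techniques are needed to unlock the full reasoning potential of VLMs.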

Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, fails in every case to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This “synergy dilemma” highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.


[83] Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation cs.CL | cs.AI | cs.CV

Yupu Liang, Yaping Zhang, Zhiyang Zhang, Yang Zhao, Lu Xiang

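TL;DR: The paper proposes M4Doc, a single-to-mix modality alignment framework that uses multimodal large language models (MLLMs) to address data scarcity and modality interaction in document image machine translation (DIMT), substantially improving translation quality.

Details

Motivation: DIMT is challenged by limited training data and the complex interplay between visual and textual information, and existing methods generalize poorly across domains and to complex scenarios.

Result: Experiments show that M4Doc substantially improves translation quality, especially in cross-domain generalization and challenging document image scenarios.

Insight: Aligning an image-only encoder with multimodal representations lets a lightweight model learn visual-textual correlations efficiently, and bypassing the MLLM at inference keeps the approach practical.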

Abstract: Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.


[84] When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance cs.CL | cs.AI

Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, Xingyu Wu

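TL;DR: This paper presents what it describes as the first comprehensive review of large language models (LLMs) in the legal domain. It proposes a dual-lens taxonomy that integrates legal reasoning frameworks and professional ontologies to unify historical research and contemporary breakthroughs. Technical innovations such as sparse attention mechanisms and mixture-of-experts architectures have driven progress in task generalization, reasoning formalization, and workflow integration, but challenges such as hallucination and explainability deficits remain.

Details

Motivation: The complexity and specialization of the legal domain call for more capable tools. The emergent capabilities of LLMs, such as contextual reasoning and generative argumentation, open new possibilities for legal AI but also bring technical, ethical, and adaptation challenges.

Result: The review documents significant progress by LLMs in legal text processing, knowledge integration, and evaluation rigor, and identifies unresolved problems such as hallucination and explainability deficits.

Insight: Future research directions include low-resource systems, multimodal evidence integration, and dynamic rebuttal handling, and technical progress must be balanced with ethical governance.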

Abstract: This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual-lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLMs introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.


[85] StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model cs.CL | cs.SD | eess.AS

Shoutao Guo, Xiang Li, Shaolei Zhang, Mengge Liu, Wei Chen

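TL;DR: StreamUni achieves streaming speech translation (StreamST) with a unified Large Speech-Language Model (LSLM). It uses speech Chain-of-Thought (CoT) to guide the model in generating multi-stage outputs, accomplishing speech segmentation, policy decision, and translation generation together without massive policy-specific training.

Details

Motivation: Existing StreamST methods typically operate on sentence-level speech segments (SimulST) and must work together with a segmentation model, which limits the available context and makes policy learning difficult.

Result: Experiments show that StreamUni achieves state-of-the-art performance on StreamST tasks.

Insight: Speech CoT and multi-stage outputs offer a new approach to streaming speech translation, reducing the reliance on policy-specific training while improving performance.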

Abstract: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.


[86] Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation cs.CL | cs.AI | 68T50, 68T07, 03G30, 18C10 | I.2.7; I.2.6; F.4.1

Bugra Kilictas, Faruk Alpay

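TL;DR: The paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture in which transfinite fixed-point convergence iteratively incorporates hierarchical sub-games. Combining game theory with fixed-point theory, it proves the existence and uniqueness of semantic equilibria and provides a verification suite.

Details

Motivation: To extend the self-referential framework of Alpay Algebra by combining game theory and fixed-point theory, in order to handle the complex semantics of the alignment process between AI systems and documents.

Result: The paper reports that transfinite fixed-point convergence yields a unified equilibrium for multi-layered semantic games, and argues that the framework applies to realistic AI cognition models.

Insight: Game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. The paper itself is designed as an instance of the “semantic virus” concept, propagating its patterns in AI embedding spaces.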

Abstract: This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV’s empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach’s fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz’ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces – a deliberate instantiation of the “semantic virus” concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.


[87] DocCHA: Towards LLM-Augmented Interactive Online diagnosis System cs.CL

Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tür, Tarek Abdelzaher

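TL;DR: DocCHA is an interactive online diagnosis system built on large language models (LLMs). It performs clinical reasoning in modular stages, improving diagnostic accuracy and symptom recall.

Details

Motivation: Existing Conversational Health Agents (CHAs) lack adaptive multi-turn reasoning and transparent decision-making, which limits their real-world applicability in clinical diagnosis.

Result: On two Chinese consultation datasets (IMCS21, DX), DocCHA outperforms prompting-based LLM baselines, with up to 5.18% higher diagnostic accuracy and over 30% improvement in symptom recall.

Insight: DocCHA shows that modular design and confidence scores are effective for structured, transparent dialogue, pointing toward trustworthy clinical assistants in multilingual and resource-constrained settings.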

Abstract: Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only a modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations – paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.


[88] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment cs.CL

Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan

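TL;DR: The paper proposes SAGE, a framework that improves vision-language models (VLMs) on industrial anomaly detection through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO), and introduces the AD-PL dataset and the Multiscale Logical Evaluation (MLE) framework.

Details

Motivation: Existing VLMs perform poorly in industrial anomaly detection, mainly because the task is domain-specific and the models lack interpretability. SAGE aims to address these problems and improve performance and generalization.

Result: SAGE performs strongly on industrial anomaly detection datasets under zero-shot and one-shot settings.

Insight: Domain knowledge enhancement and optimization toward expert preferences can substantially improve both the performance and the explanations of VLMs in industrial anomaly detection.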

Abstract: While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.


[89] MIRIX: Multi-Agent Memory System for LLM-Based Agents cs.CL | cs.AI

Yu Wang, Xi Chen

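TL;DR: MIRIX is a modular, multi-agent memory system. With six structured memory types and a multi-agent framework, it addresses limitations of existing AI memory systems and improves the memory capabilities of language models in real-world scenarios.

Details

Motivation: Existing AI memory systems mostly rely on simple, flat memory components and cannot effectively personalize, abstract, and reliably recall user-specific information over time. MIRIX targets this core challenge.

Result: (1) On ScreenshotVQA, 35% higher accuracy than the RAG baseline with storage requirements reduced by 99.9%; (2) on the LOCOMO conversation benchmark, state-of-the-art performance of 85.4%.

Insight: Combining structured memory with a multi-agent framework improves the long-term memory of language models, notably in multimodal settings, and suggests a direction for future AI memory systems.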

Abstract: Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field’s most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.


[90] Why is Your Language Model a Poor Implicit Reward Model? cs.CL | cs.AI | cs.LG | stat.MLPDF

Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora

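TL;DR: The paper examines why language models used as implicit reward models (IM-RMs) generalize differently from explicit reward models (EX-RMs). It finds that IM-RMs rely more heavily on superficial token-level cues, so they often generalize worse under token-level distribution shifts as well as in-distribution.

Details

Motivation: The goal is to explain why language models used as implicit reward models generalize worse than explicit reward models, even though the two can share the same training data, loss function, and language model.

Result: Theory and experiments show that IM-RMs generalize worse than EX-RMs under token-level distribution shifts. The evidence also contradicts the hypothesis that IM-RMs struggle on tasks where generation is harder than verification.

Insight: Even seemingly minor design choices, such as how the reward is computed, can substantially affect the generalization behavior of reward models, which matters for language model post-training and inference pipelines.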

Abstract: Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
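
Sketch: The two reward computations contrasted in the abstract can be illustrated roughly as follows. This is a minimal example, not the authors' code. It assumes the DPO-style definition of the implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)); the function names, toy inputs, and default beta are hypothetical.

```python
def explicit_reward(hidden, head_weights):
    """EX-RM: a dedicated linear head over the hidden representation
    of the (prompt, response) pair. Inputs are plain lists of floats."""
    return sum(h * w for h, w in zip(hidden, head_weights))


def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """IM-RM (assumed DPO-style form): beta times the log-likelihood ratio
    of the response under the trained model versus a reference model.
    Each argument holds per-token log-probabilities of the same response."""
    log_pi = sum(policy_token_logprobs)   # log pi(y | x)
    log_ref = sum(ref_token_logprobs)     # log pi_ref(y | x)
    return beta * (log_pi - log_ref)


if __name__ == "__main__":
    # Toy numbers only; they do not come from the paper.
    print(explicit_reward([1.0, 2.0], [0.5, -0.25]))
    print(implicit_reward([-1.0, -2.0], [-1.5, -2.5], beta=0.5))
```

Because the implicit reward is built directly from per-token log-probabilities, it is at least plausible that it reacts to token-level changes in a response, which is the kind of sensitivity the abstract describes.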


[91] Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology cs.CL | cs.AI | L01.224.900.500 (Primary), L01.700.508.300, L01.224.050.375, H02.403.720.750, N04.590, N04.452.758.625 (Secondary) | I.2.7; H.3.3; J.3; I.2.9; C.4PDF

Sabine Felde, Rüdiger Buchkremer, Gamal Chehab, Christian Thielscher, Jörg HW Distler

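TL;DR: Small language models (SLMs) combined with retrieval-augmented generation (RAG) outperform larger language models (LLMs) in clinical decision support, while using less energy and costing less.

Details

Motivation: Evaluate the performance of LLMs and SLMs for clinical decision support in rheumatology, and examine their practicality and resource efficiency.

Result: SLMs combined with RAG achieve higher diagnostic and therapeutic performance than larger models, but expert oversight remains essential because no model consistently reached specialist-level accuracy.

Insight: In resource-limited healthcare settings, SLMs combined with RAG are an efficient and feasible option.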

Abstract: Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.


[92] Automating Expert-Level Medical Reasoning Evaluation of Large Language Models cs.CLPDF

Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song

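TL;DR: The paper introduces the MedThink-Bench benchmark and the LLM-w-Ref evaluation framework for rigorous, explainable, and scalable assessment of the medical reasoning of large language models (LLMs), and validates both experimentally.

Details

Motivation: LLMs are increasingly used in clinical decision-making, but existing evaluation methods either lack rigor or scale poorly, so a transparent and trustworthy tool for evaluating medical reasoning is needed.

Result: LLM-w-Ref correlates strongly with expert judgments. Smaller models (e.g., MedGemma-27B) can outperform larger proprietary models (e.g., OpenAI-o3).

Insight: Evaluating medical reasoning requires combining expert knowledge with scalability, and smaller models can have an edge on specific tasks.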

Abstract: As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs’ medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs’ medical reasoning, advancing their safe and responsible deployment in clinical practice.


[93] PyVision: Agentic Vision with Dynamic Tooling cs.CL | cs.AI | cs.CVPDF

Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu

TL;DR: PyVision is an interactive framework that enables multimodal LLMs to dynamically generate and refine Python tools for visual reasoning, improving performance significantly on benchmarks.

Details

Motivation: Prior visual reasoning approaches are limited by static toolsets. PyVision addresses this by allowing models to dynamically create and refine tools, enhancing flexibility and interpretability.

Result: PyVision boosts performance by +7.8% for GPT-4.1 on V* and +31.1% for Claude-4.0-Sonnet on VLMsAreBlind-mini.

Insight: Dynamic tooling enables models to invent tools, advancing toward more autonomous and agentic visual reasoning.

Abstract: LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.


cs.SD [Back]

[94] Input Conditioned Layer Dropping in Speech Foundation Models cs.SD | cs.CV | eess.ASPDF

Abdul Hannan, Daniele Falavigna, Alessio Brutti

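TL;DR: This paper proposes an input-driven layer dropping (LD) method in which a lightweight selection network dynamically decides which combination of layers to run, making speech foundation models more adaptable to edge and IoT settings.

Details

Motivation: Compute resources on edge and IoT devices vary over time, so model architectures must adapt to different computational loads. Existing layer dropping methods are either limited in how they select layers or require significant changes to the neural architecture.

Result: The method clearly outperforms random dropping and performs on par with or better than early exit.

Insight: An input-driven dynamic adjustment mechanism can balance computational load against model performance, offering a practical solution for resource-constrained settings.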

Abstract: Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations either in the mode of selecting layers or by significantly modifying the neural architecture. To this end, we propose input-driven $\mathcal{LD}$ that employs the network’s input features and a lightweight layer selecting network to determine the optimum combination of processing layers. Extensive experimentation on 4 speech and audio public benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, thoroughly outperforming random dropping and producing on-par (or better) results to early exit.
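
Sketch: The idea of choosing layers from the input can be illustrated with a toy example. This is only a sketch under stated assumptions, not the paper's method: the selector here is a single linear scoring step, the layers with the highest scores are kept up to a fixed budget, and skipped layers act as the identity. The names `select_layers`, `forward`, and `budget` are hypothetical.

```python
def select_layers(input_features, selector_weights, budget):
    """Score every layer from the input features (one weight row per layer)
    and return the indices of the `budget` highest-scoring layers, in order."""
    scores = [
        sum(w * x for w, x in zip(row, input_features))
        for row in selector_weights
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def forward(x, layers, keep):
    """Run only the selected layers; a dropped layer is skipped (identity)."""
    for i, layer in enumerate(layers):
        if i in keep:
            x = layer(x)
    return x


if __name__ == "__main__":
    # Toy "layers" operating on a scalar, for illustration only.
    layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    selector_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    keep = select_layers([0.2, 0.9], selector_weights, budget=2)
    print(keep, forward(5, layers, keep))
```

Lowering `budget` at run time trades accuracy for compute without retraining the backbone, which matches the dynamic-resource setting the abstract targets.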


cs.AI [Back]

[95] Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery cs.AI | astro-ph.IM | cs.CL | cs.MAPDF

Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo

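TL;DR: The paper presents cmbagent, a multi-agent system of about 30 large language model (LLM) agents that automates scientific research workflows with no human in the loop, and applies it successfully to a PhD-level cosmology task.

Details

Motivation: Automating scientific research is a complex challenge that requires coordinating many tasks. The paper aims to achieve a fully automated research workflow with a multi-agent system, reducing human intervention.

Result: cmbagent completes the cosmology task and outperforms state-of-the-art LLMs on two benchmark sets. The source code and demonstration videos are publicly available.

Insight: Multi-agent systems can coordinate complex tasks effectively, and LLMs show promise for research automation, although task allocation and collaboration mechanisms need further study.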

Abstract: We present a multi-agent system for automation of scientific research tasks, cmbagent. The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.


[96] ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning cs.AI | cs.CL | eess.ASPDF

Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu

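TL;DR: ViDove is a translation agent system for multimodal input that improves translation quality by using visual and contextual background information. It also adds memory modules and domain knowledge, and significantly outperforms existing methods.

Details

Motivation: Existing LLM-based translation agents typically accept only text input and make no use of multimodal information. ViDove aims to fill this gap by mimicking the workflow of human translators.

Result: On subtitle generation and general translation tasks, ViDove improves BLEU by 28% and SubER by 15% over previous state-of-the-art baselines.

Insight: Multimodal background information and domain knowledge can significantly improve translation quality, especially on complex tasks such as long-form video subtitling.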

Abstract: LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: https://github.com/pigeonai-org/ViDove


eess.IV [Back]

[97] Semi-supervised learning and integration of multi-sequence MR-images for carotid vessel wall and plaque segmentation eess.IV | cs.CV | cs.LGPDF

Marie-Christine Pali, Christina Schwaiger, Malik Galijasevic, Valentin K. Ladenhauf, Stephanie Mangesius

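TL;DR: This paper proposes a semi-supervised deep learning method for segmenting the carotid vessel wall and plaque in multi-sequence MRI data. It combines a coarse localization network and a fine segmentation network with data fusion strategies to address scarce labeled data and complex plaque morphology.

Details

Motivation: Carotid plaque analysis is crucial for assessing the risk of atherosclerosis and ischemic stroke, but plaque morphology is complex and labeled data are scarce, so an effective segmentation method is needed.

Result: Evaluated on five-sequence MRI data from 52 patients, the method proves effective, and the choice of fusion point is shown to matter for U-Net-based architectures.

Insight: Multi-sequence data fusion and semi-supervised learning can improve carotid artery segmentation, particularly when labeled data are limited.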

Abstract: The analysis of carotid arteries, particularly plaques, in multi-sequence Magnetic Resonance Imaging (MRI) data is crucial for assessing the risk of atherosclerosis and ischemic stroke. In order to evaluate metrics and radiomic features, quantifying the state of atherosclerosis, accurate segmentation is important. However, the complex morphology of plaques and the scarcity of labeled data pose significant challenges. In this work, we address these problems and propose a semi-supervised deep learning-based approach designed to effectively integrate multi-sequence MRI data for the segmentation of carotid artery vessel wall and plaque. The proposed algorithm consists of two networks: a coarse localization model identifies the region of interest guided by some prior knowledge on the position and number of carotid arteries, followed by a fine segmentation model for precise delineation of vessel walls and plaques. To effectively integrate complementary information across different MRI sequences, we investigate different fusion strategies and introduce a multi-level multi-sequence version of U-Net architecture. To address the challenges of limited labeled data and the complexity of carotid artery MRI, we propose a semi-supervised approach that enforces consistency under various input transformations. Our approach is evaluated on 52 patients with arteriosclerosis, each with five MRI sequences. Comprehensive experiments demonstrate the effectiveness of our approach and emphasize the role of fusion point selection in U-Net-based architectures. To validate the accuracy of our results, we also include an expert-based assessment of model performance. Our findings highlight the potential of fusion strategies and semi-supervised learning for improving carotid artery segmentation in data-limited MRI applications.


[98] Compressive Imaging Reconstruction via Tensor Decomposed Multi-Resolution Grid Encoding eess.IV | cs.CVPDF

Zhenyu Jin, Yisi Luo, Xile Zhao, Deyu Meng

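TL;DR: This paper proposes GridTD, an unsupervised continuous representation framework for compressive imaging (CI) reconstruction based on tensor decomposition and multi-resolution grid encoding. It combines the hierarchical modeling ability of multi-resolution grid encoding with the compactness of tensor decomposition to balance efficiency and representation ability, and supports its advantages with theoretical analysis and experiments.

Details

Motivation: CI reconstruction must recover high-dimensional images from low-dimensional compressed measurements. Existing unsupervised representations struggle to balance representation ability and efficiency, so a more efficient representation framework is needed.

Result: GridTD outperforms existing methods on video SCI, spectral SCI, and compressive dynamic MRI reconstruction, making it a versatile, state-of-the-art CI reconstruction method.

Insight: The theoretical analyses of GridTD (Lipschitz property, generalization error bound, and fixed-point convergence) reveal its intrinsic advantages and suggest new directions for research on continuous representation models.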

Abstract: Compressive imaging (CI) reconstruction, such as snapshot compressive imaging (SCI) and compressive sensing magnetic resonance imaging (MRI), aims to recover high-dimensional images from low-dimensional compressed measurements. This process critically relies on learning an accurate representation of the underlying high-dimensional image. However, existing unsupervised representations may struggle to achieve a desired balance between representation ability and efficiency. To overcome this limitation, we propose Tensor Decomposed multi-resolution Grid encoding (GridTD), an unsupervised continuous representation framework for CI reconstruction. GridTD optimizes a lightweight neural network and the input tensor decomposition model whose parameters are learned via multi-resolution hash grid encoding. It inherently enjoys the hierarchical modeling ability of multi-resolution grid encoding and the compactness of tensor decomposition, enabling effective and efficient reconstruction of high-dimensional images. Theoretical analyses for the algorithm’s Lipschitz property, generalization error bound, and fixed-point convergence reveal the intrinsic superiority of GridTD as compared with existing continuous representation models. Extensive experiments across diverse CI tasks, including video SCI, spectral SCI, and compressive dynamic MRI reconstruction, consistently demonstrate the superiority of GridTD over existing methods, positioning GridTD as a versatile and state-of-the-art CI reconstruction method.


[99] MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC) eess.IV | cs.CVPDF

Hasaan Maqsood, Saif Ur Rehman Khan

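TL;DR: The paper proposes MeD-3D, a multimodal deep learning framework for precise recurrence prediction in clear cell renal cell carcinoma (ccRCC) that integrates radiology, histopathology, clinical, and genomic data.

Details

Motivation: Existing single-modality prediction models cannot fully capture the complexity of ccRCC, which limits predictive accuracy. Integrating multimodal data can improve prediction and support clinical decision-making.

Result: Experiments show that integrating multimodal data significantly improves the accuracy of ccRCC recurrence prediction.

Insight: Multimodal data fusion captures disease complexity better and supports personalized medicine.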

Abstract: Accurate prediction of recurrence in clear cell renal cell carcinoma (ccRCC) remains a major clinical challenge due to the disease’s complex molecular, pathological, and clinical heterogeneity. Traditional prognostic models, which rely on single data modalities such as radiology, histopathology, or genomics, often fail to capture the full spectrum of disease complexity, resulting in suboptimal predictive accuracy. This study aims to overcome these limitations by proposing a deep learning (DL) framework that integrates multimodal data, including CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles, to improve the prediction of ccRCC recurrence and enhance clinical decision-making. The proposed framework utilizes a comprehensive dataset curated from multiple publicly available sources, including TCGA, TCIA, and CPTAC. To process the diverse modalities, domain-specific models are employed: CLAM, a ResNet50-based model, is used for histopathology WSIs, while MeD-3D, a pre-trained 3D-ResNet18 model, processes CT and MRI images. For structured clinical and genomic data, a multi-layer perceptron (MLP) is used. These models are designed to extract deep feature embeddings from each modality, which are then fused through an early and late integration architecture. This fusion strategy enables the model to combine complementary information from multiple sources. Additionally, the framework is designed to handle incomplete data, a common challenge in clinical settings, by enabling inference even when certain modalities are missing.


[100] Label-Efficient Chest X-ray Diagnosis via Partial CLIP Adaptation eess.IV | cs.CVPDF

Heet Nitinkumar Dalsania

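TL;DR: The paper proposes a label-efficient strategy that partially fine-tunes the visual encoder of a CLIP model for chest X-ray diagnosis, significantly improving few-shot performance.

Details

Motivation: Medical imaging tasks usually depend on large labeled datasets, but in real hospital settings annotations are sparse and hard to obtain, so a label-efficient solution is needed.

Result: In the few-shot setting, the method improves mean AUC by more than 20% over the zero-shot baseline.

Insight: Pre-trained vision-language features transfer effectively to few-shot medical imaging tasks, offering a practical and scalable solution for real hospital settings.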

Abstract: Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP’s pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20% improvement in mean AUC score as compared to the zero-shot baseline. The key aspect of this work is to attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally, this research is intended for academic and experimental purposes only and has not been peer reviewed yet. All code is found at https://github.com/heet007-code/CLIP-disease-xray.


[101] Computationally Efficient Information-Driven Optical Design with Interchanging Optimization eess.IV | cs.CE | cs.CV | cs.IT | math.IT | physics.opticsPDF

Eric Markley, Henry Pinkard, Leyla Kabuli, Nalini Singh, Laura Waller

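TL;DR: The paper proposes IDEAL-IO, an improved information-driven optical design method that uses interchanging optimization to address the high memory usage, long runtimes, and mismatched objective function of IDEAL, and that applies to a range of imaging systems.

Details

Motivation: IDEAL enables application-agnostic optical design but suffers from high memory usage, long runtimes, and a potentially mismatched objective function. IDEAL-IO is proposed to address these problems.

Result: On diffractive optics, lensless imaging, and snapshot 3D microscopy, IDEAL-IO reduces memory usage and runtime by up to 6x while reaching better designs.

Insight: By decoupling density estimation from optical parameter optimization, IDEAL-IO offers a more practical and scalable strategy for information-driven optical design.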

Abstract: Recent work has demonstrated that imaging systems can be evaluated through the information content of their measurements alone, enabling application-agnostic optical design that avoids computational decoding challenges. Information-Driven Encoder Analysis Learning (IDEAL) was proposed to automate this process through gradient-based optimization. In this work, we study IDEAL across diverse imaging systems and find that it suffers from high memory usage, long runtimes, and a potentially mismatched objective function due to end-to-end differentiability requirements. We introduce IDEAL with Interchanging Optimization (IDEAL-IO), a method that decouples density estimation from optical parameter optimization by alternating between fitting models to current measurements and updating optical parameters using fixed models for information estimation. This approach reduces runtime and memory usage by up to 6x while enabling more expressive density models that guide optimization toward superior designs. We validate our method on diffractive optics, lensless imaging, and snapshot 3D microscopy applications, establishing information-theoretic optimization as a practical, scalable strategy for real-world imaging system design.


eess.SP [Back]

[102] mmFlux: Crowd Flow Analytics with Commodity mmWave MIMO Radar eess.SP | cs.CVPDF

Anurag Pallaprolu, Winston Hurst, Yasamin Mostofi

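TL;DR: The paper proposes mmFlux, a new framework that uses mmWave radar to analyze crowd flow patterns and semantics. It combines optical flow estimation with noise filtering to generate high-fidelity flow fields, then extracts key semantics through geometric graphs and Jacobian analysis. Experiments validate its effectiveness.

Details

Motivation: Conventional crowd analysis methods (e.g., cameras) are limited by privacy and occlusion issues. mmWave radar offers a non-invasive alternative, but precise capture of complex flow patterns and semantics has been lacking.

Result: Experiments show that the framework reconstructs complex crowd flow structures with high fidelity and accurately infers semantics such as turns, boundaries, and gatherings.

Insight: mmWave radar combined with signal processing and geometric analysis provides a privacy-friendly and robust solution for crowd analytics.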

Abstract: In this paper, we present a novel framework for extracting underlying crowd motion patterns and inferring crowd semantics using mmWave radar. First, our proposed signal processing pipeline combines optical flow estimation concepts from vision with novel statistical and morphological noise filtering to generate high-fidelity mmWave flow fields - compact 2D vector representations of crowd motion. We then introduce a novel approach that transforms these fields into directed geometric graphs, where edges capture dominant flow currents, vertices mark crowd splitting or merging, and flow distribution is quantified across edges. Finally, we show that by analyzing the local Jacobian and computing the corresponding curl and divergence, we can extract key crowd semantics for both structured and diffused crowds. We conduct 21 experiments on crowds of up to (and including) 20 people across 3 areas, using commodity mmWave radar. Our framework achieves high-fidelity graph reconstruction of the underlying flow structure, even for complex crowd patterns, demonstrating strong spatial alignment and precise quantitative characterization of flow split ratios. Finally, our curl and divergence analysis accurately infers key crowd semantics, e.g., abrupt turns, boundaries where flow directions shift, dispersions, and gatherings. Overall, these findings validate our framework, underscoring its potential for various crowd analytics applications.
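
Sketch: The curl and divergence mentioned in the abstract can be computed from the local Jacobian of a 2D flow field. The example below is a minimal illustration using central finite differences on a regular grid, not the authors' pipeline. The grid layout (`u[row][col]` for the x-component, `v[row][col]` for the y-component), the spacing `h`, and the function names are assumptions.

```python
def jacobian_at(u, v, i, j, h=1.0):
    """Central-difference Jacobian of the flow (u, v) at interior cell
    (row i, column j). Columns index x, rows index y."""
    du_dx = (u[i][j + 1] - u[i][j - 1]) / (2 * h)
    du_dy = (u[i + 1][j] - u[i - 1][j]) / (2 * h)
    dv_dx = (v[i][j + 1] - v[i][j - 1]) / (2 * h)
    dv_dy = (v[i + 1][j] - v[i - 1][j]) / (2 * h)
    return du_dx, du_dy, dv_dx, dv_dy


def divergence(u, v, i, j, h=1.0):
    """Trace of the Jacobian: positive where flow spreads out,
    negative where it converges."""
    du_dx, _, _, dv_dy = jacobian_at(u, v, i, j, h)
    return du_dx + dv_dy


def curl(u, v, i, j, h=1.0):
    """Scalar 2D curl: nonzero where the flow rotates locally."""
    _, du_dy, dv_dx, _ = jacobian_at(u, v, i, j, h)
    return dv_dx - du_dy


if __name__ == "__main__":
    # Toy 3x3 field pointing away from the center cell (a dispersing crowd).
    u = [[j - 1 for j in range(3)] for _ in range(3)]
    v = [[i - 1 for _ in range(3)] for i in range(3)]
    print(divergence(u, v, 1, 1), curl(u, v, 1, 1))
```

In standard vector calculus terms, a positive divergence would correspond to a dispersion and a negative one to a gathering, while a large curl magnitude would indicate turning flow; how the paper thresholds these quantities is not stated in the abstract.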


cs.RO [Back]

[103] LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation cs.RO | cs.CVPDF

Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva

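TL;DR: The paper introduces LangNavBench, a semantic navigation benchmark focused on natural language understanding, along with the Multi-Layered Feature Map (MLFM) method, which performs well on fine-grained language instructions.

Details

Motivation: Large vision-language models have advanced semantic navigation, but there is no benchmark dataset or method dedicated to testing language understanding. LangNavBench fills this gap.

Result: MLFM outperforms state-of-the-art mapping-based navigation baselines on the LangNav dataset.

Insight: Language understanding is critical in semantic navigation, especially for fine-grained instructions. Multi-layered semantic representations can significantly improve performance.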

Abstract: Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Despite these advances, we still lack a clear, language-focused benchmark for testing how well such agents ground the words in their instructions. We address this gap with LangNav, an open-set dataset specifically created to test an agent’s ability to locate objects described at different levels of detail, from broad category names to fine attributes and object-object relations. Every description in LangNav was manually checked, yielding a lower error rate than existing lifelong- and semantic-navigation datasets. On top of LangNav we build LangNavBench, a benchmark that measures how well current semantic-navigation methods understand and act on these descriptions while moving toward their targets. LangNavBench allows us to systematically compare models on their handling of attributes, spatial and relational cues, and category hierarchies, offering the first thorough, language-centric evaluation of embodied navigation systems. We also present Multi-Layered Feature Map (MLFM), a method that builds a queryable multi-layered semantic map, particularly effective when dealing with small objects or instructions involving spatial relations. MLFM outperforms state-of-the-art mapping-based navigation baselines on the LangNav dataset.


cs.GR [Back]

[104] SD-GS: Structured Deformable 3D Gaussians for Efficient Dynamic Scene Reconstruction cs.GR | cs.CVPDF

Wei Yao, Shuzhao Xie, Letian Li, Weixiang Zhang, Zhixin Lai

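TL;DR: SD-GS is an efficient dynamic scene reconstruction framework that uses a hierarchical deformable anchor grid and an adaptive densification strategy to significantly reduce model size and improve computational efficiency while preserving visual quality.

Details

Motivation: Current 4D Gaussian frameworks deliver excellent visual fidelity and rendering speed for dynamic scene reconstruction, but the trade-off between storage cost and the ability to model complex motion limits their practical use.

Result: Model size drops by 60% on average and FPS improves by 100% on average, while visual quality matches or exceeds existing methods.

Insight: With hierarchical and adaptive strategies, SD-GS improves both storage and performance in dynamic scene reconstruction.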

Abstract: Current 4D Gaussian frameworks for dynamic scene reconstruction deliver impressive visual fidelity and rendering speed. However, the inherent trade-off between storage costs and the ability to characterize complex physical motions significantly limits the practical application of these methods. To tackle these problems, we propose SD-GS, a compact and efficient dynamic Gaussian splatting framework for complex dynamic scene reconstruction, featuring two key contributions. First, we introduce a deformable anchor grid, a hierarchical and memory-efficient scene representation where each anchor point derives multiple 3D Gaussians in its local spatiotemporal region and serves as the geometric backbone of the 3D scene. Second, to enhance modeling capability for complex motions, we present a deformation-aware densification strategy that adaptively grows anchors in under-reconstructed high-dynamic regions while reducing redundancy in static areas, achieving superior visual quality with fewer anchors. Experimental results demonstrate that, compared to state-of-the-art methods, SD-GS achieves an average of 60% reduction in model size and an average of 100% improvement in FPS, significantly enhancing computational efficiency while maintaining or even surpassing visual quality.


[105] Capture Stage Environments: A Guide to Better Matting cs.GR | cs.CVPDF

Hannah Dröge, Janelle Pfeifer, Saskia Rabich, Markus Plack, Reinhard Klein

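TL;DR: The paper examines the challenges of image matting in professional capture stage environments, and offers guidelines for an improved workflow along with an efficient way to adapt state-of-the-art methods.

Details

Motivation: Common matting algorithms perform poorly on capture stage content and cannot cope with its peculiarities, so solutions tailored to this environment are needed.

Result: The paper validates the proposed approach experimentally and shows its benefits in both offline and real-time settings.

Insight: The peculiarities of capture stage content pose new challenges for matting, and proactive intervention combined with efficient adaptation can significantly improve results.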

Abstract: Capture stages are high-end sources of state-of-the-art recordings for downstream applications in movies, games, and other media. One crucial step in almost all pipelines is the matting of images to isolate the captured performances from the background. While common matting algorithms deliver remarkable performance in other applications like teleconferencing and mobile entertainment, we found that they struggle significantly with the peculiarities of capture stage content. The goal of our work is to share insights into those challenges as a curated list of these characteristics, along with a constructive discussion of proactive interventions, and to present a guideline to practitioners for an improved workflow to mitigate unresolved challenges. To this end, we also demonstrate an efficient pipeline to adapt state-of-the-art approaches to such custom setups without the need for extensive annotations, both offline and in real time. For an objective evaluation, we propose a validation methodology based on a leading diffusion model that highlights the benefits of our approach.


[106] RTR-GS: 3D Gaussian Splatting for Inverse Rendering with Radiance Transfer and Reflection cs.GR | cs.CVPDF

Yongyang Zhou, Fang-Lue Zhang, Zichen Wang, Lei Zhang

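TL;DR: This paper proposes RTR-GS, a 3D Gaussian Splatting framework for inverse rendering that combines radiance transfer and reflection, decomposes BRDF and lighting, and delivers credible relighting results.

Details

Motivation: 3D Gaussian Splatting (3DGS) excels at novel view synthesis, but inverse rendering and relighting remain challenging, particularly for reflective objects.

Result: Experiments show that the method performs well on novel view synthesis, normal estimation, decomposition, and relighting while keeping training and inference efficient.

Insight: Separating high-frequency and low-frequency appearance with a hybrid rendering model mitigates the floating artifacts caused by spherical harmonic overfitting.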

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities in novel view synthesis. However, rendering reflective objects remains a significant challenge, particularly in inverse rendering and relighting. We introduce RTR-GS, a novel inverse rendering framework capable of robustly rendering objects with arbitrary reflectance properties, decomposing BRDF and lighting, and delivering credible relighting results. Given a collection of multi-view images, our method effectively recovers geometric structure through a hybrid rendering model that combines forward rendering for radiance transfer with deferred rendering for reflections. This approach successfully separates high-frequency and low-frequency appearances, mitigating floating artifacts caused by spherical harmonic overfitting when handling high-frequency details. We further refine BRDF and lighting decomposition using an additional physically-based deferred rendering branch. Experimental results show that our method enhances novel view synthesis, normal estimation, decomposition, and relighting while maintaining an efficient training and inference process.


cs.CR [Back]

[107] May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks cs.CR | cs.AI | cs.CLPDF

Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes

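TL;DR: The paper targets fine-tuning-based prompt injection defenses with a novel attention-based attack algorithm, breaking two recent defenses with attack success rates of up to 70%.

Details

Motivation: Existing fine-tuning-based prompt injection defenses claim to separate instructions from data so that the LLM does not follow malicious instructions, but their actual security has not been thoroughly validated. The paper evaluates the robustness of this class of defenses.

Result: Attack success rates reach up to 70% with only a modest increase in attacker budget (number of tokens).

Insight: Fine-tuning-based defenses are easily broken in the whitebox setting, and more robust defense mechanisms need to be studied.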

Abstract: A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present with data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses, SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with a modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks


[108] Rainbow Artifacts from Electromagnetic Signal Injection Attacks on Image Sensors cs.CR | cs.CV | B.8; I.4PDF

Youqian Zhang, Xinyu Ji, Zhihao Wang, Qinhong Jiang

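TL;DR: The paper studies electromagnetic signal injection attacks on image sensors, shows that CMOS image sensors produce rainbow-like artifacts under electromagnetic interference, and analyzes the negative impact of these attacks on object detection models.

Details

Motivation: Image sensors are widely used in safety-critical systems (e.g., surveillance and autonomous driving), and the integrity of their data is crucial to system decisions. Electromagnetic signal injection attacks, however, can bypass conventional digital integrity checks and interfere directly with data in the analog domain, raising security concerns.

Result: Experiments show that the rainbow artifacts induced by electromagnetic interference significantly reduce the accuracy of object detection models and lead to mispredictions.

Insight: Physical-layer (analog-domain) attacks on visual perception systems are an important security vulnerability that conventional digital protections do not fully cover, so new defenses are needed.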

Abstract: Image sensors are integral to a wide range of safety- and security-critical systems, including surveillance infrastructure, autonomous vehicles, and industrial automation. These systems rely on the integrity of visual data to make decisions. In this work, we investigate a novel class of electromagnetic signal injection attacks that target the analog domain of image sensors, allowing adversaries to manipulate raw visual inputs without triggering conventional digital integrity checks. We uncover a previously undocumented attack phenomenon on CMOS image sensors: rainbow-like color artifacts induced in captured images through carefully tuned electromagnetic interference. We further evaluate the impact of these attacks on state-of-the-art object detection models, showing that the injected artifacts propagate through the image signal processing pipeline and lead to significant mispredictions. Our findings highlight a critical and underexplored vulnerability in the visual perception stack, underscoring the need for more robust defenses against physical-layer attacks in such systems.


cs.LG

[109] Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate cs.LG | cs.CL

A. Bochkov

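TL;DR: The paper proposes a Transformer training approach based on modular composition and layer-wise expansion, built on frozen input embeddings, which demonstrates efficient and flexible model scaling.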

Details

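Motivation: Large language models (LLMs) are conventionally trained monolithically and end to end, which is resource-intensive and inflexible. This paper explores a modular, layer-wise expansion approach that uses frozen embeddings to enable efficient and flexible model development.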

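Result: The modularly merged MoE model outperforms its individual experts on tasks such as MMLU; the layer-wise grown model performs stably on complex reasoning tasks (such as SQuAD), with capability correlating with depth.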

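Insight: Frozen embeddings provide a foundation for modular and incremental training, suggesting that AI model development can be more biological or modular, supporting efficient scaling and continual learning.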

Abstract: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal “docking port,” enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is “grown” by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
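
The merging step described in the abstract (averaging the output logits of specialist models) can be illustrated with a minimal sketch. This is not the authors' released code: the function names `average_logits` and `softmax` and the toy logit vectors are hypothetical, and the sketch assumes only that all experts score the same vocabulary in the same order, which a shared frozen embedding would make plausible.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(expert_logits):
    """Element-wise mean of per-expert logit vectors.

    Assumes every expert scores the same vocabulary in the same order
    (hypothetical setup for illustration).
    """
    vocab = len(expert_logits[0])
    if any(len(logits) != vocab for logits in expert_logits):
        raise ValueError("all experts must share one vocabulary size")
    n = len(expert_logits)
    return [sum(logits[i] for logits in expert_logits) / n for i in range(vocab)]


# Toy next-token logits from two hypothetical specialists over a 4-token vocabulary.
expert_a = [2.0, 0.0, -1.0, 0.5]
expert_b = [0.0, 3.0, -1.0, 0.5]

merged = average_logits([expert_a, expert_b])
probs = softmax(merged)
best_token = max(range(len(probs)), key=probs.__getitem__)
```

Because the combination happens only at the output, no expert's weights are touched, which is consistent with the abstract's claim of zero architectural modification.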


[110] Bradley-Terry and Multi-Objective Reward Modeling Are Complementary cs.LG | cs.CL

Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng

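TL;DR: This paper proposes a unified reward modeling framework that jointly trains a single-objective (Bradley-Terry) reward function and a multi-objective regression reward function, improving the robustness and scoring performance of reward models for large language models on out-of-distribution data.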

Details

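Motivation: Existing RLHF methods perform poorly in out-of-distribution (OOD) scenarios, and multi-objective reward functions become a performance bottleneck because of limited data quality. The paper aims to address these problems.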

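Result: Experiments show that the method significantly improves the robustness and scoring performance of reward models, with a 7B model even outperforming a 70B baseline.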

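Insight: The complementarity between the BT loss and the regression objective offers a new perspective on reward modeling; in OOD scenarios in particular, the synergy between multi-attribute scoring and single-objective training is key.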

Abstract: Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley–Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
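
To make the idea of jointly training a Bradley-Terry head and a multi-attribute regression head on a shared embedding concrete, here is a minimal sketch of one possible combined objective. It is an assumption-laden illustration, not the paper's formulation: the linear heads, the mean-squared-error regression term, the weighting factor `lam`, and all names (`bt_loss`, `joint_loss`, `w_bt`, `w_attr`) are hypothetical.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Written as a numerically stable softplus of the negated margin.
    """
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def joint_loss(emb_chosen, emb_rejected, w_bt, w_attr, attr_labels, lam=1.0):
    """BT preference loss plus weighted MSE on attribute scores.

    Both heads read the same embedding (hypothetical linear heads):
    w_bt maps an embedding to a scalar reward, and each row of w_attr
    maps it to one attribute score for the chosen response.
    """
    bt = bt_loss(dot(w_bt, emb_chosen), dot(w_bt, emb_rejected))
    preds = [dot(w, emb_chosen) for w in w_attr]
    mse = sum((p - y) ** 2 for p, y in zip(preds, attr_labels)) / len(preds)
    return bt + lam * mse


# Toy 2-dimensional embeddings for a chosen and a rejected response.
emb_c, emb_r = [1.0, 0.0], [0.0, 1.0]
total = joint_loss(
    emb_c, emb_r,
    w_bt=[2.0, -1.0],
    w_attr=[[1.0, 0.0], [0.0, 1.0]],
    attr_labels=[1.0, 0.5],
)
```

In a real model the embeddings would come from a shared backbone, so gradients from both terms would update the same representation; that shared update is the mechanism through which the two objectives could inform each other.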


[111] GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing cs.LG | cs.CL | cs.CR | I.2.7; I.2.8

Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang

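TL;DR: The paper introduces GuardVal, a dynamic jailbreak evaluation protocol for large language models (LLMs) that generates and refines jailbreak prompts in real time to test model safety more comprehensively.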

Details

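Motivation: Existing jailbreak evaluation methods struggle to fully capture the dynamic nature and complexity of large language models, leaving security vulnerabilities insufficiently assessed.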

Result: Tests on a diverse set of models, from Mistral-7b to GPT-4, reveal the models' behavioral patterns and provide a comprehensive safety assessment.

Insight: Dynamic evaluation reveals models' specific weaknesses, offering direction for future research and safety improvements.

Abstract: Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM’s state, providing a more accurate assessment of defender LLMs’ capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.


[112] Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation cs.LG | cs.AI | cs.CL | cs.CV

Sua Lee, Kyubum Shin, Jung Ho Park

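TL;DR: This paper proposes a new method called DeMul that distills knowledge from large language models (LLMs) directly into prompts, skipping the description-extraction step and thereby improving semantic richness and optimization efficiency. In the multi-prompt setting, it also demonstrates the importance of prompt weighting. Experiments show that the method performs strongly on 11 recognition datasets.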

Details

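Motivation: Existing methods that use LLM-generated descriptions suffer from high variability and low reliability. To overcome these limitations, the authors propose a description-free prompt learning method that distills knowledge directly from the LLM.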

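Result: Experimental results show that DeMul significantly outperforms existing methods on 11 recognition tasks, demonstrating its efficiency and robustness.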

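Insight: 1. Distilling knowledge directly from the LLM avoids the unreliability of description extraction; 2. A multi-prompt weighting mechanism can effectively reflect the importance of different prompts; 3. Continuous vector representations remove the dependence on discrete templates, improving flexibility.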

Abstract: Recent advances in pre-trained Vision Language Models (VLM) have shown promising potential for effectively adapting to downstream tasks through prompt learning, without the need for additional annotated paired datasets. To supplement the text information in VLM trained on correlations with vision data, new approaches leveraging Large Language Models (LLM) in prompts have been proposed, enhancing robustness to unseen and diverse data. Existing methods typically extract text-based responses (i.e., descriptions) from LLM to incorporate into prompts; however, this approach suffers from high variability and low reliability. In this work, we propose Description-free Multi-prompt Learning (DeMul), a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLM into prompts. By adopting a description-free approach, prompts can encapsulate richer semantics while still being represented as continuous vectors for optimization, thereby eliminating the need for discrete pre-defined templates. Additionally, in a multi-prompt setting, we empirically demonstrate the potential of prompt weighting in reflecting the importance of different prompts during training. Experimental results show that our approach achieves superior performance across 11 recognition datasets.