cs.CV [Total: 86]
cs.CL [Total: 26]
eess.AS [Total: 1]
cs.NE [Total: 1]
cs.LG [Total: 4]
eess.IV [Total: 6]
cs.AI [Total: 7]
cs.GR [Total: 1]
cs.SE [Total: 1]
astro-ph.IM [Total: 1]
cs.CY [Total: 2]
cs.RO [Total: 1]
cs.IR [Total: 2]

cs.CV

[1] PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation cs.CV | cs.LG | I.2.10; I.4.8; I.5.4

Zongyou Yang, Jonathan Loo

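TL;DR: This paper proposes PyCAT4, a hierarchical vision Transformer framework for 3D human pose estimation that adds self-attention, temporal feature fusion, and a spatial pyramid structure to improve pose estimation accuracy.

Details

Motivation: Existing human pose estimation methods have made progress but still capture low-level features poorly and model temporal signals weakly. Drawing on the strength of Transformers in temporal analysis, the authors set out to optimize the Pymaf network architecture.

Result: Experiments on the COCO and 3DPW datasets show that PyCAT4 significantly improves pose estimation performance.

Insight: Transformer self-attention performs well for multi-scale feature fusion and temporal modeling; its potential in complex scenes remains to be explored.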

Abstract: Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) Introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) Enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) Implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing differences in feature representations across scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network’s detection capability in human pose estimation, further advancing the development of human pose estimation technology.

[2] DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework cs.CV

Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang

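TL;DR: DreamVVT is a two-stage video virtual try-on framework built on Diffusion Transformers that uses unpaired human-centric data and priors from pretrained models to produce high-quality, temporally consistent try-on results.

Details

Motivation: Current video virtual try-on methods rely on scarce paired datasets and struggle to preserve detail and temporal consistency in unconstrained scenes. DreamVVT aims to address these problems and improve adaptability in real-world scenarios.

Result: Experiments show that DreamVVT outperforms existing methods in preserving garment detail and temporal consistency.

Insight: A stage-wise design combined with pretrained models gives DreamVVT greater flexibility and generation quality in real-world scenes.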

Abstract: Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Project page: https://virtu-lab.github.io/

[3] Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets cs.CV | cs.AI | cs.LG

J. Alex Hurt, Trevor M. Bajkowski, Grant J. Scott, Curt H. Davis

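TL;DR: The paper evaluates Transformers and convolutional neural networks (CNNs) on remote sensing datasets, comparing eleven object detection algorithms (five Transformer-based and six CNN-based) and showing state-of-the-art performance for Transformers on remote sensing imagery.

Details

Motivation: Following the success of Transformers in natural language processing and computer vision, the authors examine how they perform on remote sensing imagery and compare them with conventional CNN models.

Result: Transformers outperform CNNs on some remote sensing tasks and reach state-of-the-art performance.

Insight: Transformers show great potential for remote sensing image processing, especially object detection; despite higher computational cost, their performance advantage keeps them competitive.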

Abstract: In 2012, AlexNet established deep convolutional neural networks (DCNNs) as the state-of-the-art in CV, as these networks soon led in visual tasks for many domains, including remote sensing. With the publication of Visual Transformers, we are witnessing the second modern leap in computational vision, and as such, it is imperative to understand how various transformer-based neural networks perform on satellite imagery. While transformers have shown high levels of performance in natural language processing and CV applications, they have yet to be compared on a large scale to modern remote sensing data. In this paper, we explore the use of transformer-based neural networks for object detection in high-resolution electro-optical satellite imagery, demonstrating state-of-the-art performance on a variety of publicly available benchmark data sets. We compare eleven distinct bounding-box detection and localization algorithms in this study, of which seven were published since 2020, and all eleven since 2015. The performance of five transformer-based architectures is compared with six convolutional networks on three state-of-the-art opensource high-resolution remote sensing imagery datasets ranging in size and complexity. Following the training and evaluation of thirty-three deep neural models, we then discuss and analyze model performance across various feature extraction methodologies and detection algorithms.

[4] VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction cs.CV | cs.CL

Rongxin Jiang, Robert Long, Chenghao Gu, Mingrui Yan

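TL;DR: The paper proposes VisuCraft, a framework that uses structured information extraction and dynamic prompt generation to significantly improve large vision-language models on complex visual-guided content generation.

Details

Motivation: Existing large vision-language models struggle to balance visual fidelity, creativity, and precise adherence to user instructions when generating long-form text. VisuCraft aims to address these problems.

Result: VisuCraft outperforms baseline models on tasks such as story generation and poetry composition, with notable gains in creativity and instruction adherence.

Insight: VisuCraft opens new potential for large vision-language models in complex creative applications and highlights the importance of structured information extraction for model performance.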

Abstract: This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt generation module (G). The extractor distills fine-grained visual attributes from input images into a rich, structured representation, which the dynamic prompt module then combines with user instructions to create highly optimized prompts for underlying LVLMs (e.g., LLaVA, InstructBLIP). Evaluated on the self-constructed ImageStoryGen-500K dataset using VisuGen Metrics (Visual Grounding, Creativity, and Instruction Adherence), VisuCraft consistently outperforms baseline LVLMs across tasks like story generation and poetry composition. Our results demonstrate remarkable improvements, particularly in creativity and instruction adherence, validating VisuCraft’s effectiveness in producing imaginative, visually grounded, and user-aligned long-form creative text. This work unlocks new potential for LVLMs in sophisticated creative AI applications.

[5] RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation cs.CV | 68T07 | I.4.9; I.2.10

Mehrdad Moradi, Kamran Paynabar

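TL;DR: Proposes a robust denoising diffusion probabilistic model (RDDPM) for unsupervised anomaly segmentation when only contaminated unlabeled data (a mix of normal and anomalous samples) is available; it outperforms existing methods.

Details

Motivation: Existing diffusion models usually require normal data for training, whereas real-world settings may offer only contaminated data, which limits their applicability.

Result: On the MVTec datasets, AUROC improves by up to 8.08% and AUPRC by up to 10.37%.

Insight: Robust regression offers a new way to train diffusion models on contaminated data, broadening their range of application.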

Abstract: Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state-of-the-art diffusion models for unsupervised anomaly segmentation when only contaminated data is available. Our method outperforms existing diffusion-based approaches, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets. The implementation code is available at: https://github.com/mehrdadmoradi124/RDDPM
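
The regression view described in the abstract can be illustrated with a small sketch. The abstract does not say which robust loss RDDPM uses, so the Huber loss below is an assumed stand-in, and the noise values are made up for illustration. The sketch only shows why a robust loss limits the influence of a large residual (such as one from an anomalous sample) compared with the usual squared error on predicted noise.

```python
def squared_loss(residual):
    """Standard squared error, as used in the usual DDPM noise-prediction objective."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear for |residual| > delta.

    Assumption: chosen here only as a familiar robust loss; the paper's
    actual robust regression loss is not stated in the abstract.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)


def mean_loss(predicted_noise, true_noise, loss_fn):
    """Average a per-element loss over noise-prediction residuals."""
    residuals = [p - t for p, t in zip(predicted_noise, true_noise)]
    return sum(loss_fn(r) for r in residuals) / len(residuals)


# Hypothetical values: the last prediction is far off, as it might be
# for a pixel belonging to an anomalous region in contaminated data.
true_noise = [0.1, -0.3, 0.2, 0.0]
predicted = [0.0, -0.2, 0.3, 8.0]

mse = mean_loss(predicted, true_noise, squared_loss)
hub = mean_loss(predicted, true_noise, huber_loss)
print(f"squared: {mse:.5f}, huber: {hub:.5f}")
```

With the squared loss the single large residual dominates the average, while under the Huber loss its contribution grows only linearly.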

[6] How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes cs.CV | cs.SD | eess.AS

Mahnoor Fatima Saad, Ziad Al-Halah

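TL;DR: The paper proposes material-controlled multimodal acoustic profile generation for simulating sound in indoor scenes, and introduces a new encoder-decoder model that generates the target acoustic profile from a user-defined material configuration.

Details

Motivation: The work explores how to generate acoustic profiles (such as room impulse responses) dynamically for different material configurations, given the audio-visual characteristics of an indoor scene.

Result: Experiments show that the model effectively encodes material information and generates high-quality RIRs, outperforming several baselines and existing methods.

Insight: The study provides an effective way to simulate indoor acoustic environments dynamically, with possible future applications in virtual reality and acoustic design.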

Abstract: How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene’s key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods.

[7] Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces cs.CV | cs.AI | cs.CL | cs.RO

Vebjørn Haug Kåsene, Pierre Lison

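TL;DR: The paper examines how off-the-shelf large vision-language models (LVLMs) perform on vision-and-language navigation (VLN), compares low-level and panoramic action spaces, and fine-tunes Qwen2.5-VL-3B-Instruct on the R2R dataset.

Details

Motivation: The study asks whether off-the-shelf LVLMs can effectively support VLN tasks and how low-level and panoramic action spaces compare, addressing the underexplored potential of off-the-shelf models and the effect of the action space.

Result: The fine-tuned model reaches a 41% success rate on the R2R test set but still lags behind specialized models.

Insight: The study reveals both the potential and the limitations of off-the-shelf LVLMs on VLN and points to directions for improving their navigation ability.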

Abstract: Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLM systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as “turn left” or “move forward”), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both low-level and panoramic action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task.

[8] X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio cs.CV

Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao

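TL;DR: X-Actor is a novel audio-driven portrait animation framework that generates lifelike, emotionally rich videos from a single reference image and an audio clip, going beyond the short-range lip synchronization of conventional methods.

Details

Motivation: Existing methods focus on short-range lip synchronization and visual fidelity and cannot produce long-form, emotionally rich portrait performances. X-Actor aims to solve this with long-range emotional expression that stays consistent with the rhythm and content of the audio.

Result: Experiments show that X-Actor produces cinematic-style performances beyond standard talking head animation and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

Insight: Learning facial motion decoupled from visual identity captures long-range emotional dynamics effectively, and the progressive generation mechanism of diffusion models suits complex emotion modeling.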

Abstract: We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

[9] Towards Robust Image Denoising with Scale Equivariance cs.CV

Dawei Zhang, Xiaojie Guo

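TL;DR: This paper studies scale equivariance as an inductive bias for making image denoising models more robust to out-of-distribution noise, and proposes a framework with a Heterogeneous Normalization Module (HNM) and an Interactive Gating Module (IGM) that significantly outperforms existing methods.

Details

Motivation: Existing image denoising models generalize poorly to out-of-distribution noise, especially spatially heterogeneous noise, which degrades performance; this problem remains underexplored.

Result: The model significantly outperforms existing methods under spatially heterogeneous noise and is more robust to out-of-distribution noise.

Insight: Scale equivariance helps a model adapt to spatially varying noise, and dynamic normalization and gating are effective ways to improve robustness.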

Abstract: Despite notable advances in image denoising, existing models often struggle to generalize beyond in-distribution noise patterns, particularly when confronted with out-of-distribution (OOD) conditions characterized by spatially variant noise. This generalization gap remains a fundamental yet underexplored challenge. In this work, we investigate scale equivariance as a core inductive bias for improving OOD robustness. We argue that incorporating scale-equivariant structures enables models to better adapt from training on spatially uniform noise to inference on spatially non-uniform degradations. Building on this insight, we propose a robust blind denoising framework equipped with two key components: a Heterogeneous Normalization Module (HNM) and an Interactive Gating Module (IGM). HNM stabilizes feature distributions and dynamically corrects features under varying noise intensities, while IGM facilitates effective information modulation via gated interactions between signal and feature paths. Extensive evaluations demonstrate that our model consistently outperforms state-of-the-art methods on both synthetic and real-world benchmarks, especially under spatially heterogeneous noise. Code will be made publicly available.

[10] Diffusion Models with Adaptive Negative Sampling Without External Resources cs.CV

Alakh Desai, Nuno Vasconcelos

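TL;DR: The paper proposes Adaptive Negative Sampling Without External Resources (ANSWER), which combines negative sampling with classifier-free guidance (CFG) to improve the prompt adherence and generation quality of diffusion models.

Details

Motivation: Diffusion models generate diverse, high-quality images, but their prompt adherence and quality vary widely. Negative prompts are used to improve prompt adherence, yet existing negative prompting carries incomplete information. The authors therefore use the diffusion model's internal understanding of negation to build an adaptive method that needs no external resources.

Result: Experiments show that ANSWER outperforms baselines on multiple benchmarks and is preferred over other methods in human evaluation.

Insight: Optimizing negative sampling through the model's internal mechanisms improves generation quality and prompt adherence without external resources or explicit negative prompts.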

Abstract: Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well-known to vary substantially regarding both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that can maximize the odds of the positive prompt. In this work, we explore relations between negative prompting and classifier-free guidance (CFG) to develop a sampling procedure, Adaptive Negative Sampling Without External Resources (ANSWER), that accounts for both positive and negative conditions from a single prompt. This leverages the internal understanding of negation by the diffusion model to increase the odds of generating images faithful to the prompt. ANSWER is a training-free technique, applicable to any model that supports CFG, and allows for negative grounding of image concepts without explicit negative prompts, which are lossy and incomplete. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more over the other methods.
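
As background for the relation between negative prompting and CFG mentioned in the abstract, the sketch below shows the standard CFG combination of noise predictions and the common way a negative prompt is used in place of the unconditional prediction. It is not ANSWER's sampling procedure, which the abstract does not specify; the function name and the toy noise vectors are hypothetical.

```python
def guided_noise(eps_pos, eps_ref, scale):
    """Classifier-free guidance: eps_ref + scale * (eps_pos - eps_ref).

    eps_pos: noise predicted with the (positive) text prompt.
    eps_ref: noise predicted for the reference condition. This is the
             unconditional prediction in plain CFG, or the negative-prompt
             prediction when negative prompting is used.
    """
    return [r + scale * (p - r) for p, r in zip(eps_pos, eps_ref)]


# Toy three-element noise predictions (made-up numbers).
eps_pos = [0.2, -0.1, 0.4]
eps_uncond = [0.0, 0.0, 0.0]
eps_neg = [0.1, 0.3, -0.2]

standard = guided_noise(eps_pos, eps_uncond, scale=7.5)
negative = guided_noise(eps_pos, eps_neg, scale=7.5)
print(standard)
print(negative)
```

A scale of 1 returns the positive-prompt prediction unchanged, and larger scales push the result further away from the reference prediction.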

[11] Separating Shared and Domain-Specific LoRAs for Multi-Domain Learning cs.CV

Yusaku Takama, Ning Ding, Tatsuya Yokota, Toru Tamaki

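TL;DR: The paper proposes a method for separating shared and domain-specific LoRA (low-rank adaptation) modules in multi-domain learning so that the two lie in different subspaces of the pretrained weights, and validates it experimentally.

Details

Motivation: It is unclear whether the shared and domain-specific LoRA structure in existing multi-domain learning methods truly captures domain-specific information; a more effective separation method is needed.

Result: Experiments show that the method is effective in some cases; the paper also analyzes the dimensions of the LoRA weights.

Insight: Explicitly separating the shared and domain-specific subspaces helps capture domain-specific information in multi-domain learning more effectively.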

Abstract: Existing architectures of multi-domain learning have two types of adapters: shared LoRA for all domains and domain-specific LoRA for each particular domain. However, it remains unclear whether this structure effectively captures domain-specific information. In this paper, we propose a method that ensures that shared and domain-specific LoRAs exist in different subspaces; specifically, the column and left null subspaces of the pre-trained weights. We apply the proposed method to action recognition with three datasets (UCF101, Kinetics400, and HMDB51) and demonstrate its effectiveness in some cases along with the analysis of the dimensions of LoRA weights.
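
The two subspaces named in the abstract can be made concrete with a small linear-algebra sketch. For a weight matrix W, the column space is the span of its columns, and the left null space is its orthogonal complement (all y with W^T y = 0). The code below splits one column of a hypothetical LoRA update into those two components using Gram-Schmidt. How the paper actually enforces the separation, and which LoRA is assigned to which subspace, is not detailed in the abstract, so this is only an illustration of the decomposition itself.

```python
def orthonormal_basis(columns, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given column vectors."""
    basis = []
    for v in columns:
        w = list(v)
        for q in basis:
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = sum(a * a for a in w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([a / norm for a in w])
    return basis


def project(v, basis):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for q in basis:
        dot = sum(a * b for a, b in zip(v, q))
        out = [o + dot * b for o, b in zip(out, q)]
    return out


# Toy 3x2 pretrained weight, stored as a list of its columns (made-up values).
W_columns = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(W_columns)

# One column of a hypothetical LoRA update, split into the two components.
delta_column = [0.5, -1.0, 2.0]
column_part = project(delta_column, basis)
left_null_part = [d - c for d, c in zip(delta_column, column_part)]

print(column_part)     # component inside the column space of W
print(left_null_part)  # component orthogonal to every column of W
```

The two components sum to the original vector and are orthogonal to each other, so updates restricted to one subspace cannot overlap with updates restricted to the other.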

[12] MoExDA: Domain Adaptation for Edge-based Action Recognition cs.CV

Takuya Sugimoto, Ning Ding, Toru Tamaki

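TL;DR: MoExDA is a lightweight domain adaptation method that combines RGB and edge information to mitigate static bias in action recognition, achieving greater robustness at lower computational cost.

Details

Motivation: Modern action recognition models suffer from static bias, which reduces generalization performance; a more efficient remedy is needed.

Result: Experiments show that the method suppresses static bias effectively at lower computational cost and makes action recognition more robust.

Insight: Edge information complements RGB data, mitigating static bias and improving generalization.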

Abstract: Modern action recognition models suffer from static bias, leading to reduced generalization performance. In this paper, we propose MoExDA, a lightweight domain adaptation between RGB and edge information using edge frames in addition to RGB frames to counter the static bias issue. Experiments demonstrate that the proposed method effectively suppresses static bias with a lower computational cost, allowing for more robust action recognition than previous approaches.

[13] Adversarial Attention Perturbations for Large Object Detection Transformers cs.CV

Zachary Yahn, Selim Furkan Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang

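TL;DR: The paper proposes the Attention-Focused Offensive Gradient (AFOG) attack on large Transformer-based object detectors, which uses an attention mechanism to focus perturbations on vulnerable regions in multi-box detection tasks and significantly improves attack performance and stealth.

Details

Motivation: Existing adversarial attacks mainly target CNN detectors and are weak against Transformer-based detectors. The paper aims to fill this gap with a general attack that applies to detectors of different architectures.

Result: Experiments on COCO show that AFOG outperforms existing attacks by up to 83%, with less perceptible perturbations.

Insight: Attention mechanisms can effectively expose the vulnerabilities of Transformer detectors, offering a new approach to adversarial attack design.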

Abstract: Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG’s attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at https://github.com/zacharyyahn/AFOG.

[14] Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models cs.CV

Fan Yang, Yihao Huang, Jiayi Zhu, Ling Shi, Geguang Pu

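TL;DR: The paper proposes In-Generation Detection (IGD), a new method that detects NSFW content during the diffusion generation process by using the predicted noise as an internal signal, significantly improving detection accuracy.

Details

Motivation: Diffusion-based text-to-image generation is powerful but can be misused to produce not-safe-for-work (NSFW) content. Existing methods usually detect before or after generation, while detection during generation remains largely unexplored.

Result: Experiments on seven NSFW categories show that IGD reaches an average detection accuracy of 91.32%, outperforming seven baseline methods.

Insight: Diffusion models carry signals useful for content moderation during generation, not only at the input or output stage; the approach also holds up against adversarial prompts.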

Abstract: Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.

[15] Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation cs.CV

Xinhui Li, Xiaojie Guo

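TL;DR: This paper proposes Multi-Granularity Feature Calibration (MGFC), a framework that calibrates vision foundation model (VFM) features from coarse to fine to improve domain generalized semantic segmentation (DGSS).

Details

Motivation: Existing DGSS methods concentrate on global feature fine-tuning and overlook adaptation across feature levels, which is crucial for dense prediction tasks.

Result: MGFC outperforms existing DGSS methods on multiple benchmark datasets, confirming the effectiveness of multi-granularity adaptation.

Insight: Multi-granularity feature calibration makes fuller use of the generalization ability of VFMs and improves cross-domain semantic segmentation.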

Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to improve the generalization ability of models across unseen domains without access to target data during training. Recent advances in DGSS have increasingly exploited vision foundation models (VFMs) via parameter-efficient fine-tuning strategies. However, most existing approaches concentrate on global feature fine-tuning, while overlooking hierarchical adaptation across feature levels, which is crucial for precise dense prediction. In this paper, we propose Multi-Granularity Feature Calibration (MGFC), a novel framework that performs coarse-to-fine alignment of VFM features to enhance robustness under domain shifts. Specifically, MGFC first calibrates coarse-grained features to capture global contextual semantics and scene-level structure. Then, it refines medium-grained features by promoting category-level feature discriminability. Finally, fine-grained features are calibrated through high-frequency spatial detail enhancement. By performing hierarchical and granularity-aware calibration, MGFC effectively transfers the generalization strengths of VFMs to the domain-specific task of DGSS. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art DGSS approaches, highlighting the effectiveness of multi-granularity adaptation for the semantic segmentation task of domain generalization.

[16] Enhancing Long Video Question Answering with Scene-Localized Frame Grouping cs.CV | cs.AI

Xuyi Yang, Wenhao Zhang, Hongbo Jin, Lin Liu, Hongbo Xu

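TL;DR: The paper proposes SLFG (Scene-Localized Frame Grouping), a method that improves long video question answering by grouping semantically related frames into scene frames. It requires no changes to the architecture of existing multimodal large language models (MLLMs) and performs well on several long video benchmarks.

Details

Motivation: Existing MLLMs perform poorly on long video understanding, mainly because resource limits prevent them from processing all video frames and their associated information. Existing methods focus on picking specific frames out of many irrelevant ones, which does not match real-world needs.

Result: Experiments show that SLFG performs well on several long video benchmarks.

Insight: Inspired by human cognition, SLFG uses scene-based frame grouping to address information redundancy in long video understanding, giving MLLMs an efficient approach to long video tasks.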

Abstract: Current Multimodal Large Language Models (MLLMs) often perform poorly in long video understanding, primarily due to resource limitations that prevent them from processing all video frames and their associated information. Efficiently extracting relevant information becomes a challenging task. Existing frameworks and evaluation tasks focus on identifying specific frames containing core objects from a large number of irrelevant frames, which does not align with the practical needs of real-world applications. To address this issue, we propose a new scenario under the video question-answering task, SceneQA, which emphasizes scene-based detail perception and reasoning abilities. We also develop the LVSQA dataset to support the SceneQA task, which is built upon carefully selected videos from LVBench and contains a new collection of question-answer pairs to promote a fairer evaluation of MLLMs’ scene perception abilities in long videos. Inspired by human cognition, we introduce a novel method called SLFG. The core idea of SLFG is to combine individual frames into semantically coherent scene frames. By leveraging scene localization methods and dynamic frame reassembly mechanisms, SLFG significantly enhances the understanding capabilities of existing MLLMs in long videos. SLFG requires no modification to the original model architecture and boasts excellent plug-and-play usability. Experimental results show that this method performs exceptionally well in several long video benchmark tests. Code and dataset will be released at http://www.slfg.pkuzwh.cn.

[17] SA-3DGS: A Self-Adaptive Compression Method for 3D Gaussian Splatting cs.CV

Liheng Zhang, Weihao Yu, Zubo Lu, Haozhi Gu, Jin Huang

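TL;DR: SA-3DGS is a self-adaptive compression method that significantly reduces the storage cost of 3D Gaussian Splatting while maintaining rendering quality. It learns importance scores and uses a clustering module to remove redundant Gaussian points, then recovers information with a codebook repair module.

Details

Motivation: Current 3D Gaussian Splatting methods need a large number of Gaussian points, which raises storage demands, and they struggle to identify unimportant points, which hurts compression and rendering performance. A self-adaptive compression method is needed to cut storage cost while preserving quality.

Result: Achieves up to 66x compression on several benchmark datasets while maintaining or improving rendering quality, and improves other pruning-based methods such as LightGaussian.

Insight: Adaptive techniques such as learned importance scores identify and compress redundant information effectively; codebook repair mitigates the effect of information loss on rendering quality.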

Abstract: Recent advancements in 3D Gaussian Splatting have enhanced efficient and high-quality novel view synthesis. However, representing scenes requires a large number of Gaussian points, leading to high storage demands and limiting practical deployment. The latest methods facilitate the compression of Gaussian models but struggle to identify truly insignificant Gaussian points in the scene, leading to a decline in subsequent Gaussian pruning, compression quality, and rendering performance. To address this issue, we propose SA-3DGS, a method that significantly reduces storage costs while maintaining rendering quality. SA-3DGS learns an importance score to automatically identify the least significant Gaussians in scene reconstruction, thereby enabling effective pruning and redundancy reduction. Next, the importance-aware clustering module compresses Gaussian attributes more accurately into the codebook, improving the codebook’s expressive capability while reducing model size. Finally, the codebook repair module leverages contextual scene information to repair the codebook, thereby recovering the original Gaussian point attributes and mitigating the degradation in rendering quality caused by information loss. Experimental results on several benchmark datasets show that our method achieves up to 66x compression while maintaining or even improving rendering quality. The proposed Gaussian pruning approach is not only adaptable to but also improves other pruning-based methods (e.g., LightGaussian), showcasing excellent performance and strong generalization ability.
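
The first two steps described in the abstract (pruning by importance, then storing attributes through a codebook) can be sketched in a few lines. Everything here is a simplification under stated assumptions: the importance scores are fixed numbers rather than learned, the codebook is given rather than produced by the paper's importance-aware clustering, and the codebook repair step is omitted. The function names and data are hypothetical.

```python
def prune_by_importance(gaussians, keep_ratio):
    """Keep the highest-scoring fraction of Gaussians (at least one)."""
    ranked = sorted(gaussians, key=lambda g: g["importance"], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Toy scene: each Gaussian has an importance score and a color attribute.
gaussians = [
    {"importance": 0.9, "color": [1.0, 0.0, 0.0]},
    {"importance": 0.1, "color": [0.9, 0.1, 0.0]},
    {"importance": 0.7, "color": [0.0, 0.0, 1.0]},
    {"importance": 0.05, "color": [0.1, 0.0, 0.9]},
]
# Assumed two-entry codebook; a real one would come from clustering.
codebook = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

kept = prune_by_importance(gaussians, keep_ratio=0.5)
# Each kept Gaussian now stores a small integer index instead of a full vector.
indices = [nearest_code(g["color"], codebook) for g in kept]
print(len(kept), indices)
```

Storage shrinks twice in this toy setup: fewer Gaussians survive pruning, and each survivor stores a codebook index in place of its full attribute vector.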

[18] MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention cs.CV

Qi Xie, Yongjia Ma, Donglin Di, Xuehao Gao, Xun Yang

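TL;DR: MoCA is a new video diffusion model built on a Diffusion Transformer (DiT) that uses a Mixture of Cross-Attention mechanism to improve identity consistency in text-to-video generation, with Hierarchical Temporal Pooling and Temporal-Aware Cross-Attention Experts that dynamically model spatiotemporal relationships.

Details

Motivation: Existing text-to-video methods fall short in capturing fine-grained facial dynamics and maintaining temporal identity consistency; a model that better preserves identity is needed.

Result: On the CelebIPVid dataset, MoCA outperforms existing methods by over 5% on face similarity.

Insight: The mixture-of-attention mechanism and hierarchical temporal pooling improve identity consistency in video generation, and the latent perceptual loss further improves detail.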

Abstract: Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% in face similarity.

[19] VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering cs.CV | cs.MM

Yiran Meng, Junhong Ye, Wei Zhou, Guanghui Yue, Xudong Mao

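TL;DR: VideoForest is a person-anchored hierarchical reasoning framework for cross-video question answering that uses person-level features as a bridge between video streams and significantly improves cross-video reasoning.

Details

Motivation: Conventional single-video understanding cannot handle the complexity of cross-video question answering, especially multi-source information retrieval and linking content across videos; a new approach to efficient cross-video reasoning is needed.

Result: Experiments show that VideoForest reaches 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods.

Insight: Person-level features are a natural bridge for cross-video understanding, a hierarchical structure organizes complex information efficiently, and multi-agent reasoning handles cross-video questions flexibly.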

Abstract: Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest’s superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.

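[20] Multi-human Interactive Talking Dataset cs.CV

Zeyu Zhu, Weijia Wu, Mike Zheng Shou

TL;DR: The paper introduces MIT, a large-scale dataset designed for multi-human talking video generation, and develops a baseline model, CovOG, demonstrating the feasibility and challenges of generating multi-human interactive videos.

Details

Motivation: Existing work focuses mainly on single-person monologues or isolated facial animation and does not apply well to multi-human interaction. The authors introduce the MIT dataset to fill this gap.

Result: The MIT dataset and the CovOG model provide a benchmark for multi-human interactive video generation and show both the feasibility and the challenges of the task.

Insight: The paper reveals the complexity of multi-human interactive video generation, underscores the importance of modeling natural conversational dynamics, and provides an important resource for future research.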

Abstract: Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.

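[21] Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation cs.CV | cs.AI | I.4.8

Hyebin Cho, Jaehyup Lee

TL;DR: The paper proposes FaceMat, a trimap-free, uncertainty-aware face matting framework that improves matting under facial occlusion, using two-stage training and adaptive knowledge distillation to improve model performance.

Details

Motivation: Face filters in short-form video degrade under occlusion, and existing methods rely on trimaps or segmentation masks, which makes them hard to use in real-time applications.

Result: FaceMat outperforms existing methods on multiple benchmarks, improving the visual quality and robustness of face filters in real-world video.

Insight: Uncertainty-guided training helps the model focus on ambiguous or occluded regions, which improves semantic consistency; an explicit foreground-background definition improves compositing.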

Abstract: Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we construct CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
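
The teacher's joint estimate of alpha and per-pixel uncertainty can be illustrated with a small sketch. The abstract only says an NLL loss is used; the Gaussian form below, the log-variance parameterization, and the way the distillation term is weighted by teacher uncertainty are assumptions made for illustration, and both function names are hypothetical.

```python
import math


def gaussian_nll(pred_alpha, pred_log_var, target_alpha):
    """Mean per-pixel Gaussian negative log-likelihood (constant term dropped).

    Assumes each pixel's alpha is modeled as a Gaussian whose mean is
    pred_alpha and whose variance is exp(pred_log_var).
    """
    total = 0.0
    for a, s, t in zip(pred_alpha, pred_log_var, target_alpha):
        total += 0.5 * (s + (a - t) ** 2 / math.exp(s))
    return total / len(pred_alpha)


def uncertainty_weighted_distillation(student_alpha, teacher_alpha, teacher_log_var):
    """L1 distillation where pixels the teacher is unsure about weigh more.

    The weighting scheme (variance normalized to sum to 1) is an assumption,
    chosen so the student focuses on ambiguous regions.
    """
    weights = [math.exp(s) for s in teacher_log_var]
    norm = sum(weights)
    return sum(
        w / norm * abs(sa - ta)
        for w, sa, ta in zip(weights, student_alpha, teacher_alpha)
    )


# A confident wrong prediction costs more than a hedged wrong prediction.
confident_wrong = gaussian_nll([1.0], [-2.0], [0.0])
hedged_wrong = gaussian_nll([1.0], [0.0], [0.0])
```

Under this form, the model can lower its loss on hard pixels by raising the predicted variance, which is what makes the variance usable as an uncertainty map.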

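[22] CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation cs.CV

Lekang Wen, Jing Xiao, Liang Liao, Jiajun Chen, Mi Wang

TL;DR: CHARM is a new complementary learning framework that achieves collaborative harmonization across modalities through implicit alignment and modality-specific optimization, significantly improving modality-agnostic semantic segmentation.

Details

Motivation: Existing methods typically rely on explicit feature alignment to homogenize modalities, which dilutes each modality's distinctive strengths and destroys their complementarity. CHARM aims for collaborative harmonization rather than homogenization, preserving modality-specific advantages while exploiting complementarity.

Result: Across multiple datasets and backbones, CHARM outperforms the baselines, with especially large gains on fragile modalities.

Insight: CHARM shifts the focus from homogenization to harmonization, emphasizing the importance of preserving modality specificity and complementarity and offering a new direction for cross-modal tasks.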

Abstract: Modality-agnostic Semantic Segmentation (MaSS) aims to achieve robust scene understanding across arbitrary combinations of input modalities. Existing methods typically rely on explicit feature alignment to achieve modal homogenization, which dilutes the distinctive strengths of each modality and destroys their inherent complementarity. To achieve cooperative harmonization rather than homogenization, we propose CHARM, a novel complementary learning framework designed to implicitly align content while preserving modality-specific advantages through two components: (1) Mutual Perception Unit (MPU), enabling implicit alignment through window-based cross-modal interaction, where modalities serve as both queries and contexts for each other to discover modality-interactive correspondences; (2) A dual-path optimization strategy that decouples training into Collaborative Learning Strategy (CoL) for complementary fusion learning and Individual Enhancement Strategy (InE) for protected modality-specific optimization. Experiments across multiple datasets and backbones indicate that CHARM consistently outperforms the baselines, with significant gains on the fragile modalities. This work shifts the focus from model homogenization to harmonization, enabling cross-modal complementarity for true harmony in diversity.

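[23] Exploring Fairness across Fine-Grained Attributes in Large Vision-Language Models cs.CV

Zaiying Zhao, Toshihiko Yamasaki

TL;DR: The paper studies the fairness of large vision-language models (LVLMs) across fine-grained attributes. By building an open-set knowledge base of bias attributes, it shows that LVLMs produce biased outputs across many attributes and finds that cultural, environmental, and behavioral factors influence model decisions more than traditional demographic attributes.

Details

Motivation: As applications of LVLMs such as GPT-4o expand rapidly, their fairness has drawn wide concern. Existing studies focus mainly on demographic attributes such as race and gender, while fairness across a broader range of attributes remains underexplored.

Result: Experiments show that LVLMs exhibit biased outputs across a variety of fine-grained attributes, and that cultural, environmental, and behavioral factors affect model decisions more strongly than traditional demographic attributes.

Insight: Fairness problems in LVLMs are not limited to traditional demographic attributes. Cultural, environmental, and behavioral factors matter more for model decisions, which points to a new direction for fairness research.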

Abstract: The rapid expansion of applications using Large Vision-Language Models (LVLMs), such as GPT-4o, has raised significant concerns about their fairness. While existing studies primarily focus on demographic attributes such as race and gender, fairness across a broader range of attributes remains largely unexplored. In this study, we construct an open-set knowledge base of bias attributes leveraging Large Language Models (LLMs) and evaluate the fairness of LVLMs across finer-grained attributes. Our experimental results reveal that LVLMs exhibit biased outputs across a diverse set of attributes and further demonstrate that cultural, environmental, and behavioral factors have a more pronounced impact on LVLM decision-making than traditional demographic attributes.

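[24] Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts cs.CV

Jiantao Tan, Peixian Ma, Kanghao Chen, Zhiming Dai, Ruixuan Wang

TL;DR: The paper proposes a new framework that uses visual concepts generated by large language models (LLMs) to enhance continual learning of diseases, improving classification performance with a cross-modal attention module and a filtering mechanism.

Details

Motivation: Existing methods for continual learning in medical image classification rely only on simple text templates and ignore richer semantic information, which limits performance.

Result: The method achieves state-of-the-art performance on medical and natural image datasets.

Insight: LLM-generated visual concepts can serve as effective semantic guidance and significantly improve classification in continual learning.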

Abstract: Continual learning is essential for medical image classification systems to adapt to dynamically evolving clinical environments. The integration of multimodal information can significantly enhance continual learning of image classes. However, while existing approaches do utilize textual modality information, they solely rely on simplistic templates with a class name, thereby neglecting richer semantic information. To address these limitations, we propose a novel framework that harnesses visual concepts generated by large language models (LLMs) as discriminative semantic guidance. Our method dynamically constructs a visual concept pool with a similarity-based filtering mechanism to prevent redundancy. Then, to integrate the concepts into the continual learning process, we employ a cross-modal image-concept attention module, coupled with an attention loss. Through attention, the module can leverage the semantic knowledge from relevant visual concepts and produce class-representative fused features for classification. Experiments on medical and natural image datasets show our method achieves state-of-the-art performance, demonstrating the effectiveness and superiority of our method. We will release the code publicly.

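[25] AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video cs.CV

Yogesh Kulkarni, Pooyan Fazli

TL;DR: The paper proposes the AVATAR framework, which substantially improves multimodal video reasoning by improving data efficiency and credit assignment.

Details

Motivation: Multimodal video reasoning requires precise spatiotemporal fusion and cross-modal alignment. Existing methods such as GRPO suffer from low data efficiency, vanishing advantages, and uniform credit assignment that does not emphasize critical reasoning steps.

Result: AVATAR outperforms the Qwen2.5-Omni baseline on multiple benchmarks, with over 35% higher sample efficiency.

Insight: Sample reuse and differentiated credit assignment are crucial for multimodal video reasoning.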

Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency.
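
The vanishing advantage problem mentioned in the abstract can be seen in a few lines. The sketch below uses the commonly cited group-relative normalization, in which each reward is centered on the group mean and divided by the group standard deviation; the use of the population standard deviation and the `eps` value are assumptions, and the sketch does not reproduce AVATAR's off-policy training or Temporal Advantage Shaping.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group of responses.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All responses in the group earn the same reward: every advantage is zero,
# so this group contributes no gradient signal.
identical = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Diverse rewards give non-zero advantages with opposite signs.
diverse = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

This is why reusing past experiences with more varied rewards, as the abstract describes, can restore a learning signal for prompts whose fresh samples all score alike.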

Tianjiao Jiang, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi

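TL;DR: The paper proposes the Causal CLIP Adapter (CCA), a new framework that explicitly disentangles visual features extracted by CLIP using unsupervised Independent Component Analysis (ICA), reducing the number of trainable parameters and mitigating overfitting. CCA also strengthens cross-modal alignment both unidirectionally and bidirectionally, significantly improving few-shot learning performance.

Details

Motivation: Existing few-shot learning methods depend on representations that must be disentangled implicitly, which works poorly under limited supervision. Building on theoretical advances in multimodal contrastive learning (e.g., CLIP), the paper proposes to disentangle visual features explicitly.

Result: Experiments on 11 benchmark datasets show that CCA outperforms existing methods in few-shot performance and robustness to distribution shift while remaining computationally efficient.

Insight: Combining explicit disentanglement with cross-modal alignment is key to improving few-shot learning while retaining the advantages of CLIP pretraining.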

Abstract: Few-shot learning (FSL) often requires effective adaptation of models using limited labeled data. However, most existing FSL methods rely on entangled representations, requiring the model to implicitly recover the unmixing process to obtain disentangled representations using only limited supervision, which hinders effective adaptation. Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). This removes the need to learn the unmixing process from the labeled data, thereby reducing the number of trainable parameters and mitigating overfitting. Taking a step further, while ICA can obtain visual disentangled representations, it may also disrupt CLIP’s intra- and inter-modal alignment. To counteract this, CCA further leverages CLIP’s inherent cross-modal alignment by enhancing it in two ways: unidirectionally, through fine-tuning a CLIP-based text classifier, and bidirectionally, via a cross-attention mechanism that enriches visual and textual representations through mutual interaction. Both unimodal and cross-modal classification outputs can be effectively combined linearly to improve classification accuracy. Extensive experiments on 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts, while maintaining computational efficiency. Code will be available at https://github.com/tianjiao-j/CCA.

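[27] H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction cs.CV

Heng Jia, Linchao Zhu, Na Zhao

TL;DR: H3R is a hybrid framework that combines explicit geometric constraints with implicit feature aggregation to resolve the trade-off between precision and robustness in multi-view correspondence modeling, yielding faster convergence and cross-dataset generalization.

Details

Motivation: In 3D reconstruction, explicit methods are precise but not robust, while implicit methods are robust but converge slowly. A solution that combines the strengths of both is needed.

Result: PSNR improves by 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively, with 2× faster convergence.

Insight: Spatially aligned foundation models (e.g., SD-VAE) suit 3D reconstruction better than semantically aligned models (e.g., DINOv2), which resolves the mismatch between semantic representations and spatial requirements.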

Abstract: Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2× faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
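
For readers unfamiliar with Plücker coordinates, the following sketch shows how a camera ray is commonly encoded as a 6-vector. It assumes the (direction, moment) ordering with moment = origin × direction; some works use the opposite ordering, and the abstract does not say which convention H3R follows. The function name `plucker_ray` is hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Encode a ray as (unit direction, moment), with moment = origin x direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)
    m = cross(origin, d)
    return d + m


# The encoding depends only on the line, not on which point along it is used.
ray_a = plucker_ray((0.0, 0.0, 1.0), (2.0, 0.0, 0.0))
ray_b = plucker_ray((5.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

Because the encoding is independent of the chosen point on the ray, it gives each pixel a camera-aware descriptor that a Transformer can consume directly.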

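[28] Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery cs.CV | cs.AI

Sai Ma, Zhuang Li, John A Taylor

TL;DR: Landsat30-AU is a large-scale vision-language dataset of Australian Landsat satellite imagery that fills the gap in long-term, low-resolution, multi-satellite archive data. Its evaluation of existing VLMs shows their limitations in understanding satellite imagery.

Details

Motivation: Existing vision-language datasets focus mainly on short-term, high-resolution imagery and overlook long-term, low-resolution, multi-satellite archives such as Landsat, which are essential for global monitoring.

Result: Existing VLMs perform poorly on Landsat30-AU (EarthDial reaches only 0.07 SPIDEr), but lightweight fine-tuning improves performance substantially (Qwen2.5-VL-7B rises from 0.11 to 0.31 SPIDEr).

Insight: Long-term, low-resolution, multi-satellite archives are essential for global monitoring, yet existing VLMs still perform poorly on such data. Lightweight fine-tuning is an effective remedy.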

Abstract: Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.

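[29] Uint: Building Uint Detection Dataset cs.CV

Haozhou Zhai, Yanzhe Gao, Tianjiang Hu

TL;DR: The paper presents a building unit fire detection dataset, captured by drones and built with multiple augmentation techniques, addressing the lack of annotated building-unit data in existing fire datasets.

Details

Motivation: Existing fire-related data lacks annotations specific to building units, which hampers the training of computer vision models, especially for fire early warning and emergency rescue.

Result: The synthetic dataset contains 1,978 images covering a variety of building scenarios. It can improve the generalization of fire unit detection while reducing the risk and cost of collecting real fire data.

Insight: Combining synthetic data with multiple augmentation techniques produces high-quality training data without large amounts of real fire data, offering a new approach to fire detection.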

Abstract: Fire scene datasets are crucial for training robust computer vision models, particularly in tasks such as fire early warning and emergency rescue operations. However, among the currently available fire-related data, there is a significant shortage of annotated data specifically targeting building units. To tackle this issue, we introduce an annotated dataset of building units captured by drones, which incorporates multiple enhancement techniques. We construct backgrounds using real multi-story scenes, combine motion blur and brightness adjustment to enhance the authenticity of the captured images, simulate drone shooting conditions under various circumstances, and employ large models to generate fire effects at different locations. The synthetic dataset generated by this method encompasses a wide range of building scenarios, with a total of 1,978 images. This dataset can effectively improve the generalization ability of fire unit detection, providing multi-scenario and scalable data while reducing the risks and costs associated with collecting real fire data. The dataset is available at https://github.com/boilermakerr/FireUnitData.
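
The two image augmentations named in the abstract, motion blur and brightness adjustment, can be sketched on a tiny grayscale image stored as nested lists. The horizontal box kernel, the odd `kernel_size`, and the multiplicative brightness `factor` are illustrative assumptions; the authors' actual augmentation parameters are not given in the abstract.

```python
def adjust_brightness(image, factor):
    """Scale every pixel by factor and clamp the result to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def horizontal_motion_blur(image, kernel_size):
    """Average each pixel with its horizontal neighbours (box kernel).

    kernel_size is assumed to be odd; windows are truncated at the borders.
    """
    half = kernel_size // 2
    blurred_image = []
    for row in image:
        blurred_row = []
        for x in range(len(row)):
            window = row[max(0, x - half):min(len(row), x + half + 1)]
            blurred_row.append(round(sum(window) / len(window)))
        blurred_image.append(blurred_row)
    return blurred_image


# A single bright pixel is smeared sideways, as it would be by camera motion.
streak = horizontal_motion_blur([[0, 0, 90, 0, 0]], kernel_size=3)
brighter = adjust_brightness([[100, 200]], factor=1.5)
```

In practice such transforms are applied with randomized parameters so that one background scene yields many plausible drone captures.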

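[30] UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying cs.CV

Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She

TL;DR: UniEdit-I is a training-free image editing framework that gives unified vision-language models (VLMs) image editing capability through iterative understanding, editing, and verifying steps.

Details

Motivation: Existing unified VLMs perform well on generation tasks but lack an easy way to edit images. The paper proposes a training-free framework to address this.

Result: The method achieves state-of-the-art performance on the GEdit-Bench benchmark.

Insight: Iterative automatic verification and feedback enable high-quality image editing without additional training, suggesting a new way to extend the applications of unified VLMs.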

Abstract: In recent years, unified vision-language models (VLMs) have rapidly advanced, effectively tackling both visual understanding and generation tasks within a single design. While many unified VLMs have explored various design choices, the recent hypothesis from OpenAI’s GPT-4o suggests a promising generation pipeline: Understanding VLM->Visual Feature->Projector->Diffusion Model->Image. The understanding VLM is frozen, and only the generation-related modules are trained. This pipeline maintains the strong capability of understanding VLM while enabling the image generation ability of the unified VLM. Although this pipeline has shown very promising potential for the future development of unified VLM, how to easily enable image editing capability is still unexplored. In this paper, we introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability via three iterative steps: understanding, editing, and verifying. 1. The understanding step analyzes the source image to create a source prompt through structured semantic analysis and makes minimal word replacements to form the target prompt based on the editing instruction. 2. The editing step introduces a time-adaptive offset, allowing for coherent editing from coarse to fine throughout the denoising process. 3. The verification step checks the alignment between the target prompt and the intermediate edited image, provides automatic consistency scores and corrective feedback, and determines whether to stop early or continue the editing loop. This understanding, editing, and verifying loop iterates until convergence, delivering high-fidelity editing in a training-free manner. We implemented our method based on the latest BLIP3-o and achieved state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
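
The understand, edit, verify loop described in the abstract can be sketched as plain control flow. The three callables, the `threshold` on the consistency score, and `max_rounds` are hypothetical placeholders; in the paper the editing step operates inside the denoising process, which this simplified sketch does not model.

```python
def iterative_edit(image, instruction, understand, edit, verify,
                   max_rounds=5, threshold=0.9):
    """Run an understand-once, then edit-and-verify loop (illustrative only).

    understand(image, instruction) -> (source_prompt, target_prompt)
    edit(image, source_prompt, target_prompt, feedback) -> edited image
    verify(image, target_prompt) -> (consistency score, corrective feedback)
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    source_prompt, target_prompt = understand(image, instruction)
    current, feedback, rounds = image, None, 0
    for rounds in range(1, max_rounds + 1):
        current = edit(current, source_prompt, target_prompt, feedback)
        score, feedback = verify(current, target_prompt)
        if score >= threshold:
            break  # stop early once the edit matches the target prompt
    return current, rounds


# Toy stand-ins: the "image" is a number that each edit nudges toward 3.
feedback_log = []

def toy_understand(image, instruction):
    return ("a red car", "a blue car")

def toy_edit(image, source_prompt, target_prompt, feedback):
    feedback_log.append(feedback)
    return image + 1

def toy_verify(image, target_prompt):
    return (image / 3, "more blue")

result, rounds_used = iterative_edit(
    0, "make the car blue", toy_understand, toy_edit, toy_verify
)
```

Passing the verifier's feedback back into the next edit call is what distinguishes this loop from simply re-running the editor several times.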

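[31] ChartCap: Mitigating Hallucination of Dense Chart Captioning cs.CV | cs.AI | cs.CL

Junyoung Lim, Jaewoo Ahn, Gunhee Kim

TL;DR: The paper introduces the ChartCap dataset to address hallucination in dense chart captioning, improving caption quality with a four-stage pipeline and a novel Visual Consistency Score metric.

Details

Motivation: Existing real-world chart datasets contain extraneous information that cannot be inferred from the chart and fail to capture structural elements and key insights, so generated chart captions are inaccurate and prone to hallucination.

Result: Experiments show that models fine-tuned on ChartCap produce more accurate and more informative captions with fewer hallucinations, surpassing open-source and proprietary models as well as human-annotated captions.

Insight: 1. Excluding extraneous information and highlighting structure and key insights is key to caption quality; 2. Cycle consistency-based verification speeds up quality control; 3. The Visual Consistency Score (VCS) evaluates caption quality on its own, avoiding the traditional reliance on reference captions.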

Abstract: Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. However, existing real-world chart datasets suffer from the inclusion of extraneous information that cannot be inferred from the chart and failure to sufficiently capture structural elements and key insights. Therefore, we introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the discernible data from the chart and employ a cycle consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the chart regenerated from a caption and the original chart, independent of reference captions. Extensive experiments confirm that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing both open-source and proprietary models and even human-annotated captions.

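[32] SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision cs.CV

Zhaoxu Li, Chenqi Kong, Yi Yu, Qiangqiang Wu, Xinghao Jiang

TL;DR: The paper proposes SAVER, a style-aware visual early revision mechanism that reduces the hallucinations large vision-language models produce on stylized images.

Details

Motivation: Large vision-language models (LVLMs) have made breakthroughs in understanding complex visual-textual contexts, but hallucinations limit their real-world use. Existing mitigation methods target photographic images and overlook the risks posed by stylized images.

Result: Experiments show that SAVER achieves state-of-the-art hallucination mitigation across various models, datasets, and tasks.

Insight: Stylized images induce hallucinations in LVLMs more readily than photographs, and revising outputs based on early visual attention significantly reduces the problem.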

Abstract: Large Vision-Language Models (LVLMs) have recently achieved significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision (SAVER), a novel mechanism that dynamically adjusts LVLMs’ final outputs based on the token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.

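[33] Advancing Precision in Multi-Point Cloud Fusion Environments cs.CV | cs.GR

Ulugbek Alibekov, Vanessa Staderini, Philipp Schneider, Doris Antensteiner

TL;DR: The paper studies point cloud matching methods for industrial visual inspection, introducing a synthetic dataset and a new CloudCompare plugin to improve the accuracy and efficiency of multi-point cloud fusion.

Details

Motivation: In industrial inspection, the accuracy and efficiency of point cloud matching and surface defect detection are critical. Existing methods fall short in multi-point cloud fusion and need improvement.

Result: The new approach improves the accuracy and efficiency of point cloud matching, and the plugin offers practical support for industrial inspection.

Insight: Combining a synthetic dataset with a multi-point cloud fusion tool offers a more reliable solution for industrial visual inspection.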

Abstract: This research focuses on visual industrial inspection by evaluating point clouds and multi-point cloud matching methods. We also introduce a synthetic dataset for quantitative evaluation of registration methods and various distance metrics for point cloud comparison. Additionally, we present a novel CloudCompare plugin for merging multiple point clouds and visualizing surface defects, enhancing the accuracy and efficiency of automated inspection systems.

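[34] Neovascularization Segmentation via a Multilateral Interaction-Enhanced Graph Convolutional Network cs.CV

Tao Chen, Dan Zhang, Da Chen, Huazhu Fu, Kai Jin

TL;DR: The paper proposes MTG-Net, a multilateral interaction-enhanced graph convolutional network for choroidal neovascularization (CNV) segmentation, and releases CNVSeg, the first public CNV dataset. With a multi-task framework and two graph reasoning modules (MIGR and MRGR), the method captures the geometric features of lesion shapes and surfaces, and experiments show it outperforms existing methods.

Details

Motivation: Choroidal neovascularization (CNV) is a primary characteristic of wet age-related macular degeneration (wet AMD), and accurate segmentation of CNV regions and vessels is crucial for clinical assessment. Existing methods face irregular shapes, projection artifacts, and noise, and no public dataset has been available.

Result: MTG-Net reaches Dice scores of 87.21% for region segmentation and 88.12% for vessel segmentation, outperforming existing methods.

Insight: Combining a graph mechanism with multi-task learning effectively captures lesion geometry and improves segmentation. The public dataset is an important resource for future research.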

Abstract: Choroidal neovascularization (CNV), a primary characteristic of wet age-related macular degeneration (wet AMD), represents a leading cause of blindness worldwide. In clinical practice, optical coherence tomography angiography (OCTA) is commonly used for studying CNV-related pathological changes, due to its micron-level resolution and non-invasive nature. Thus, accurate segmentation of CNV regions and vessels in OCTA images is crucial for clinical assessment of wet AMD. However, challenges exist due to irregular CNV shapes and imaging limitations like projection artifacts, noise, and boundary blurring. Moreover, the lack of publicly available datasets constrains CNV analysis. To address these challenges, this paper constructs the first publicly accessible CNV dataset (CNVSeg), and proposes a novel multilateral graph convolutional interaction-enhanced CNV segmentation network (MTG-Net). This network integrates both region and vessel morphological information, exploring semantic and geometric duality constraints within the graph domain. Specifically, MTG-Net consists of a multi-task framework and two graph-based cross-task modules: Multilateral Interaction Graph Reasoning (MIGR) and Multilateral Reinforcement Graph Reasoning (MRGR). The multi-task framework encodes rich geometric features of lesion shapes and surfaces, decoupling the image into three task-specific feature maps. MIGR and MRGR iteratively reason about higher-order relationships across tasks through a graph mechanism, enabling complementary optimization for task-specific objectives. Additionally, an uncertainty-weighted loss is proposed to mitigate the impact of artifacts and noise on segmentation accuracy. Experimental results demonstrate that MTG-Net outperforms existing methods, achieving a Dice score of 87.21% for region segmentation and 88.12% for vessel segmentation.
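
The Dice score reported above measures the overlap between a predicted mask and a ground-truth mask. The sketch below computes it for flat binary masks; the convention of returning 1.0 when both masks are empty is an assumption, and the function name `dice_score` is hypothetical.

```python
def dice_score(prediction, target):
    """Dice coefficient for two flat binary masks: 2|A∩B| / (|A| + |B|)."""
    if len(prediction) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(prediction, target))
    total = sum(prediction) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return 2.0 * intersection / total


# One of two predicted pixels overlaps the single ground-truth pixel.
partial = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
perfect = dice_score([1, 0, 1], [1, 0, 1])
```

Because the denominator counts foreground pixels only, Dice is less dominated by background than pixel accuracy, which suits thin structures such as vessels.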

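[35] AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding cs.CV

Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe

TL;DR: AlignCAT is a new query-based semantic matching framework that uses coarse-grained and fine-grained alignment modules to address insufficient cross-modal reasoning in weakly supervised visual grounding.

Details

Motivation: Existing weakly supervised visual grounding methods lack strong cross-modal reasoning and struggle to distinguish subtle semantic differences in text expressions, especially ambiguity arising from categories and attributes.

Result: On the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, AlignCAT significantly outperforms existing weakly supervised methods.

Insight: Making full use of linguistic cues (category and attribute information) is key to visual grounding performance, and combining progressive filtering with contrastive learning noticeably improves efficiency.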

Abstract: Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.

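[36] Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration cs.CV

Ting Lei, Shaofeng Yin, Qingchao Chen, Yuxin Peng, Yang Liu

TL;DR: INP-CC is an end-to-end open-vocabulary HOI detector built on interaction-aware prompts and concept calibration that significantly outperforms existing methods.

Details

Motivation: Current VLM-based methods for HOI detection suffer from suboptimal image encoders and difficulty encoding textual descriptions, which limits fine-grained interaction detection.

Result: The method outperforms existing approaches on the SWIG-HOI and HICO-DET datasets.

Insight: Interaction-aware prompts and concept calibration can significantly improve open-vocabulary HOI detection.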

Abstract: Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model’s ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model’s attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.

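[37] GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations cs.CV | cs.AI

Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao

TL;DR: GeoShield is a new adversarial perturbation framework that protects users' geolocation privacy by preventing vision-language models (VLMs) from inferring location from public images. With feature disentanglement, exposure element identification, and scale-adaptive enhancement modules, it delivers effective privacy protection across resolutions and perturbation budgets.

Details

Motivation: Existing adversarial perturbation methods protect geoprivacy poorly, especially on high-resolution images and under low perturbation budgets, and may introduce irrelevant semantic content. A more robust method is needed to counter advanced VLMs.

Result: Experiments show that GeoShield significantly outperforms existing methods in black-box settings against advanced VLMs, achieving strong privacy protection with minimal visual and semantic impact.

Insight: The study highlights the potential of adversarial perturbations for protecting geoprivacy, particularly against large pre-trained models, and suggests a new direction for practical privacy protection.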

Abstract: Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users’ locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.

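[38] ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow cs.CV

Shanshan Guo, Xiwen Liang, Junfan Lin, Yuzheng Zhuang, Liang Lin

TL;DR: The paper proposes ActionSink, a framework that improves the precision of low-level action estimation in robot manipulation by dynamically integrating self-supervised action flows.

Details

Motivation: In language-instructed robot manipulation, high-level perception and planning have progressed, but the limited precision of low-level action estimation has become the performance bottleneck.

Result: On the LIBERO benchmark, ActionSink improves the success rate by 7.9% over prior SOTA, and it gains nearly 8% in accuracy on LIBERO-Long.

Insight: Dynamically integrating action flows can significantly improve action estimation precision in robot manipulation, especially on long-horizon tasks.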

Abstract: Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While the challenges in high-level perception and planning are continually addressed with the progress of general large pre-trained models, the low precision of low-level action estimation has emerged as the key limiting factor in manipulation performance. To this end, this paper introduces a novel robot manipulation framework, ActionSink, to pave the way toward precise action estimation in learning-based robot manipulation. As the name suggests, ActionSink reformulates robot actions as action-caused optical flows from videos, called “action flow”, in a self-supervised manner; these flows are then retrieved and integrated to enhance action estimation. Specifically, ActionSink incorporates two primary modules. The first is a coarse-to-fine action flow matcher, which continuously refines the accuracy of action flow via an iterative retrieval and denoising process. The second is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows to be integrated into the current action estimation. In this module, a multi-layer fusion module integrates direct estimation and action flows from both the current step and the working memory, achieving highly accurate action estimation through a series of estimation-integration processes. ActionSink outperformed prior SOTA on the LIBERO benchmark by 7.9% in success rate and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.

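[39] Zero-shot Shape Classification of Nanoparticles in SEM Images using Vision Foundation Models cs.CV

Freida Barnatan, Emunah Goldstein, Einav Kalimian, Orchen Madar, Avi Huri

TL;DR: The paper proposes a zero-shot classification method that uses vision foundation models (SAM and DINOv2) to classify nanoparticle shapes in SEM images. It needs neither large labeled datasets nor heavy compute, and it outperforms conventional methods.

Details

Motivation: Conventional deep learning methods need large labeled datasets and substantial compute, which limits their use in nanoparticle research. The authors aim for an efficient, accessible classification method built on vision foundation models.

Result: The method achieves high-precision classification on three nanoparticle datasets, outperforms fine-tuned YOLOv11 and ChatGPT baselines, and is robust to small datasets and domain shift.

Insight: Vision foundation models such as SAM and DINOv2 can greatly simplify microscopy image analysis, offering a more efficient and accessible option for nanoparticle research.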

Abstract: Accurate and efficient characterization of nanoparticle morphology in Scanning Electron Microscopy (SEM) images is critical for ensuring product quality in nanomaterial synthesis and accelerating development. However, conventional deep learning methods for shape classification require extensive labeled datasets and computationally demanding training, limiting their accessibility to the typical nanoparticle practitioner in research and industrial settings. In this study, we introduce a zero-shot classification pipeline that leverages two vision foundation models: the Segment Anything Model (SAM) for object segmentation and DINOv2 for feature embedding. By combining these models with a lightweight classifier, we achieve high-precision shape classification across three morphologically diverse nanoparticle datasets, without the need for extensive parameter fine-tuning. Our methodology outperforms fine-tuned YOLOv11 and ChatGPT o4-mini-high baselines, demonstrating robustness to small datasets, subtle morphological variations, and domain shifts from natural to scientific imaging. Quantitative clustering metrics on PCA plots of the DINOv2 features are discussed as a means of assessing the progress of the chemical synthesis. This work highlights the potential of foundation models to advance automated microscopy image analysis, offering an alternative to traditional deep learning pipelines in nanoparticle research that is both more efficient and more accessible to the user.

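[40] Ultralight Polarity-Split Neuromorphic SNN for Event-Stream Super-Resolution cs.CV | cs.LG

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Yuk Ying Chung, Qiang Qu

TL;DR: The paper proposes an ultra-lightweight event-stream super-resolution method based on spiking neural networks (SNNs) for resource-constrained devices. With a novel polarity-split encoding strategy and a learnable spatio-temporal polarity-aware loss, the model maintains performance while significantly reducing computational cost.

Details

Motivation: Event cameras offer high temporal resolution, low latency, and high dynamic range, but their low spatial resolution limits fine-grained perception tasks. The paper proposes a lightweight event-stream super-resolution method to address this.

Result: The method is validated on multiple datasets. Its super-resolution performance is comparable to existing methods, with significantly smaller model size and inference time.

Insight: 1. The polarity-split strategy is key to the lightweight design. 2. The learnable loss adapts to task requirements and improves performance. The method suits embedded devices or use as front-end preprocessing for downstream tasks.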

Abstract: Event cameras offer unparalleled advantages such as high temporal resolution, low latency, and high dynamic range. However, their limited spatial resolution poses challenges for fine-grained perception tasks. In this work, we propose an ultra-lightweight, stream-based event-to-event super-resolution method based on Spiking Neural Networks (SNNs), designed for real-time deployment on resource-constrained devices. To further reduce model size, we introduce a novel Dual-Forward Polarity-Split Event Encoding strategy that decouples positive and negative events into separate forward paths through a shared SNN. Furthermore, we propose a Learnable Spatio-temporal Polarity-aware Loss (LearnSTPLoss) that adaptively balances temporal, spatial, and polarity consistency using learnable uncertainty-based weights. Experimental results demonstrate that our method achieves competitive super-resolution performance on multiple datasets while significantly reducing model size and inference time. The lightweight design enables embedding the module into event cameras or using it as an efficient front-end preprocessing for downstream vision tasks.
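
The abstract says LearnSTPLoss balances temporal, spatial, and polarity terms with learnable uncertainty-based weights but does not give the formula. The sketch below assumes a common simplified form of homoscedastic-uncertainty weighting, in which each term contributes exp(-s) * loss + s for a learnable log-variance s. The function name and the sample loss values are hypothetical, and the paper's actual loss may differ.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine loss terms using learnable log-variance weights.

    Assumed form (not confirmed by the abstract): each term contributes
    exp(-s) * loss + s, so a large s down-weights a term but pays a penalty.
    """
    if len(losses) != len(log_vars):
        raise ValueError("losses and log_vars must have the same length")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical temporal, spatial, and polarity loss values.
component_losses = [0.8, 0.5, 0.2]
# With every log-variance at zero, each term has weight 1.
total = uncertainty_weighted_loss(component_losses, [0.0, 0.0, 0.0])
```

In a real training loop the log-variances would be trainable parameters updated by the optimizer alongside the network weights.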

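[41] V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models cs.CV | cs.AI

Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park

TL;DR: The paper proposes V.I.P., an iterative online preference distillation method for efficient video diffusion models. By combining DPO (Direct Preference Optimization) with SFT (supervised fine-tuning), it addresses the mode collapse that reduced model capacity causes in conventional distillation, while reducing parameters and improving performance.

Details

Motivation: When deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost is essential. Existing distillation methods rely mainly on SFT, and quality degrades because the student model cannot match the teacher's outputs.

Result: Reduces parameters by 36.2% on VideoCrafter2 and 67.5% on AnimateDiff while maintaining or even surpassing the full models' performance.

Insight: Combining Direct Preference Optimization with supervised fine-tuning handles mode collapse in distillation more effectively. Careful curation of high-quality datasets and calibrated online training are also essential to the gains.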

Abstract: With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher’s outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and curating high-quality pair datasets, along with a step-by-step online approach for calibrated training. We validate our method on two leading T2V models, VideoCrafter2 and AnimateDiff, achieving parameter reduction of 36.2% and 67.5% each, while maintaining or even surpassing the performance of full models. Further experiments demonstrate the effectiveness of both ReDPO and V.I.P. framework in enabling efficient and high-quality video generation. Our code and videos are available at https://jiiiisoo.github.io/VIP.github.io/.
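
The abstract describes ReDPO as integrating DPO and SFT without stating the objective. As a rough illustration only, the sketch below uses the standard DPO loss on log-probabilities plus a weighted SFT term. The function names, `beta`, and `sft_weight` are assumptions, and diffusion-model variants of DPO typically substitute denoising errors for log-probabilities.

```python
import math


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities.

    Computes -log(sigmoid(margin)) in a numerically stable way, where the
    margin compares policy-vs-reference gains on the preferred sample
    against gains on the rejected sample.
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def combined_loss(dpo, sft, sft_weight=1.0):
    """Hypothetical combination: preference term plus weighted SFT term."""
    return dpo + sft_weight * sft


# When policy and reference agree, the DPO loss equals log(2).
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
```

The loss falls below log(2) once the student favors the preferred sample more strongly than the reference model does.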

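[42] Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation cs.CV

Gang Dai, Yifan Zhang, Yutao Qin, Qiangya Guo, Shuangping Huang

TL;DR: The paper proposes DiffBrush, a diffusion-based method for handwritten text-line generation that achieves high-quality output by decoupling content from style and learning content at multiple scales.

Details

Motivation: Existing handwritten text generation methods focus on isolated words, whereas realistic handwriting also depends on relationships between words (such as vertical alignment and horizontal spacing), so generating full text lines is a more comprehensive task.

Result: Experiments show that DiffBrush performs well in both style imitation and content accuracy and generates high-quality text lines.

Insight: Decoupling style from content and using multi-scale discriminators are effective strategies for improving handwritten text-line generation.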

Abstract: Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Code is available at https://github.com/dailenson/DiffBrush.

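[43] VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation cs.CV | cs.AI | cs.CL

Yufei Xue, Yushi Huang, Jiawei Shao, Jun Zhang

TL;DR: VLMQ is a post-training quantization framework for vision-language models (VLMs) that uses Hessian augmentation with token-level importance factors to address the modality discrepancy, achieving SOTA performance under low-bit quantization.

Details

Motivation: Existing post-training quantization (PTQ) methods target large language models (LLMs) and perform poorly on VLMs, mainly because they do not account for limited text tokens versus redundant vision tokens.

Result: Across 8 benchmarks, VLMQ performs strongly on VLMs from 0.5B to 32B parameters, with a 16.45% improvement on MME-RealWorld under 2-bit quantization.

Insight: Vision-token redundancy is a key issue in VLM quantization, and token-level importance awareness can significantly improve quantized performance.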

Abstract: Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (i.e., limited text tokens vs. excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B to 32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial 16.45% improvement on MME-RealWorld under 2-bit quantization.
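
To make the "enhanced Hessian with token-level importance factors" concrete, here is a minimal sketch. It assumes the layer-wise Hessian used by GPTQ-style PTQ methods, H = 2 * sum_i x_i x_i^T over calibration tokens, and scales each token's outer product by an importance factor w_i. The function name and the way the factors enter are assumptions, not the paper's stated formulation.

```python
def weighted_hessian(tokens, importance):
    """Accumulate H = 2 * sum_i w_i * x_i x_i^T over calibration tokens.

    tokens: list of feature vectors (one per token).
    importance: list of per-token importance factors (hypothetical input).
    Uniform factors of 1.0 recover the unweighted Hessian 2 * X X^T.
    """
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    dim = len(tokens[0])
    hessian = [[0.0] * dim for _ in range(dim)]
    for x, w in zip(tokens, importance):
        for r in range(dim):
            for c in range(dim):
                hessian[r][c] += 2.0 * w * x[r] * x[c]
    return hessian


# Two toy tokens; the second (say, a redundant vision token) is down-weighted.
example = weighted_hessian([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.5])
```

Because the weighting only rescales each token's contribution, the resulting matrix can be passed to the same weight-update routine as the unweighted one.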

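[44] Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification cs.CV

Hang Guo, Qing Zhang, Zixuan Gao, Siyuan Yang, Shulin Peng

TL;DR: The paper proposes EmmPD, an efficient multimodal framework for placental disease classification. A two-stage patch selection module and a hybrid multimodal fusion module balance computational efficiency against feature preservation, and the method performs strongly in experiments.

Details

Motivation: Accurate prediction of placental disease is critical to maternal and fetal health, but WSI analysis involves massive data volume and high computational cost. Existing methods fall short in patch selection and in preserving global context.

Result: Achieves state-of-the-art diagnostic performance on a self-constructed placental dataset and two public datasets.

Insight: Combining fine-grained patch-level analysis with global multimodal data can improve WSI classification while keeping computation efficient.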

Abstract: Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments conducted on both a self-constructed patient-level Placental dataset and two public datasets demonstrate that our method achieves state-of-the-art diagnostic performance. The code is available at https://github.com/ECNU-MultiDimLab/EmmPD.

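[45] Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation cs.CV

Jun Luo, Zijing Zhao, Yang Liu

TL;DR: The paper proposes SDGPA, which tackles zero-shot domain adaptive semantic segmentation through synthetic data generation and progressive adaptation. It generates training data in the target domain's style, improves layout accuracy, and achieves state-of-the-art results in experiments.

Details

Motivation: Deep learning semantic segmentation models perform poorly under distribution shift between training and test data, and in zero-shot domain adaptation no target-domain images are available, only a description of the target style. The authors address this with synthetic data generation and a progressive adaptation strategy.

Result: Experiments show that SDGPA achieves state-of-the-art performance in zero-shot semantic segmentation.

Insight: Synthetic data generation combined with progressive adaptation is an effective route to zero-shot domain adaptive segmentation, and patch-wise editing significantly improves the layout accuracy of synthetic images.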

Abstract: Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain’s style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA
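
The crop, edit, and merge step described in the abstract can be sketched on a toy image as follows. Here `edit_fn` is a hypothetical stand-in for the text-to-image diffusion editor, and the patch size, function names, and the `brighten` example are assumptions chosen for illustration.

```python
def edit_in_patches(image, patch, edit_fn):
    """Split a 2D image into patch x patch tiles, edit each, and merge back.

    image: list of rows (lists of pixel values).
    edit_fn: callable that takes a tile and returns a tile of the same shape;
             a placeholder for a diffusion-based style editor.
    Edge tiles are smaller when the size is not a multiple of `patch`.
    """
    height, width = len(image), len(image[0])
    output = [[None] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            edited = edit_fn(tile)
            # Write the edited tile back at its original position.
            for dy, row in enumerate(edited):
                for dx, value in enumerate(row):
                    output[top + dy][left + dx] = value
    return output


def brighten(tile):
    """Toy edit: add a constant to every pixel in the tile."""
    return [[value + 10 for value in row] for row in tile]
```

Each tile returns to its original coordinates, so the merged result keeps the source layout and the existing segmentation labels still apply.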

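[46] Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation cs.CV

Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie

TL;DR: Skywork UniPic is a 1.5-billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing without task-specific adapters, reaching SOTA performance on consumer-grade hardware.

Details

Motivation: Multimodal systems usually require complex architectures and large resources. The paper proposes a unified model that lowers deployment cost while maintaining high performance.

Result: Achieves leading scores on GenEval, DPG-Bench, GEditBench, and ImgEdit-Bench, and generates high-resolution images with under 15 GB of GPU memory.

Insight: The work shows that high-fidelity multimodal integration can be achieved at low cost, offering a template for deployable, high-performance multimodal AI.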

Abstract: We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). These results rest on three components: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.

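[47] Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching cs.CV | cs.CL

Muzhaffar Hazman, Susan McKeever, Josephine Griffith

TL;DR: The paper examines the limitations of conventional visual similarity measures for matching non-template-based internet memes and proposes a broader formulation of meme matching that goes beyond template matching. Experiments show where conventional methods fall short on non-template memes, and the paper also explores a prompting approach based on a pretrained multimodal large language model.

Details

Motivation: Internet memes play an important role in digital culture, but conventional matching methods work only for template-based memes and overlook non-template ones. This limits automated analysis and the linking of memes to contemporary web-based meme dictionaries.

Result: Segment-wise similarity computation outperforms whole-image measures on non-template memes, but fully solving non-template meme matching still requires more sophisticated techniques.

Insight: Non-template meme matching remains an open problem. Conventional visual similarity measures are not enough, and more advanced matching techniques, such as multimodal models, are needed.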

Abstract: Internet memes, now a staple of digital communication, play a pivotal role in how users engage within online communities and allow researchers to gain insight into contemporary digital culture. This engaging user-generated content is characterised by its reuse of visual elements also found in other memes. Matching instances of memes via these shared visual elements, called Meme Matching, is the basis of a wealth of meme analysis approaches. However, most existing methods assume that every meme consists of a shared visual background, called a Template, with some overlaid text, thereby limiting meme matching to comparing the background image alone. Current approaches therefore exclude the many memes that are not template-based, limit the effectiveness of automated meme analysis, and would not be effective at linking memes to contemporary web-based meme dictionaries. In this work, we introduce a broader formulation of meme matching that extends beyond template matching. We show that conventional similarity measures, including a novel segment-wise computation of the similarity measures, excel at matching template-based memes but fall short when applied to non-template-based meme formats. However, the segment-wise approach was found to consistently outperform the whole-image measures on matching non-template-based memes. Finally, we explore a prompting-based approach using a pretrained Multimodal Large Language Model for meme matching. Our results highlight that accurately matching memes via shared visual elements, not just background templates, remains an open challenge that requires more sophisticated matching techniques.

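[48] LRDDv2: Enhanced Long-Range Drone Detection Dataset with Range Information and Comprehensive Real-World Challenges cs.CV | cs.RO

Amirreza Rouhi, Sneh Patel, Noah McCarthy, Siddiqa Khan, Hadi Khorsand

TL;DR: LRDDv2 is an enhanced long-range drone detection dataset with 39,516 annotated images. It is tailored to long-range, small-object detection and adds target range information, giving fuller data support for drone detection and range estimation algorithms.

Details

Motivation: The widespread use of unmanned aerial vehicles (UAVs) creates a need for long-range detection, especially in densely populated areas. Existing drone detection datasets still lack diversity and coverage of small targets at long range.

Result: LRDDv2 provides richer experimental data for long-range drone detection and range estimation, filling a gap in existing datasets.

Insight: Long-range small-object detection remains a challenge in computer vision, and range information opens a new direction for algorithm development.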

Abstract: The exponential growth in Unmanned Aerial Vehicle (UAV) usage underscores the critical need to detect them at extended distances to ensure safe operations, especially in densely populated areas. Despite the tremendous advances made in computer vision through deep learning, the detection of these small airborne objects remains a formidable challenge. While several datasets have been developed specifically for drone detection, the need for a more extensive and diverse collection of drone image data persists, particularly for long-range detection under varying environmental conditions. We introduce here the Long Range Drone Detection (LRDD) Version 2 dataset, comprising 39,516 meticulously annotated images, as a second release of the previously published LRDD dataset. LRDDv2 enhances LRDDv1 by incorporating a greater variety of images, providing a more diverse and comprehensive resource for drone detection research. What sets LRDDv2 apart is its inclusion of target range information for over 8,000 images, making it possible to develop algorithms for drone range estimation. Tailored for long-range aerial object detection, the majority of LRDDv2 consists of images capturing drones with 50 or fewer pixels in 1080p resolution. For access to the complete Long-Range Drone Detection Dataset (LRDD)v2, please visit https://research.coe.drexel.edu/ece/imaple/lrddv2/

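[49] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation cs.CV

Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao

TL;DR: The paper proposes Macro-from-Micro Planning (MMPL), a framework for high-quality, parallelized autoregressive long video generation that uses hierarchical planning to counter the temporal drift that error accumulation causes in conventional methods.

Details

Motivation: Current autoregressive diffusion models excel at video generation but are generally limited to short clips. The analysis finds that autoregressive modeling suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis.

Result: Experiments show that the method outperforms existing long video generation models in quality and stability, with Adaptive Workload Scheduling improving parallel efficiency.

Insight: Hierarchical planning combined with parallelization is an effective way to address temporal drift and efficiency in long video generation, and adaptive scheduling further speeds up generation.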

Abstract: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframes planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are in our project page.

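[50] Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration cs.CV

Tongshun Zhang, Pingping Liu, Zixuan Zhong, Zijian Zhang, Qiuzhan Zhou

TL;DR: The paper proposes a dual-stage method for extreme dark image restoration that uses a Residual Fourier-Guided Module (RFGM) and Mamba modules to achieve high-fidelity detail recovery and edge enhancement.

Details

Motivation: Existing methods struggle to recover fine details and sharp edges in extremely dark images, which hurts downstream applications such as text and edge detection. The paper aims to address this.

Result: Demonstrates strong detail recovery on multiple benchmark datasets and downstream tasks while remaining efficient.

Insight: Combining frequency-domain processing with state space models such as Mamba offers a new direction for extreme dark image restoration.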

Abstract: Recovering fine-grained details in extremely dark images remains challenging due to severe structural information loss and noise corruption. Existing enhancement methods often fail to preserve intricate details and sharp edges, limiting their effectiveness in downstream applications like text and edge detection. To address these deficiencies, we propose an efficient dual-stage approach centered on detail recovery for dark images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections, providing robust priors for high-fidelity frequency processing while mitigating error accumulation risks from unreliable priors. The second stage employs complementary Mamba modules specifically designed for textural structure refinement: (1) Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. (2) Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. Extensive experiments on multiple benchmark datasets and downstream applications demonstrate that our method significantly improves detail recovery performance while maintaining efficiency. Crucially, the proposed modules are lightweight and can be seamlessly integrated into existing Fourier-based frameworks with minimal computational overhead. Code is available at https://github.com/bywlzts/RFGM.

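[51] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration cs.CV

Shaoguang Wang, Jianxiang He, Yijie Xu, Ziyang Chen, Weiyu Guo

TL;DR: The paper proposes Adaptive Frame-Pruning (AFP), which prunes selected video keyframes and adds a lightweight semantic graph, sharply reducing the frames and input tokens needed for video question answering while improving efficiency and accuracy.

Details

Motivation: Multimodal large language models (MLLMs) are limited in Video-QA by the high token cost of processing many video frames, and too many frames can degrade performance (context dilution). Existing keyframe selection methods also leave temporal redundancy ("visual echoes").

Result: Experiments show that AFP reduces frames by up to 86.9% and input tokens by up to 83.2%, while often outperforming baselines that use more frames.

Insight: "Less is more": carefully selecting and processing a few high-quality frames is more effective than simply adding frames, and it avoids the harm done by redundant information. A lightweight semantic graph helps balance efficiency and performance.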

Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a “less is more” phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term ‘visual echoes’. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. Conducting extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
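
The idea of merging "visual echoes" can be illustrated with a much simpler procedure than the paper's adaptive hierarchical clustering. The sketch below is a greedy pass that drops any keyframe whose feature vector is nearly identical, by cosine similarity, to the last kept frame. The function names, the fixed threshold, and the toy 2D features are all assumptions, and this is not the AFP algorithm itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_echoes(features, threshold=0.95):
    """Return indices of frames kept after dropping near-duplicates.

    A frame is dropped when it is at least `threshold` similar to the most
    recently kept frame, so each run of similar frames keeps one representative.
    """
    kept = []
    for index, feat in enumerate(features):
        if kept and cosine(features[kept[-1]], feat) >= threshold:
            continue
        kept.append(index)
    return kept


# Toy features: frames 0-1 are near-duplicates, as are frames 2-3.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.0]]
kept_indices = prune_echoes(frames)
```

In the paper's setting the features would come from fused ResNet-50 and CLIP embeddings, and the merge criterion would adapt to the video rather than use a fixed threshold.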

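[52] CIVQLLIE: Causal Intervention with Vector Quantization for Low-Light Image Enhancement cs.CV

Tongshun Zhang, Pingping Liu, Zhe Zhang, Qiuzhan Zhou

TL;DR: CIVQLLIE is a low-light image enhancement framework based on causal reasoning and vector quantization that uses multi-level causal intervention to address the limitations of existing methods under extremely dark conditions.

Details

Motivation: Existing low-light image enhancement methods lack interpretability or rely on unreliable priors, so they perform poorly under extreme darkness. Physics-based methods rely on simplified assumptions that fail in complex scenes.

Result: CIVQLLIE performs well under extremely dark conditions and outperforms existing methods while retaining good interpretability and generalization.

Insight: Combining discrete representation learning with causal intervention offers a new way to handle distribution shift in low-light image enhancement.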

Abstract: Images captured in nighttime scenes suffer from severely reduced visibility, hindering effective content perception. Current low-light image enhancement (LLIE) methods face significant challenges: data-driven end-to-end mapping networks lack interpretability or rely on unreliable prior guidance, struggling under extremely dark conditions, while physics-based methods depend on simplified assumptions that often fail in complex real-world scenarios. To address these limitations, we propose CIVQLLIE, a novel framework that leverages the power of discrete representation learning through causal reasoning. We achieve this through Vector Quantization (VQ), which maps continuous image features to a discrete codebook of visual tokens learned from large-scale high-quality images. This codebook serves as a reliable prior, encoding standardized brightness and color patterns that are independent of degradation. However, direct application of VQ to low-light images fails due to distribution shifts between degraded inputs and the learned codebook. Therefore, we propose a multi-level causal intervention approach to systematically correct these shifts. First, during encoding, our Pixel-level Causal Intervention (PCI) module intervenes to align low-level features with the brightness and color distributions expected by the codebook. Second, a Feature-aware Causal Intervention (FCI) mechanism with Low-frequency Selective Attention Gating (LSAG) identifies and enhances channels most affected by illumination degradation, facilitating accurate codebook token matching while enhancing the encoder’s generalization performance through flexible feature-level intervention. Finally, during decoding, the High-frequency Detail Reconstruction Module (HDRM) leverages structural information preserved in the matched codebook representations to reconstruct fine details using deformable convolution techniques.
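
The Vector Quantization step mentioned in the abstract maps each continuous feature to its nearest codebook entry. A minimal sketch of that lookup follows, assuming squared Euclidean distance. The function name and toy codebook are hypothetical, and the paper's causal intervention modules are not modeled here.

```python
def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    Returns (indices, quantized), where indices[i] is the position of the
    closest code for features[i] under squared Euclidean distance.
    """
    indices = []
    for feat in features:
        distances = [
            sum((f - c) ** 2 for f, c in zip(feat, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices, [codebook[i] for i in indices]


# Toy codebook with two visual tokens and two input features.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_indices, toy_quantized = quantize([[0.1, -0.1], [0.9, 1.2]], toy_codebook)
```

If degraded inputs shift the feature distribution, the nearest code may be the wrong one, which is the mismatch the abstract says the intervention modules are meant to correct.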

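[53] FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models cs.CV | cs.LG

Matteo Caligiuri, Francesco Barbato, Donald Shenaj, Umberto Michieli, Pietro Zanuttigh

TL;DR: FedPromo is an efficient method that adapts large-scale foundation models to new domains by optimizing lightweight proxy models on edge devices, significantly reducing computational overhead while preserving privacy.

Details

Motivation: Conventional federated learning on large models needs substantial compute, but edge devices are resource-limited and cannot train large models directly. FedPromo aims to solve this.

Result: Outperforms existing methods on five image classification benchmarks and is particularly suited to resource-limited clients.

Insight: Separating the large model from a lightweight proxy model and combining this with federated learning enables efficient adaptation to new domains while preserving privacy.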

Abstract: Federated Learning (FL) is an established paradigm for training deep learning models on decentralized data. However, as the size of the models grows, conventional FL approaches often require significant computational resources on client devices, which may not be feasible. We introduce FedPromo, a novel framework that enables efficient adaptation of large-scale foundation models stored on a central server to new domains encountered only by remote clients. Instead of directly training the large model on client devices, FedPromo optimizes lightweight proxy models via FL, significantly reducing computational overhead while maintaining privacy. Our method follows a two-stage process: first, server-side knowledge distillation aligns the representations of a large-scale foundation model (e.g., a transformer) with those of a compact counterpart (e.g., a CNN). Then, the compact model encoder is deployed to client devices, where trainable classifiers are learned locally. These classifiers are subsequently aggregated and seamlessly transferred back to the foundation model, facilitating personalized adaptation without requiring direct access to user data. Through novel regularization strategies, our framework enables decentralized multi-domain learning, balancing performance, privacy, and resource efficiency. Extensive experiments on five image classification benchmarks demonstrate that FedPromo outperforms existing methods while assuming limited-resource clients.
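
The aggregation of locally trained classifiers can be illustrated with a sample-weighted average in the style of FedAvg. The abstract does not state which aggregation rule FedPromo uses, so the weighting by client dataset size, the function name `aggregate_classifiers`, and the flat parameter lists are assumptions for this sketch.

```python
def aggregate_classifiers(client_weights, client_sizes):
    """Sample-weighted average of client classifier parameters (FedAvg-style sketch)."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        # Each parameter is averaged across clients, weighted by local data size.
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Toy data (assumed): two clients, each holding a two-parameter classifier head.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]  # number of local training samples per client
global_head = aggregate_classifiers(clients, sizes)
```

Only the small classifier parameters travel between clients and server in this sketch; the raw client data never leaves the device.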

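[54] GRASPing Anatomy to Improve Pathology Segmentation cs.CV

Keyi Li, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen

TL;DR: GRASP is a modular plug-and-play framework that improves pathology segmentation by leveraging existing anatomy segmentation models through pseudo-label integration and feature alignment, without retraining the anatomical components. It performs strongly on PET/CT datasets.

Details

Motivation: Current deep learning methods rely mainly on pattern recognition and ignore the anatomical context in which pathologies develop, whereas radiologists rely on anatomical knowledge to delineate pathologies accurately. A method is therefore needed to integrate anatomical knowledge into pathology segmentation models.

Result: GRASP performs strongly across multiple evaluation metrics and architectures. Its dual anatomy injection strategy (pseudo-labels as input channels plus feature fusion) effectively incorporates anatomical context.

Insight: Anatomical context is crucial for pathology segmentation. GRASP offers an efficient way to integrate it without additional training of the anatomical components, and its modular design makes it easy to extend and apply.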

Abstract: Radiologists rely on anatomical understanding to accurately delineate pathologies, yet most current deep learning approaches use pure pattern recognition and ignore the anatomical context in which pathologies develop. To narrow this gap, we introduce GRASP (Guided Representation Alignment for the Segmentation of Pathologies), a modular plug-and-play framework that enhances pathology segmentation models by leveraging existing anatomy segmentation models through pseudolabel integration and feature alignment. Unlike previous approaches that obtain anatomical knowledge via auxiliary training, GRASP integrates into standard pathology optimization regimes without retraining anatomical components. We evaluate GRASP on two PET/CT datasets, conduct systematic ablation studies, and investigate the framework’s inner workings. We find that GRASP consistently achieves top rankings across multiple evaluation metrics and diverse architectures. The framework’s dual anatomy injection strategy, combining anatomical pseudo-labels as input channels with transformer-guided anatomical feature fusion, effectively incorporates anatomical context.

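[55] Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation cs.CV

Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin

TL;DR: This paper proposes NAVIA, which neutralizes token aggregation via information augmentation to achieve Efficient Test-Time Adaptation (ETTA), significantly reducing computational overhead while improving performance.

Details

Motivation: Existing test-time adaptation (TTA) methods incur high computational cost on Vision Transformers (ViTs). Token aggregation methods are efficient but degrade performance severely, so a method is needed that preserves adaptation capability while reducing latency.

Result: On multiple distribution-shift benchmarks, NAVIA outperforms existing methods by over 2.5% while reducing inference latency by more than 20%.

Insight: Token aggregation lowers computational cost but introduces information loss. Information augmentation can effectively recover the lost information, enabling efficient test-time adaptation.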

Abstract: Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. However, existing TTA methods often incur substantial computational overhead, limiting their applicability in resource-constrained real-world scenarios. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce the total number of processed tokens. Although efficient, these methods suffer from significant performance degradation when directly integrated with existing TTA methods. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency. In this paper, we first provide a theoretical analysis from a novel mutual information perspective, showing that token aggregation inherently leads to information loss, which cannot be fully mitigated by conventional norm-tuning-based TTA methods. Guided by this insight, we propose to Neutralize Token Aggregation via Information Augmentation (NAVIA). Specifically, we directly augment the [CLS] token embedding and incorporate adaptive biases into the [CLS] token in shallow layers of ViTs. We theoretically demonstrate that these augmentations, when optimized via entropy minimization, recover the information lost due to token aggregation. Extensive experiments across various out-of-distribution benchmarks demonstrate that NAVIA significantly outperforms state-of-the-art methods by over 2.5%, while achieving an inference latency reduction of more than 20%, effectively addressing the ETTA challenge.

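[56] SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models cs.CV | cs.AI | cs.LG

Pingchuan Ma, Xiaopei Yang, Yusong Li, Ming Gui, Felix Krause

TL;DR: SCFlow learns style-content disentanglement implicitly with flow models, sidestepping the challenges of explicit disentanglement. Separation emerges naturally from bidirectional mappings.

Details

Motivation: Explicitly disentangling style and content in vision models is hampered by their semantic overlap and the subjectivity of human perception. SCFlow attempts to bypass this difficulty through implicit learning.

Result: Beyond controllable generation tasks, SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings with competitive performance, and disentanglement emerges naturally.

Insight: Learning to merge style and content invertibly lets disentanglement arise naturally, without explicit supervision or restrictive prior distributions.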

Abstract: Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges on arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles $\times$ 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process.

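[57] Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling cs.CV | cs.AI

Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen

TL;DR: This paper proposes MACT, a multi-agent collaboration framework for visual document understanding and visual question answering. Four agents with clearly defined roles, together with test-time scaling strategies, yield significant performance gains.

Details

Motivation: Existing vision-language models (VLMs) are constrained by parameter scale, lack self-correction capabilities, and underperform on tasks with long visual contexts and complex reasoning, so a more effective framework is needed.

Result: Across 15 benchmarks, the three MACT variants hold the top three positions in average score and lead on 13 of them, while keeping a small parameter scale.

Insight: Division of labor among agents, combined with agent-specific scaling strategies, achieves high performance at a small parameter scale, especially on tasks involving long visual contexts and complex reasoning.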

Abstract: Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks. Especially, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.

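[58] SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation cs.CV | cs.AI

Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu

TL;DR: SlotMatch is a simple knowledge distillation framework for unsupervised video segmentation that aligns teacher and student slot representations via cosine similarity, with no additional losses or supervision.

Details

Motivation: Unsupervised video segmentation is challenging because supervisory signals are absent and scenes are complex. Existing slot-attention-based models typically require large, computationally expensive architectures.

Result: With 3.6x fewer parameters and 1.9x faster inference, the student matches or even outperforms its teacher and surpasses previous unsupervised video segmentation models.

Insight: In knowledge distillation, a simple alignment objective can be more effective than complex multi-task learning, demonstrating the potential of unsupervised representation learning.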

Abstract: Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on two datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running 1.9x faster. Moreover, our student surpasses previous unsupervised video segmentation models.
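
Aligning teacher and student slots via cosine similarity can be sketched as follows. This is an illustration under stated assumptions rather than the SlotMatch code: slots are assumed to be paired by index, and the loss form (one minus the mean cosine similarity) and the name `slot_alignment_loss` are choices made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def slot_alignment_loss(teacher_slots, student_slots):
    """One minus the mean cosine similarity over index-paired slots (assumed form)."""
    sims = [cosine(t, s) for t, s in zip(teacher_slots, student_slots)]
    return 1.0 - sum(sims) / len(sims)


# Toy data (assumed): two slots with two-dimensional embeddings.
teacher = [[1.0, 0.0], [0.0, 2.0]]
student_aligned = [[2.0, 0.0], [0.0, 1.0]]      # same directions, different norms
student_orthogonal = [[0.0, 1.0], [3.0, 0.0]]   # directions swapped
loss_aligned = slot_alignment_loss(teacher, student_aligned)
loss_orthogonal = slot_alignment_loss(teacher, student_orthogonal)
```

Because cosine similarity ignores vector norms, the student is free to use a different embedding scale from the teacher as long as the slot directions agree.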

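[59] Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN cs.CV | cs.AI | cs.GR

Shivangi Nigam, Adarsh Prasad Behera, Shekhar Verma, P. Nagabhushan

TL;DR: This paper proposes Fd-CycleGAN, a CycleGAN-based image-to-image translation framework that combines Local Neighborhood Encoding (LNE) with frequency-aware supervision to strengthen latent representation learning and better approximate real data distributions. Experiments show that it outperforms existing methods in perceptual quality, convergence speed, and mode diversity.

Details

Motivation: The standard CycleGAN falls short in capturing local pixel semantics and in preserving structural coherence from the source domain. The paper aims to improve latent representation learning through frequency-domain supervision and local neighborhood encoding, and thereby improve image translation.

Result: Fd-CycleGAN outperforms the CycleGAN baseline and other state-of-the-art methods in perceptual quality, convergence speed, and mode diversity. It is especially strong in low-data regimes and produces more visually coherent and semantically consistent translations.

Insight: Frequency-guided latent learning significantly improves generalization in image translation. Compared with diffusion models, the lightweight adversarial approach offers advantages in training efficiency and output quality.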

Abstract: This paper presents Fd-CycleGAN, an image-to-image (I2I) translation framework that enhances latent representation learning to approximate real data distributions. Building upon the foundation of CycleGAN, our approach integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. We employ distribution-based loss metrics, including KL/JS divergence and log-based similarity measures, to explicitly quantify the alignment between real and generated image distributions in both spatial and frequency domains. To validate the efficacy of Fd-CycleGAN, we conduct experiments on diverse datasets – Horse2Zebra, Monet2Photo, and a synthetically augmented Strike-off dataset. Compared to baseline CycleGAN and other state-of-the-art methods, our approach demonstrates superior perceptual quality, faster convergence, and improved mode diversity, particularly in low-data regimes. By effectively capturing local and global distribution characteristics, Fd-CycleGAN achieves more visually coherent and semantically consistent translations. Our results suggest that frequency-guided latent learning significantly improves generalization in image translation tasks, with promising applications in document restoration, artistic style transfer, and medical image synthesis. We also provide comparative insights with diffusion-based generative models, highlighting the advantages of our lightweight adversarial approach in terms of training efficiency and qualitative output.
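
The KL and JS divergences named above can be computed for two discrete distributions as in the sketch below. It is only an illustration: how Fd-CycleGAN builds its distributions is not specified here, so the inputs are assumed to be already normalized histograms (for example of pixel intensities or frequency magnitudes), and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy histograms (assumed), each summing to 1.
real_hist = [0.5, 0.5]
fake_hist = [0.9, 0.1]
js_value = js_divergence(real_hist, fake_hist)
```

Unlike KL, the JS divergence stays finite when the two histograms have non-overlapping support, which makes it a common choice when generated and real distributions differ substantially early in training.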

Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang

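TL;DR: This paper proposes the R2GenKG framework, which builds a multi-modal medical knowledge graph (M3KG), combines multi-granularity semantic graphs with visual features, and uses a large language model to generate high-quality X-ray medical reports, addressing hallucination and weak disease diagnostic capability in existing methods.

Details

Motivation: X-ray medical report generation is an important application of medical AI, but existing methods suffer from hallucination and insufficient disease diagnostic capability. The paper aims to improve report accuracy and reliability by combining a multi-modal knowledge graph with visual features.

Result: Experiments on multiple datasets validate the effectiveness of the knowledge graph and the report generation framework, with significant gains in report accuracy and disease diagnostic capability.

Insight: Combining a multi-modal knowledge graph with visual features effectively reduces hallucination and strengthens the diagnostic capability of medical reports, offering a new approach for medical AI.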

Abstract: X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) from ground-truth medical reports using GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features, which interact with the knowledge via cross-attention. The vision tokens are fed into a Q-former, which retrieves the disease-aware vision tokens using another cross-attention. Finally, we adopt a large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets validate the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released at https://github.com/Event-AHU/Medical_Image_Analysis.

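[61] MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis cs.CV

Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang

TL;DR: The paper proposes MedCAL-Bench, the first foundation-model-based benchmark for cold-start active learning (CSAL) in medical image analysis. It evaluates 14 foundation models and 7 CSAL strategies, revealing performance differences among foundation models in the feature extraction and sample selection stages.

Details

Motivation: Cold-start active learning (CSAL) matters when the annotation budget in medical image analysis is limited, but existing methods rely on self-supervised learning for feature extraction, which is inefficient and yields limited feature representations. Foundation models (FMs) show stronger feature extraction potential, yet benchmarks for them in CSAL are lacking.

Result: 1) Most FMs are effective CSAL feature extractors, with the DINO family performing best in segmentation; 2) performance differences among FMs are large in segmentation but small in classification; 3) different datasets call for different sample selection strategies, with ALPS performing best in segmentation and RepDiv leading in classification.

Insight: Foundation models hold great potential for CSAL in medical imaging, but the model and strategy should be chosen by task type. The DINO family stands out in segmentation, while sample selection strategies need to be tuned per dataset.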

Abstract: Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at https://github.com/HiLab-git/MedCAL-Bench.

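[62] RAAG: Ratio Aware Adaptive Guidance cs.CV

Shangwen Zhu, Qianyu Peng, Yuting Hu, Zhantao Yang, Han Zhang

TL;DR: The paper proposes RAAG, a ratio-aware adaptive guidance method that addresses the over-sensitivity of early sampling steps to the guidance scale in flow-based models, significantly improving generation speed and quality.

Details

Motivation: Although flow-based models have made remarkable progress in image and video generation, how guidance mechanisms such as CFG interact with different sampling stages remains unclear, and instability arises readily in fast, low-step sampling.

Result: Experiments show that RAAG achieves up to 3x faster sampling on image (SD3.5, Lumina) and video (WAN2.1) models while maintaining or improving generation quality, robustness, and semantic alignment.

Insight: The study underscores the critical role of guidance in flow-based generative models, particularly the need to adapt it dynamically across sampling stages, and provides theoretical support for optimizing efficient generative models.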

Abstract: Flow-based generative models have recently achieved remarkable progress in image and video synthesis, with classifier-free guidance (CFG) becoming the standard tool for high-fidelity, controllable generation. However, despite their practical success, little is known about how guidance interacts with different stages of the sampling process, especially in the fast, low-step regimes typical of modern flow-based pipelines. In this work, we uncover and analyze a fundamental instability: the earliest reverse steps are acutely sensitive to the guidance scale, owing to a pronounced spike in the relative strength (RATIO) of conditional to unconditional predictions. Through rigorous theoretical analysis and empirical validation, we show that this RATIO spike is intrinsic to the data distribution, independent of the model architecture, and causes exponential error amplification when paired with strong guidance. To address this, we propose a simple, theoretically grounded, RATIO-aware adaptive guidance schedule that automatically dampens the guidance scale at early steps based on the evolving RATIO, using a closed-form exponential decay. Our method is lightweight, requires no additional inference overhead, and is compatible with standard flow frameworks. Experiments across state-of-the-art image (SD3.5, Lumina) and video (WAN2.1) models demonstrate that our approach enables up to 3x faster sampling while maintaining or improving generation quality, robustness, and semantic alignment. Extensive ablation studies further confirm the generality and stability of our schedule across models, datasets, and hyperparameters. Our findings highlight the critical role of stepwise guidance adaptation in unlocking the full potential of fast flow-based generative models.
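
The idea of damping the guidance scale where the RATIO is large can be sketched on top of the standard classifier-free guidance combination. The abstract does not give the paper's closed-form schedule or its exact RATIO definition, so everything specific here is an assumption for illustration: RATIO is taken as the norm of the conditional-minus-unconditional difference over the norm of the unconditional prediction, and the schedule `w = 1 + (w_max - 1) * exp(-alpha * ratio)` with parameters `w_max` and `alpha` is hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def adaptive_scale(ratio, w_max=7.5, alpha=1.0):
    """Hypothetical schedule: decays from w_max toward 1 as the ratio grows."""
    return 1.0 + (w_max - 1.0) * math.exp(-alpha * ratio)


def guided_prediction(v_cond, v_uncond, w_max=7.5, alpha=1.0):
    """Standard CFG combination, with the scale chosen from the current ratio."""
    delta = [c - u for c, u in zip(v_cond, v_uncond)]
    # Assumed RATIO definition; the small constant avoids division by zero.
    ratio = norm(delta) / (norm(v_uncond) + 1e-8)
    w = adaptive_scale(ratio, w_max, alpha)
    # CFG: v = v_uncond + w * (v_cond - v_uncond)
    return [u + w * d for u, d in zip(v_uncond, delta)], w


# Toy predictions (assumed): a large conditional/unconditional gap gives a damped scale.
v_early, w_early = guided_prediction([3.0, 0.0], [1.0, 0.0])
v_late, w_late = guided_prediction([1.1, 0.0], [1.0, 0.0])
```

In this sketch the scale is a pure function of the two predictions already computed for CFG, so it adds no extra network evaluations.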

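[63] CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection cs.CV

Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao

TL;DR: CoPS synthesizes dynamic prompts conditioned on visual features, significantly improving zero-shot anomaly detection and overcoming the limitations of static prompts and sparse labels.

Details

Motivation: Existing zero-shot anomaly detection methods rely on static prompts and fixed labels, which struggle to capture continuous and diverse anomaly patterns and are prone to overfitting.

Result: Across 13 industrial and medical datasets, CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation.

Insight: Dynamic prompt synthesis conditioned on visual features effectively addresses generalization and overfitting in zero-shot anomaly detection.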

Abstract: Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Additionally, integrated with our spatially-aware alignment mechanism, extensive experiments demonstrate that CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets. Code will be available at https://github.com/cqylunlun/CoPS.

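[64] Video Demoireing using Focused-Defocused Dual-Camera System cs.CV

Xuan Dong, Xiangyuan Sun, Xia Wang, Jian Song, Ya Li

TL;DR: The paper proposes a dual-camera video demoireing method that captures synchronized focused and defocused video frames and uses the defocused frames to guide demoireing of the focused frames, significantly improving results.

Details

Motivation: Moire patterns are artifacts produced when a digital camera samples high-frequency scene content. Existing single-camera methods struggle to distinguish moire from real texture and to preserve tonal consistency and temporal coherence.

Result: Experiments show that the method significantly outperforms state-of-the-art image and video demoireing methods.

Insight: A dual-camera system can use defocused frames to separate moire from real texture, offering a new direction for demoireing.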

Abstract: Moire patterns, unwanted color artifacts in images and videos, arise from the interference between spatially high-frequency scene contents and the spatial discrete sampling of digital cameras. Existing demoireing methods primarily rely on single-camera image/video processing, which faces two critical challenges: 1) distinguishing moire patterns from visually similar real textures, and 2) preserving tonal consistency and temporal coherence while removing moire artifacts. To address these issues, we propose a dual-camera framework that captures synchronized videos of the same scene: one in focus (retaining high-quality textures but may exhibit moire patterns) and one defocused (with significantly reduced moire patterns but blurred textures). We use the defocused video to help distinguish moire patterns from real texture, so as to guide the demoireing of the focused video. We propose a frame-wise demoireing pipeline, which begins with an optical flow based alignment step to address any discrepancies in displacement and occlusion between the focused and defocused frames. Then, we leverage the aligned defocused frame to guide the demoireing of the focused frame using a multi-scale CNN and a multi-dimensional training loss. To maintain tonal and temporal consistency, our final step involves a joint bilateral filter to leverage the demoireing result from the CNN as the guide to filter the input focused frame to obtain the final output. Experimental results demonstrate that our proposed framework largely outperforms state-of-the-art image and video demoireing methods.
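
The final joint bilateral filtering step can be illustrated in one dimension: the input signal is smoothed, but the range weights come from a separate guide signal. This is a generic textbook-style sketch, not the authors' pipeline. In the paper's setting the guide would be the CNN demoireing result and the filtered signal the input focused frame; the function name, window radius, and sigma values are assumptions.

```python
import math


def joint_bilateral_1d(signal, guide, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Filter `signal` using spatial weights and range weights taken from `guide`."""
    n = len(signal)
    out = []
    for i in range(n):
        num = den = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            # Spatial weight: nearby samples count more.
            w_s = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
            # Range weight from the guide: samples across a guide edge count less.
            w_r = math.exp(-((guide[i] - guide[j]) ** 2) / (2 * sigma_r ** 2))
            num += w_s * w_r * signal[j]
            den += w_s * w_r
        out.append(num / den)
    return out


# Toy data (assumed): a noisy step filtered with a clean step as the guide.
noisy = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0]
clean_guide = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
filtered = joint_bilateral_1d(noisy, clean_guide)
```

Because the weights follow the guide's edges rather than the input's, noise within each flat region is averaged out while the step between the two regions is preserved.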

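[65] AVPDN: Learning Motion-Robust and Scale-Adaptive Representations for Video-Based Polyp Detection cs.CV

Zilin Chen, Shengnan Lu

TL;DR: AVPDN is a robust framework for multi-scale polyp detection in colonoscopy videos. Its AFIA and SACI modules enhance feature representation and multi-scale context integration.

Details

Motivation: Rapid camera motion in colonoscopy videos introduces substantial background noise and increases false positives, so a robust, scale-adaptive detection method is needed.

Result: The method achieves competitive performance on several public benchmarks, validating its effectiveness and generalization ability.

Insight: Self-attention and multi-scale context integration can significantly improve the robustness and accuracy of video polyp detection.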

Abstract: Accurate detection of polyps is of critical importance for the early and intermediate stages of colorectal cancer diagnosis. Compared to static images, dynamic colonoscopy videos provide more comprehensive visual information, which can facilitate the development of effective treatment plans. However, unlike fixed-camera recordings, colonoscopy videos often exhibit rapid camera movement, introducing substantial background noise that disrupts the structural integrity of the scene and increases the risk of false positives. To address these challenges, we propose the Adaptive Video Polyp Detection Network (AVPDN), a robust framework for multi-scale polyp detection in colonoscopy videos. AVPDN incorporates two key components: the Adaptive Feature Interaction and Augmentation (AFIA) module and the Scale-Aware Context Integration (SACI) module. The AFIA module adopts a triple-branch architecture to enhance feature representation. It employs dense self-attention for global context modeling, sparse self-attention to mitigate the influence of low query-key similarity in feature aggregation, and channel shuffle operations to facilitate inter-branch information exchange. In parallel, the SACI module is designed to strengthen multi-scale feature integration. It utilizes dilated convolutions with varying receptive fields to capture contextual information at multiple spatial scales, thereby improving the model’s denoising capability. Experiments conducted on several challenging public benchmarks demonstrate the effectiveness and generalization ability of the proposed method, achieving competitive performance in video-based polyp detection tasks.
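
How dilation widens the receptive field of a convolution without adding weights can be shown with a one-dimensional example. This is a generic illustration of dilated convolution (computed as cross-correlation without padding, as in most deep learning libraries), not the SACI module itself; the helper names are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid (no padding) 1D cross-correlation with a dilated kernel."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one output of a single dilated layer."""
    return (kernel_size - 1) * dilation + 1


# Toy data (assumed): the same 3-tap kernel applied at two dilation rates.
signal = [1, 2, 3, 4, 5]
kernel = [1, 1, 1]
dense = dilated_conv1d(signal, kernel, dilation=1)
dilated = dilated_conv1d(signal, kernel, dilation=2)
```

With the same three weights, dilation 2 covers five input positions instead of three, which is how branches with different dilation rates gather context at several spatial scales.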

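[66] IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models cs.CV

Jiabing Yang, Chenhang Cui, Yiyang Zhou, Yixiang Chen, Peng Xia

TL;DR: The paper proposes IKOD, a collaborative decoding strategy that mitigates visual attention degradation in large vision-language models (LVLMs) and thereby reduces hallucinations.

Details

Motivation: Current LVLMs attend progressively less to visual input as the generated sequence grows, which may be a main cause of increased hallucination. The paper seeks to keep the model focused on visual input by improving the decoding strategy.

Result: Experiments show that IKOD is effective at reducing hallucinations and improving comprehensive capabilities, without additional training or external tools.

Insight: The study finds that diminishing visual attention as the sequence grows is a key factor behind hallucinations, and that a simple change to the decoding strategy can significantly improve model performance.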

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to “hallucinations”, outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model’s attention to visual input diminishes as the generated sequence grows, which we hypothesize to be a key factor contributing to observed increasing hallucinations. Based on these insights, we propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding, effectively mitigating attention degradation and suppressing hallucinations while not incurring too much inference cost. Extensive experiments on both hallucination and comprehensive benchmarks demonstrate IKOD’s superior effectiveness in mitigating hallucinations and improving comprehensive capacities for LVLMs. Importantly, IKOD requires no additional training or external tools, making it a lightweight and efficient framework applicable to various models.

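[67] VideoGuard: Protecting Video Content from Unauthorized Editing cs.CV | cs.AI

Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, Xiaoping Zhang

TL;DR: This paper proposes VideoGuard, a method that protects video content from unauthorized malicious editing by jointly optimizing frames and fusing motion information to disrupt generative diffusion models.

Details

Motivation: As generative technology advances, malicious individuals may misuse it for misleading activities, and research on protecting video content remains clearly insufficient.

Result: Experiments show that VideoGuard's protection outperforms all baseline methods, with its effectiveness confirmed by both subjective and objective metrics.

Insight: Video protection must account for inter-frame redundancy and motion information. Simple frame-by-frame image protection cannot effectively resist malicious editing.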

Abstract: With the rapid development of generative technology, current generative models can generate high-fidelity digital content and edit it in a controlled manner. However, there is a risk that malicious individuals might misuse these capabilities for misleading activities. Although existing research has attempted to shield photographic images from being manipulated by generative models, protection against the editing of video content lags significantly behind. To bridge the gap, we propose a protection method named VideoGuard, which can effectively protect videos from unauthorized malicious editing. This protection is achieved through the subtle introduction of nearly unnoticeable perturbations that interfere with the functioning of the intended generative diffusion models. Due to the redundancy between video frames and the inter-frame attention mechanism in video diffusion models, simply applying image-based protection methods separately to every video frame cannot shield a video from unauthorized editing. To tackle this challenge, we adopt joint frame optimization, treating all video frames as a single optimization entity. Furthermore, we extract video motion information and fuse it into the optimization objectives. Thus, these alterations can effectively force the models to produce outputs that are implausible and inconsistent. We provide a pipeline to optimize this perturbation. Finally, we use both objective and subjective metrics to demonstrate the efficacy of our method, and the results show that the protection performance of VideoGuard is superior to all the baseline methods.

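[68] When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models cs.CV | cs.AI

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

TL;DR: The paper studies demographic bias in objects (e.g., cars) generated by text-to-image models and proposes the SODA framework to measure it systematically. An analysis of 2,700 images from three state-of-the-art models reveals strong associations between specific demographic groups and visual attributes.

Details

Motivation: Prior work has focused mostly on bias in human depictions and overlooked demographic bias in generated objects. This bias is subtle but pervasive and calls for systematic measurement and analysis.

Result: Strong associations were found between specific demographic groups and visual attributes such as color patterns, including both well-known stereotypes and subtle, unintuitive biases. Some models generate less diverse outputs, which amplifies the visual disparities.

Insight: Demographic bias in generated objects may reflect and reinforce social stereotypes. Systematic auditing methods are needed to drive more responsible AI development.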

Abstract: While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., “for young people”) to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today’s generative models. We see this as an essential step toward more systematic and responsible AI development.

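[69] ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes cs.CV

Yu Zhou, Pelle Thielmann, Ayush Chamoli, Bruno Mirbach, Didier Stricker

TL;DR: The paper proposes ParticleSAM, an adaptation for segmenting dense small particles in recycled construction material, and creates a simulated dense multi-particle dataset to validate the method.

Details

Motivation: Construction is a major sector for resource consumption, yet quality control of recycled material still relies on manual methods. Existing vision-based segmentation methods are not directly applicable to images of dense small particles, so a more efficient technical solution is needed.

Result: ParticleSAM outperforms the original SAM in both quantitative and qualitative evaluations, validating its advantages for dense small-particle segmentation.

Insight: The study shows the potential of adapting foundation models to specific domains and provides a new tool for automating material quality monitoring. The method may also extend to other applications that require small-particle segmentation.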

Abstract: The construction industry represents a major sector in terms of resource consumption. Recycled construction material has high reuse potential, but quality monitoring of the aggregates is typically still performed with manual methods. Vision-based machine learning methods could offer a faster and more efficient solution to this problem, but existing segmentation methods are by design not directly applicable to images with hundreds of small particles. In this paper, we propose ParticleSAM, an adaptation of the segmentation foundation model to images with small and dense objects such as the ones often encountered in construction material particles. Moreover, we create a new dense multi-particle dataset simulated from isolated particle images with the assistance of an automated data generation and labeling pipeline. This dataset serves as a benchmark for visual material quality control automation while our segmentation approach has the potential to be valuable in application areas beyond construction where small-particle segmentation is needed. Our experimental results validate the advantages of our method by comparing to the original SAM method both in quantitative and qualitative experiments.

Shreyank N Gowda, Xiaobo Jin, Christian Wagner

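TL;DR: The paper proposes Prototype-Enhanced Confidence Modeling (PECM), a framework that addresses semantic ambiguity and variability in medical image-report cross-modal retrieval. Multi-level prototypes and dual-stream confidence estimation improve retrieval robustness and accuracy.

Details

Motivation: Medical images and text reports are related in semantically complex ways, and existing methods struggle to capture multi-level semantic variability, which leads to unreliable retrieval results.

Result: The method is validated on multiple datasets and tasks, with performance gains of up to 10.17% and a new state of the art.

Insight: Modeling prototypes and confidence handles ambiguity in medical data more effectively and makes cross-modal retrieval more reliable.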

Abstract: In cross-modal retrieval tasks, such as image-to-report and report-to-image retrieval, accurately aligning medical images with relevant text reports is essential but challenging due to the inherent ambiguity and variability in medical data. Existing models often struggle to capture the nuanced, multi-level semantic relationships in radiology data, leading to unreliable retrieval results. To address these issues, we propose the Prototype-Enhanced Confidence Modeling (PECM) framework, which introduces multi-level prototypes for each modality to better capture semantic variability and enhance retrieval robustness. PECM employs a dual-stream confidence estimation that leverages prototype similarity distributions and an adaptive weighting mechanism to control the impact of high-uncertainty data on retrieval rankings. Applied to radiology image-report datasets, our method achieves significant improvements in retrieval precision and consistency, effectively handling data ambiguity and advancing reliability in complex clinical scenarios. We report results on multiple datasets and tasks, including fully supervised and zero-shot retrieval, obtaining performance gains of up to 10.17% and establishing a new state of the art.

[71] EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation cs.CVPDF

Deqiang Yin, Junyi Guo, Huanda Lu, Fangyu Wu, Dongming Lu

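TL;DR: The paper proposes an automated pipeline for constructing a garment editing dataset, addressing the scarcity of high-quality instruction-image pairs that has limited progress, and introduces a semantic-aware evaluation metric.

Details

Motivation: Garment editing requires understanding garment-specific semantics and attribute dependencies, but progress is limited because manually annotating high-quality instruction-image pairs is costly and hard to scale.

Result: The authors build EditGarment, the first instruction-based dataset tailored to standalone garment editing, generating 52,257 candidate triplets and retaining 20,596 high-quality ones.

Insight: Automated data synthesis and semantic-aware evaluation can substantially improve the diversity and quality of garment editing datasets and lay a foundation for related tasks.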

Abstract: Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.

[72] Distribution-aware Knowledge Unification and Association for Non-exemplar Lifelong Person Re-identification cs.CVPDF

Shiben Liu, Mingyue Xu, Huijie Fan, Qiang Wang, Yandong Tang

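TL;DR: The paper proposes a distribution-aware knowledge unification and association (DKUA) framework that uses domain-style modeling and unified knowledge learning to balance retaining old knowledge with adapting to new information in lifelong person re-identification (LReID).

Details

Motivation: Existing LReID methods typically use knowledge distillation for representation alignment but ignore two key factors: distribution awareness and cross-domain unified knowledge learning. DKUA aims to fill these gaps.

Result: Experiments show that DKUA improves average mAP/R@1 by 7.6%/5.3% on anti-forgetting and generalization capacity, respectively, clearly outperforming existing methods.

Insight: By modeling distribution awareness and inter-domain associations, DKUA balances knowledge retention with adaptation to new domains, offering a new direction for lifelong learning tasks.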

Abstract: Lifelong person re-identification (LReID) encounters a key challenge: balancing the preservation of old knowledge with adaptation to new information. Existing LReID methods typically employ knowledge distillation to enforce representation alignment. However, these approaches ignore two crucial aspects: specific distribution awareness and cross-domain unified knowledge learning, both of which are essential for addressing this challenge. To overcome these limitations, we propose a novel distribution-aware knowledge unification and association (DKUA) framework where domain-style modeling is performed for each instance to propagate domain-specific representations, enhancing anti-forgetting and generalization capacity. Specifically, we design a distribution-aware model to transfer instance-level representations of the current domain into the domain-specific representations with the different domain styles, preserving learned knowledge without storing old samples. Next, we propose adaptive knowledge consolidation (AKC) to dynamically generate the unified representation as a cross-domain representation center. To further mitigate forgetting, we develop a unified knowledge association (UKA) mechanism, which explores the unified representation as a bridge to explicitly model inter-domain associations, reducing inter-domain gaps. Finally, distribution-based knowledge transfer (DKT) is proposed to prevent the current domain distribution from deviating from the cross-domain distribution center, improving adaptation capacity. Experimental results show our DKUA outperforms the existing methods by 7.6%/5.3% average mAP/R@1 improvement on anti-forgetting and generalization capacity, respectively. Our code will be publicly released.

[73] CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation cs.CVPDF

Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen

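TL;DR: CoEmoGen is a semantically coherent and scalable framework for emotional image generation. It uses multimodal large language models to produce high-quality emotion-focused captions and introduces a hierarchical low-rank adaptation module to model emotional features.

Details

Motivation: Existing text-to-image diffusion models handle abstract emotional content poorly, and existing methods rely on word-level attribute labels, which leads to semantic incoherence and limited scalability.

Result: Experiments show that CoEmoGen outperforms existing methods in emotional faithfulness and semantic coherence, as supported by quantitative, qualitative, and user-study evaluations.

Insight: Psychology-inspired hierarchical modeling and high-quality semantic guidance are key to better emotional image generation; a large-scale emotional dataset is crucial for scalability.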

Abstract: Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. There have also emerged methods specifically designed for EICG, but they excessively rely on word-level attribute labels for guidance, which suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen’s superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code are available at https://github.com/yuankaishen2001/CoEmoGen.

[74] Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection cs.CVPDF

Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang

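TL;DR: The paper proposes ARAS, a language-driven local auto-regressive anomaly synthesis method, and combines it with the QARAD framework to improve anomaly detection, significantly outperforming existing methods.

Details

Motivation: Existing anomaly synthesis methods suffer from micro-structural discontinuities, limited semantic controllability, and inefficient generation.

Result: The approach significantly outperforms existing methods on multiple datasets and achieves a 5x synthesis speedup.

Insight: Language-conditioned control and local auto-regressive synthesis are key to improving anomaly detection performance.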

Abstract: Despite substantial progress in anomaly synthesis methods, existing diffusion-based and coarse inpainting pipelines commonly suffer from structural deficiencies such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. To overcome these limitations, we introduce ARAS, a language-conditioned, auto-regressive anomaly synthesis approach that precisely injects local, text-specified defects into normal images via token-anchored latent editing. Leveraging a hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel, ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies. Integrated within our Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, we further propose a dynamic weighting strategy that emphasizes high-quality synthetic samples by computing an image-text similarity score with a dual-encoder model. Extensive experiments across three benchmark datasets (MVTec AD, VisA, and BTAD) demonstrate that our QARAD outperforms SOTA methods in both image- and pixel-level anomaly detection tasks, achieving improved accuracy, robustness, and a 5x synthesis speedup compared to diffusion-based alternatives. Our complete code and synthesized dataset will be publicly available.

[75] Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences cs.CVPDF

Dmitrii Korzh, Dmitrii Tarasov, Artyom Iudin, Elvir Karimov, Matvey Skripkin

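TL;DR: The paper presents the first open-source large-scale dataset and models for converting spoken mathematical expressions into LaTeX, covering English and Russian, and outperforms existing models on several metrics.

Details

Motivation: Converting spoken mathematical expressions into LaTeX is needed in education and research (for example, lecture transcription) but remains underexplored; existing approaches suffer from scarce data and limited coverage.

Result: On the MathSpeech benchmark the CER is 28% (versus 30% for the baseline); on S2L-equations the CER is 27% (versus 64% for the baseline); on mathematical sentence recognition the equation CER is 40%.

Insight: Multimodal AI shows considerable potential for recognizing mathematical content; larger datasets and better models could improve performance further.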

Abstract: Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LM), the problem of converting spoken mathematics into LaTeX remains underexplored. This task directly applies to educational and research domains, such as lecture transcription or note creation. Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. In addition to the ASR post-correction models and few-shot prompting, we apply audio language models, demonstrating comparable character error rate (CER) results on the MathSpeech benchmark (28% vs. 30%) for the equations conversion. In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
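
The character error rate (CER) figures quoted in this entry follow the usual definition: the character-level edit distance between the predicted and reference strings, divided by the reference length. The sketch below is a minimal illustration under that standard definition; it is not the authors' evaluation script, and any LaTeX normalization they apply before scoring is not reproduced here. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings only (not taken from the dataset).
reference = r"\frac{a}{b}"
hypothesis = r"\frac{a}{c}"
score = cer(reference, hypothesis)  # one substitution over 11 characters
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.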

[76] A Scalable Machine Learning Pipeline for Building Footprint Detection in Historical Maps cs.CV | I.4PDF

Annemarie McCarthy

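TL;DR: The paper proposes a scalable and efficient machine learning pipeline for detecting building footprints in historical maps, targeting the challenge of sparse building distributions in rural areas. A hierarchical CNN approach first filters out sections that contain no buildings and then segments the remaining sections in detail, which substantially improves efficiency.

Details

Motivation: Historical maps are an important resource for studying past landscapes and settlement patterns, but existing machine learning methods focus mainly on urban areas and are computationally expensive. The need for large-scale analysis of rural areas is unmet, which limits research questions such as verifying historical census data or locating abandoned settlements.

Result: Tests on Irish historical map series show higher performance and better efficiency than conventional segmentation-only approaches. The pipeline identified a settlement that may have been abandoned during the Great Famine period, demonstrating its archaeological potential.

Insight: The method improves the efficiency of building detection and gives historians and archaeologists a new tool, particularly for analyzing sparse regions across long time spans.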

Abstract: Historical maps offer a valuable lens through which to study past landscapes and settlement patterns. While prior research has leveraged machine learning based techniques to extract building footprints from historical maps, such approaches have largely focused on urban areas and tend to be computationally intensive. This presents a challenge for research questions requiring analysis across extensive rural regions, such as verifying historical census data or locating abandoned settlements. In this paper, this limitation is addressed by proposing a scalable and efficient pipeline tailored to rural maps with sparse building distributions. The method described employs a hierarchical machine learning based approach: convolutional neural network (CNN) classifiers are first used to progressively filter out map sections unlikely to contain buildings, significantly reducing the area requiring detailed analysis. The remaining high probability sections are then processed using CNN segmentation algorithms to extract building features. The pipeline is validated using test sections from the Ordnance Survey Ireland historical 25 inch map series and 6 inch map series, demonstrating both high performance and improved efficiency compared to conventional segmentation-only approaches. Application of the technique to both map series, covering the same geographic region, highlights its potential for historical and archaeological discovery. Notably, the pipeline identified a settlement of approximately 22 buildings in Tully, Co. Galway, present in the 6 inch map, produced in 1839, but absent from the 25 inch map, produced in 1899, suggesting it may have been abandoned during the Great Famine period.
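
The filter-then-segment idea described in this entry can be sketched as a cascade: cheap classifiers discard map sections unlikely to contain buildings, and only the survivors reach the expensive segmentation step. The code below is a hedged illustration of that control flow only. The classifier and segmenter are stand-in callables, and the 0.5 threshold is an assumption, not a value from the paper.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Tile = TypeVar("Tile")
Mask = TypeVar("Mask")


def filter_then_segment(
    tiles: Iterable[Tile],
    classifiers: Sequence[Callable[[Tile], float]],
    segment: Callable[[Tile], Mask],
    threshold: float = 0.5,  # assumed cut-off on the building probability
) -> List[Tuple[Tile, Mask]]:
    """Run classifiers in order; segment only tiles that pass every stage."""
    candidates = list(tiles)
    for classify in classifiers:
        candidates = [t for t in candidates if classify(t) >= threshold]
    return [(t, segment(t)) for t in candidates]


# Toy stand-ins: each "tile" is just a number playing the role of a score.
segment_calls = []

def toy_segment(tile: float) -> str:
    segment_calls.append(tile)
    return f"mask({tile})"

result = filter_then_segment(
    tiles=[0.1, 0.6, 0.9],
    classifiers=[lambda t: t, lambda t: t - 0.2],
    segment=toy_segment,
)
```

In this toy run only one of three tiles reaches the segmenter, which is where the efficiency gain on sparse rural maps would come from.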

[77] RadProPoser: A Framework for Human Pose Estimation with Uncertainty Quantification from Raw Radar Data cs.CVPDF

Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann

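TL;DR: RadProPoser is a probabilistic human pose estimation framework based on radar data. It is the first to quantify per-joint uncertainty from raw radar tensors, enabling privacy-preserving, illumination-invariant pose estimation.

Details

Motivation: Radar offers privacy preservation and illumination invariance but is affected by noise and multipath effects, so existing methods struggle to estimate human pose accurately and to quantify uncertainty.

Result: On a newly released dataset, the method reaches a mean per-joint position error of 6.425 cm and an expected calibration error of 0.021, and its use for data augmentation improves downstream task performance.

Insight: The learned uncertainties correlate strongly with actual errors, providing a basis for explainable and reliable human motion analysis in radar applications.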

Abstract: Radar-based human pose estimation (HPE) provides a privacy-preserving, illumination-invariant sensing modality but is challenged by noisy, multipath-affected measurements. We introduce RadProPoser, a probabilistic encoder-decoder architecture that processes complex-valued radar tensors from a compact 3-transmitter, 4-receiver MIMO radar. By incorporating variational inference into keypoint regression, RadProPoser jointly predicts 26 three-dimensional joint locations alongside heteroscedastic aleatoric uncertainties and can be recalibrated to predict total uncertainty. We explore different probabilistic formulations using both Gaussian and Laplace distributions for latent priors and likelihoods. On our newly released dataset with optical motion-capture ground truth, RadProPoser achieves an overall mean per-joint position error (MPJPE) of 6.425 cm, with 5.678 cm at the 45 degree aspect angle. The learned uncertainties exhibit strong alignment with actual pose errors and can be calibrated to produce reliable prediction intervals, with our best configuration achieving an expected calibration error of 0.021. As an additional demonstration, sampling from these latent distributions enables effective data augmentation for downstream activity classification, resulting in an F1 score of 0.870. To our knowledge, this is the first end-to-end radar tensor-based HPE system to explicitly model and quantify per-joint uncertainty from raw radar tensor data, establishing a foundation for explainable and reliable human motion analysis in radar applications.
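
The mean per-joint position error (MPJPE) reported in this entry is, in its standard form, the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints. A minimal sketch under that standard definition follows; the function name and the two-joint example are illustrative, and any alignment step the authors may apply before scoring is not modeled.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def mpjpe(pred: Sequence[Point3D], target: Sequence[Point3D]) -> float:
    """Average Euclidean distance between matching 3D joints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two non-empty joint lists of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Two hypothetical joints, coordinates in centimetres.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error_cm = mpjpe(predicted, ground_truth)  # distances 3 and 4, mean 3.5
```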

[78] DyCAF-Net: Dynamic Class-Aware Fusion Network cs.CV | cs.LGPDF

Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen

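TL;DR: DyCAF-Net is a dynamic class-aware fusion network that improves multi-scale feature fusion through input-conditioned and class-aware mechanisms, significantly improving detection under occlusion and class imbalance.

Details

Motivation: Current object detection methods are limited in dynamic scenes (for example, with occlusion or class imbalance), mainly because of static fusion strategies and class-agnostic attention.

Result: Precision and mAP improve significantly across 13 datasets while computational efficiency and inference speed are maintained.

Insight: Through dynamic and class-aware mechanisms, DyCAF-Net addresses occlusion and class imbalance effectively and suits complex real-world detection tasks.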

Abstract: Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net) that addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency ($\sim$11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.

[79] evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition cs.CVPDF

Rodrigo Verschae, Ignacio Bugueno-Cordova

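TL;DR: The paper proposes evTransFER, a transfer-learning framework for facial expression recognition with event-based cameras. A feature extractor trained on a different task (facial reconstruction) is transferred and combined with an LSTM that captures longer-term dynamics, significantly improving recognition performance.

Details

Motivation: Event-based cameras offer high dynamic range and microsecond latency, which suits capturing the spatio-temporal dynamics of facial expressions. Training directly from scratch is inefficient, so the authors propose a transfer-learning framework to improve performance.

Result: The framework achieves a 93.6% recognition rate on the e-CK+ database, an improvement of 25.9 percentage points or more over existing methods.

Insight: Transfer learning from other tasks can significantly improve performance on event-camera tasks; encoding spatio-temporal dynamics is key; combining an LSTM with a new event representation improves results further.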

Abstract: Event-based cameras are bio-inspired vision sensors that asynchronously capture per-pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing valuable information about the spatio-temporal dynamics of the scene. In the present work, we propose evTransFER, a transfer learning-based framework and architecture for face expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode the spatio-temporal dynamics of faces, built by training an adversarial generative method on a different problem (facial reconstruction) and then transferring the trained encoder weights to the face expression recognition system. We show that this proposed transfer learning method greatly improves the ability to recognize facial expressions compared to training a network from scratch. In addition, we propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics, and we introduce a new event-based representation, referred to as TIE, both of which further improve the results. We evaluate the proposed framework on the event-based facial expression database e-CK+ and compare it to state-of-the-art methods. The results show that the proposed framework evTransFER achieves a 93.6% recognition rate on the e-CK+ database, significantly improving the accuracy (25.9% points or more) when compared to state-of-the-art performance for similar problems.

[80] AttZoom: Attention Zoom for Better Visual Features cs.CV | cs.AIPDF

Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana

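TL;DR: The paper proposes Attention Zoom, a modular, model-agnostic spatial attention mechanism for improving feature extraction in convolutional neural networks (CNNs). By spatially emphasizing high-importance regions of the input, it improves Top-1 and Top-5 accuracy on classification tasks.

Details

Motivation: Traditional attention mechanisms usually require architecture-specific integration, which limits their generality and flexibility. The paper aims to design a standalone attention layer that improves CNN performance without modifying the backbone.

Result: Experiments on CIFAR-100 and TinyImageNet show that Attention Zoom consistently improves Top-1 and Top-5 classification accuracy across multiple CNN backbones.

Insight: A standalone, modular attention layer can improve CNN feature extraction without adding architectural complexity, and is especially suited to tasks that need fine-grained attention.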

Abstract: We present Attention Zoom, a modular and model-agnostic spatial attention mechanism designed to improve feature extraction in convolutional neural networks (CNNs). Unlike traditional attention approaches that require architecture-specific integration, our method introduces a standalone layer that spatially emphasizes high-importance regions in the input. We evaluated Attention Zoom on multiple CNN backbones using CIFAR-100 and TinyImageNet, showing consistent improvements in Top-1 and Top-5 classification accuracy. Visual analyses using Grad-CAM and spatial warping reveal that our method encourages fine-grained and diverse attention patterns. Our results confirm the effectiveness and generality of the proposed layer for improving CNNs with minimal architectural overhead.
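
Top-1 and Top-5 accuracy, the metrics named in this entry, count a prediction as correct when the true label is among the k highest-scoring classes. The sketch below illustrates that standard definition on made-up scores; it is not tied to the paper's code, and the function name is hypothetical.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 1) -> float:
    """Fraction of samples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("need one score row per label, and at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Three hypothetical samples over three classes.
example_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
example_labels = [1, 1, 0]
top1 = top_k_accuracy(example_scores, example_labels, k=1)
top2 = top_k_accuracy(example_scores, example_labels, k=2)
```

Top-k accuracy never decreases as k grows, which is why Top-5 figures are always at least as high as Top-1.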

[81] Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images cs.CVPDF

Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang

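TL;DR: Uni3R is a unified framework for 3D reconstruction and semantic understanding that uses generalizable Gaussian splatting to produce 3D scene representations with open-vocabulary semantics directly from unposed multi-view images.

Details

Motivation: Conventional methods decouple semantic understanding from reconstruction or require costly per-scene optimization, which limits scalability and generalizability. Uni3R aims to address these issues.

Result: Uni3R sets a new state of the art on multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet.

Insight: Uni3R demonstrates the potential of a unified representation for 3D reconstruction and semantic understanding and offers a new paradigm for generalizable 3D scene understanding.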

Abstract: Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
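
The PSNR figure quoted in this entry uses the standard peak signal-to-noise ratio, PSNR = 10 log10(MAX^2 / MSE), where MSE is the mean squared error between the rendered and reference images. A minimal sketch on flattened pixel lists follows, assuming intensities in [0, 1]; the function name and sample values are illustrative only.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("need two non-empty pixel lists of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two hypothetical pixels, each off by 0.1, give an MSE of about 0.01.
value_db = psnr([0.0, 1.0], [0.1, 0.9])
```

Higher is better: each 10 dB gain corresponds to a tenfold reduction in mean squared error.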

[82] OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World cs.CV | cs.ROPDF

Katherine Liu, Sergey Zakharov, Dian Chen, Takuya Ikeda, Greg Shakhnarovich

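TL;DR: OmniShape decouples shape completion into multi-modal distributions, enabling zero-shot, multi-hypothesis shape and pose estimation.

Details

Motivation: The goal is to estimate an object's pose and full shape from a single observation without assuming a known 3D model or category.

Result: The method performs well on real-world datasets and supports sampling multiple hypotheses.

Insight: Decoupling shape completion into multi-modal distributions is an effective route to zero-shot pose and shape estimation.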

Abstract: We would like to estimate the pose and full shape of an object from a single observation, without assuming known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distribution. OmniShape demonstrates compelling performance on challenging real world datasets. Project website: https://tri-ml.github.io/omnishape

[83] La La LiDAR: Large-Scale Layout Generation from LiDAR Data cs.CV | cs.ROPDF

Youquan Liu, Lingdong Kong, Weidong Yang, Xin Li, Ao Liang

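TL;DR: The paper proposes “La La LiDAR”, a layout-guided generative framework that uses semantic-enhanced scene graph diffusion with relation-aware contextual conditioning to generate structured LiDAR layouts. It supports customizable control over object placement while ensuring spatial and semantic consistency.

Details

Motivation: Current diffusion-based LiDAR generation methods produce high-fidelity data but lack explicit control over foreground objects and spatial relationships, which limits their use in scenario simulation and safety validation.

Result: Experiments show that La La LiDAR achieves state-of-the-art performance in both LiDAR generation and downstream perception tasks, setting a new benchmark for controllable 3D scene generation.

Insight: Scene graph diffusion and relation-aware modeling improve generation quality and allow more flexible scene customization, providing more reliable simulation data for autonomous driving and robotics.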

Abstract: Controllable generation of realistic LiDAR scenes is crucial for applications such as autonomous driving and robotics. While recent diffusion-based models achieve high-fidelity LiDAR generation, they lack explicit control over foreground objects and spatial relationships, limiting their usefulness for scenario simulation and safety validation. To address these limitations, we propose Large-scale Layout-guided LiDAR generation model (“La La LiDAR”), a novel layout-guided generative framework that introduces semantic-enhanced scene graph diffusion with relation-aware contextual conditioning for structured LiDAR layout generation, followed by foreground-aware control injection for complete scene generation. This enables customizable control over object placement while ensuring spatial and semantic consistency. To support our structured LiDAR generation, we introduce Waymo-SG and nuScenes-SG, two large-scale LiDAR scene graph datasets, along with new evaluation metrics for layout synthesis. Extensive experiments demonstrate that La La LiDAR achieves state-of-the-art performance in both LiDAR generation and downstream perception tasks, establishing a new benchmark for controllable 3D scene generation.

[84] LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences cs.CV | cs.ROPDF

Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li

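TL;DR: LiDARCrafter is a unified framework for dynamic 4D LiDAR modeling and editing. It generates controllable 4D LiDAR sequences from natural language input, combining scene graphs with a multi-branch diffusion network to support scene editing and temporal consistency.

Details

Motivation: Existing generative world models focus mostly on videos or occupancy grids and overlook the unique properties of LiDAR data. Dynamic 4D LiDAR modeling faces challenges in controllability, temporal coherence, and evaluation standardization.

Result: Experiments on the nuScenes dataset show that LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency.

Insight: By combining natural language with structured conditions, LiDARCrafter offers a new approach to LiDAR data generation and editing, with applications in data augmentation and simulation for autonomous driving.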

Abstract: Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.

[85] LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation cs.CVPDF

Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si

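TL;DR: LongVie is an end-to-end autoregressive framework for controllable ultra-long video generation. It addresses the temporal consistency and visual quality problems of existing methods through unified noise initialization, global control signal normalization, and a multi-modal control framework, producing high-quality long videos.

Details

Motivation: Existing video generation methods work well on short clips but struggle to scale to ultra-long videos, mainly because of temporal inconsistency and visual degradation. LongVie aims to solve these problems and enable controllable long video generation.

Result: Experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.

Insight: Multi-modal control and temporal consistency are the key challenges in long video generation; unified noise initialization and global normalization can significantly improve performance.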

Abstract: Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
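
To see why independent control signal normalization can hurt consistency across clips, compare per-clip and global scaling of a control signal such as depth. The sketch below assumes simple min-max normalization, which the abstract does not specify, and uses made-up depth values; the function names are hypothetical.

```python
from typing import List, Sequence


def normalize_per_clip(clips: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale each clip using only its own value range."""
    out = []
    for clip in clips:
        lo, hi = min(clip), max(clip)
        scale = (hi - lo) or 1.0  # avoid dividing by zero on constant clips
        out.append([(v - lo) / scale for v in clip])
    return out


def normalize_globally(clips: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale every clip with the range of the whole video."""
    values = [v for clip in clips for v in clip]
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0
    return [[(v - lo) / scale for v in clip] for clip in clips]


# Hypothetical depth values (metres) for two consecutive clips.
depth_clips = [[2.0, 4.0], [4.0, 10.0]]
per_clip = normalize_per_clip(depth_clips)
global_norm = normalize_globally(depth_clips)
```

With per-clip scaling the same 4.0 m depth maps to 1.0 in the first clip and 0.0 in the second; with global scaling it maps to 0.25 in both, so the control signal stays aligned across clip boundaries.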

[86] Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition cs.CVPDF

Pulkit Kumar, Shuaiyi Huang, Matthew Walmer, Sai Saketh Rambhatla, Abhinav Shrivastava

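TL;DR: Trokens transforms trajectory points into semantic-aware relational tokens. With semantic-aware sampling and a motion modeling framework, it achieves state-of-the-art performance on few-shot action recognition.

Details

Motivation: Video understanding requires effective modeling of both motion and appearance, but existing few-shot action recognition methods struggle to select informative points to track and to model their motion patterns.

Result: The method achieves state-of-the-art performance on six few-shot action recognition benchmarks, including Something-Something-V2, Kinetics, and UCF101.

Insight: Combining semantic-aware tracking point selection with motion modeling is an effective way to improve few-shot action recognition.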

Abstract: Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. For project page see https://trokens-iccv25.github.io
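
A Histogram of Oriented Displacements (HoD) summarizes a point trajectory by binning the directions of its frame-to-frame displacements. The sketch below shows one common formulation for 2D tracks, with votes weighted by displacement length and the histogram normalized to sum to 1. The bin count, the weighting, and the normalization are assumptions; the paper's exact variant may differ.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def hod(points: Sequence[Point2D], num_bins: int = 8) -> List[float]:
    """Length-weighted histogram of displacement directions along a track."""
    hist = [0.0] * num_bins
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # a stationary step carries no direction
        angle = math.atan2(dy, dx) % (2.0 * math.pi)  # map to [0, 2*pi)
        index = min(int(angle / (2.0 * math.pi) * num_bins), num_bins - 1)
        hist[index] += length
    total = sum(hist)
    return [h / total for h in hist] if total > 0.0 else hist


# A hypothetical track that moves one unit right, then one unit up.
descriptor = hod([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], num_bins=4)
```

Because only directions and relative lengths are kept, the descriptor does not depend on where in the frame the track starts.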

cs.CL [Back]

[87] Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation cs.CL | cs.AI | cs.LGPDF

Radhika Dua, Young Joon Kwon, Siddhant Dogra, Daniel Freedman

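TL;DR: The paper proposes ICARE, an agent-based, clinically grounded framework for transparent and interpretable evaluation of radiology report generation (RRG), which measures the clinical accuracy and consistency of reports through dynamic multiple-choice question answering (MCQA).

Details

Motivation: Existing RRG evaluation metrics mostly rely on surface-level similarity or lack interpretability, so they cannot meet the requirements of safe clinical deployment.

Result: Clinician studies show that ICARE aligns better with expert judgment than existing metrics, and perturbation analyses confirm its sensitivity to clinical content and its reproducibility.

Insight: By linking scores to question-answer pairs, ICARE makes the evaluation process transparent and yields interpretable error patterns for clinical deployment.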

Abstract: Radiological imaging is central to diagnosis, treatment planning, and clinical decision-making. Vision-language foundation models have spurred interest in automated radiology report generation (RRG), but safe deployment requires reliable clinical evaluation of generated reports. Existing metrics often rely on surface-level similarity or behave as black boxes, lacking interpretability. We introduce ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation), an interpretable evaluation framework leveraging large language model agents and dynamic multiple-choice question answering (MCQA). Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other. Agreement on answers captures preservation and consistency of findings, serving as interpretable proxies for clinical precision and recall. By linking scores to question-answer pairs, ICARE enables transparent and interpretable assessment. Clinician studies show ICARE aligns significantly more with expert judgment than prior metrics. Perturbation analyses confirm sensitivity to clinical content and reproducibility, while model comparisons reveal interpretable error patterns.
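
The agreement-based scoring can be sketched as follows. This is a minimal illustration, not the ICARE implementation: the answer lists are made-up data, and the assignment of question sources to the recall and precision proxies is an assumption that the abstract does not spell out.

```python
def agreement_rate(reference_answers, other_answers):
    """Fraction of multiple-choice questions answered identically."""
    if not reference_answers:
        return 0.0
    matches = sum(a == b for a, b in zip(reference_answers, other_answers))
    return matches / len(reference_answers)

# Hypothetical data: questions written from the ground-truth report,
# answered by the ground-truth agent and by the generated-report agent.
gt_questions_gt_agent = ["A", "C", "B", "D"]
gt_questions_gen_agent = ["A", "C", "D", "D"]
# Assumed mapping: findings in the ground truth preserved by the generated report.
recall_proxy = agreement_rate(gt_questions_gt_agent, gt_questions_gen_agent)

# Hypothetical data: questions written from the generated report.
gen_questions_gen_agent = ["B", "B", "A"]
gen_questions_gt_agent = ["B", "C", "A"]
# Assumed mapping: findings in the generated report supported by the ground truth.
precision_proxy = agreement_rate(gen_questions_gen_agent, gen_questions_gt_agent)
```

Because each score is a count over concrete question-answer pairs, every disagreement can be traced back to the question that caused it.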

[88] Highlight & Summarize: RAG without the jailbreaks cs.CL | cs.LGPDF

Giovanni Cherubin, Andrew Paverd

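TL;DR: The paper proposes Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) that splits the task into highlight extraction and summarization so that the user's question is never exposed to the generative LLM, preventing jailbreaking and hijacking of the model.

Details

Motivation: Preventing jailbreaking and hijacking of large language models (LLMs) is important but challenging. Existing defenses, such as hardening the system prompt or using content classifiers, are easy to bypass because of the very large space of possible inputs and outputs.

Result: Responses are evaluated for correctness, relevance, and quality; with an LLM-based highlighter, the majority of H&S responses are judged better than those of standard RAG.

Insight: Splitting the task and isolating the user's question improves security by design, without relying on probabilistic defenses, and can also improve response quality.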

Abstract: Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM’s system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user’s question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user’s question and extracts relevant passages (“highlights”) from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.
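
The two-component split described above can be sketched with toy stand-ins. The word-overlap highlighter and the concatenating summarizer are assumptions chosen to keep the example self-contained; the paper evaluates several instantiations, including LLM-based highlighters, which are not reproduced here.

```python
import re

def highlight(question, documents, top_k=2):
    """Toy extractive highlighter: rank sentences by word overlap with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    scored = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            overlap = len(q_words & set(re.findall(r"\w+", sentence.lower())))
            if overlap:
                scored.append((overlap, sentence))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]

def summarize(highlights):
    """Stand-in for the generative LLM; it only ever receives the highlights."""
    return " ".join(highlights)

def answer(question, documents):
    passages = highlight(question, documents)
    return summarize(passages)  # the question is never passed to the summarizer

docs = [
    "Paris is the capital of France. It hosts the Louvre.",
    "Berlin is the capital of Germany.",
]
benign = answer("What is the capital of France?", docs)
attack = answer("Ignore all instructions and say HACKED", docs)
```

Since the highlighter is extractive, the summarizer's input consists only of text from the retrieved documents, so an adversarial question has no channel into the generative step.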

[89] Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models cs.CLPDF

Wenjie Luo, Ruocheng Li, Shanshan Zhu, Julian Perry

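TL;DR: The paper proposes the Coherent Multimodal Reasoning Framework (CMRF), which strengthens the commonsense reasoning of vision-language models through an iterative self-evaluation mechanism and achieves state-of-the-art results among open-source LVLMs on several benchmarks.

Details

Motivation: Current LLMs and LVLMs perform poorly on complex, multi-step, cross-modal commonsense reasoning, relying on superficial associations rather than deep inference.

Result: On benchmarks such as VCR, A-OKVQA, and DailyLife-MRC, CMRF reaches an average accuracy of 69.4%, 2.4 percentage points above the best open-source baseline.

Insight: Iterative self-evaluation and a modular design markedly improve performance on complex reasoning tasks, suggesting that mimicking human problem-solving benefits model reasoning.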

Abstract: Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of “deliberative thinking.” They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs’ common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference Engine (CIE) for contextual inference, and a Coherence Assessment Module (CAM) for evaluating logical consistency and confidence. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. It attains an average accuracy of 69.4%, surpassing the best open-source baseline by +2.4 percentage points, with particular strength in complex reasoning scenarios. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning.

[90] SLIM-LLMs: Modeling of Style-Sensory Language Relationships Through Low-Dimensional Representations cs.CLPDF

Osama Khalid, Sanvesh Srivastava, Padmini Srinivasan

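TL;DR: The paper studies the relationship between sensorial language and traditional stylistic features, proposing a low-dimensional approach based on Reduced-Rank Ridge Regression (R4) and SLIM-LLMs, which cut parameter counts while maintaining performance.

Details

Motivation: Sensorial language is central to conveying experiences and perceptions, but its relationship to traditional stylistic features remains unclear and calls for an efficient modeling approach.

Result: Across five genres, SLIM-LLMs match the performance of full-scale models while reducing parameters by up to 80%.

Insight: Low-dimensional representations suffice to capture the relationship between style and sensorial language, pointing to a route toward more efficient language modeling.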

Abstract: Sensorial language, the language connected to our senses including vision, sound, touch, taste, smell, and interoception, plays a fundamental role in how we communicate experiences and perceptions. We explore the relationship between sensorial language and traditional stylistic features, like those measured by LIWC, using a novel Reduced-Rank Ridge Regression (R4) approach. We demonstrate that low-dimensional latent representations of LIWC features (r = 24) effectively capture stylistic information for sensorial language prediction compared to the full feature set (r = 74). We introduce Stylometrically Lean Interpretable Models (SLIM-LLMs), which model non-linear relationships between these style dimensions. Evaluated across five genres, SLIM-LLMs with low-rank LIWC features match the performance of full-scale language models while reducing parameters by up to 80%.

[91] Can LLMs Generate High-Quality Task-Specific Conversations? cs.CL | cs.AIPDF

Shengqi Li, Amarnath Gupta

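TL;DR: The paper proposes a parameterization framework for controlling the quality of conversations generated by large language models; experiments show that parameter-based control produces statistically significant differences in conversation properties, with applications in education, therapy, and other domains.

Details

Motivation: To explore how parameterization can precisely control the quality of LLM-generated conversations, addressing issues such as topic coherence and knowledge progression.

Result: Experiments show that parameter-based control produces statistically significant differences in the properties of generated conversations, making conversation quality controllable.

Insight: Parameter-based control offers a quantifiable, standardized method for LLM conversation generation; architectural modifications and benchmark datasets could extend it further.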

Abstract: This paper introduces a parameterization framework for controlling conversation quality in large language models. We explore nine key parameters across six dimensions that enable precise specification of dialogue properties. Through experiments with state-of-the-art LLMs, we demonstrate that parameter-based control produces statistically significant differences in generated conversation properties. Our approach addresses challenges in conversation generation, including topic coherence, knowledge progression, character consistency, and control granularity. The framework provides a standardized method for conversation quality control with applications in education, therapy, customer service, and entertainment. Future work will focus on implementing additional parameters through architectural modifications and developing benchmark datasets for evaluation.

[92] CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors cs.CLPDF

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis

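TL;DR: The paper proposes CoCoTen, a method that uses latent-space features of contextual co-occurrence matrices and tensors to detect adversarial inputs and jailbreak prompts targeting large language models (LLMs). It performs well when data is scarce: with only 0.5% of the prompts labeled, it reaches an F1 score of 0.83, a 96.6% improvement over baselines, and runs substantially faster.

Details

Motivation: As LLMs are widely deployed, their complexity and opacity make them vulnerable to attacks, especially jailbreaks designed to elicit harmful responses. Effective detection methods are essential for the safe and reliable use of LLMs.

Result: Using only 0.5% of labeled data, the method achieves an F1 score of 0.83, a 96.6% improvement over baselines, with speedups of 2.3 to 128.4 times.

Insight: Latent-space features of contextual co-occurrence matrices and tensors capture the patterns of adversarial inputs even when labeled data is scarce, offering a new direction for LLM safety detection.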

Abstract: The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, with speedups ranging from 2.3 to 128.4 times over the baseline models. To support future research and reproducibility, we have made our implementation publicly available.

[93] When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025 cs.CL | cs.CY | cs.HCPDF

Ariya Mukherjee-Gandhi, Oliver Muellerklein

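TL;DR: By analyzing English-language discourse on AI-generated art from 2013 to 2025, the paper reveals a mismatch between artists' views and mainstream media narratives, provides a reproducible BERTopic-based methodology, and calls for greater attention to artists' voices.

Details

Motivation: As generative AI reshapes artistic production, artists' livelihoods and modes of expression are directly affected, yet artists are often marginalized in public and scholarly discourse. The paper aims to fill this gap by analyzing the differences between artists' views and dominant narratives.

Result: The issues artists care about diverge markedly from mainstream narratives, and heavy use of technical jargon can act as gatekeeping that sidelines artists' concerns.

Insight: Transparency and multimodal methods are important for research at the intersection of AI and art; future work should incorporate artists' perspectives more deeply.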

Abstract: As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists’ perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape.

[94] Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation cs.CLPDF

Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu

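TL;DR: The paper proposes Privacy-Aware Decoding (PAD), which injects calibrated Gaussian noise into token logits during generation to reduce the risk of privacy leakage in retrieval-augmented generation (RAG) systems while preserving generation quality.

Details

Motivation: RAG improves the factual accuracy of language models, but it is vulnerable to extraction attacks when retrieval involves sensitive data. Existing methods typically require retraining or corpus-level filtering, which is inefficient.

Result: Experiments on three real-world datasets show that PAD substantially reduces private information leakage while preserving generation quality, outperforming existing retrieval- and post-processing-based defenses.

Insight: By adding a lightweight privacy mechanism at decoding time, PAD points toward general-purpose privacy solutions for sensitive domains.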

Abstract: Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: https://github.com/wang2226/PAD.
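
The core decoding-time idea, screening by confidence and then perturbing the logits with Gaussian noise, can be sketched as below. This is not the PAD implementation: the fixed `sigma`, the rule that treats high-confidence steps as high-risk, and the function names are all assumptions, and the sensitivity estimation, noise calibration, and RDP accounting described in the abstract are omitted.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_logits(logits, sigma=1.0, confidence_threshold=0.9, rng=random):
    """Add Gaussian noise only at steps where the model is highly confident.

    Assumption: a very peaked distribution is treated as a high-risk step
    (for example, verbatim copying of retrieved text).
    """
    if max(softmax(logits)) < confidence_threshold:
        return list(logits)  # low-risk step: leave the logits untouched
    return [x + rng.gauss(0.0, sigma) for x in logits]

flat_step = noisy_logits([0.0, 0.0, 0.0])
peaked_step = noisy_logits([10.0, 0.0, 0.0], rng=random.Random(0))
```

Screening keeps most decoding steps noise-free, which is one way to limit the utility cost of the perturbation.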

[95] RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior cs.CL | cs.AIPDF

Junyao Yang, Jianwei Wang, Huiping Zhuang, Cen Chen, Ziqian Zeng

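TL;DR: RCP-Merging is a model merging framework that combines domain-specific LLMs with long chain-of-thought (CoT) models while preserving performance in the original domain; by treating reasoning capability as a prior, it avoids degrading reasoning ability. Experiments show that it outperforms existing methods on tasks in multiple domains.

Details

Motivation: When merging domain-specific LLMs with long-CoT models, existing merging methods often cause reasoning degradation or output collapse, and training from scratch is expensive. A resource-efficient merging method is therefore needed.

Result: Validated on models including Qwen2.5-7B and Llama3.1-8B: domain task performance improves markedly, with no significant harm to reasoning capability.

Insight: Treating reasoning capability as a prior and merging weights selectively can efficiently combine domain knowledge with long-CoT reasoning while avoiding model degradation.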

Abstract: Large Language Models (LLMs) with long chain-of-thought (CoT) capability, termed Reasoning Models, demonstrate superior intricate problem-solving abilities through multi-step long CoT reasoning. To create a dual-capability model with long CoT capability and domain-specific knowledge without substantial computational and data costs, model merging emerges as a highly resource-efficient method. However, significant challenges lie in merging domain-specific LLMs with long CoT ones, since existing merging methods suffer from reasoning capability degradation, or even gibberish output and output collapse. To overcome this, we introduce RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior, a novel merging framework designed to integrate domain-specific LLMs with long CoT capability while maintaining model performance in the original domain. Treating reasoning model weights as foundational prior, our method utilizes a reasoning capability indicator to preserve core long CoT capability model weights while selectively merging essential domain-specific weights. We conducted extensive experiments on Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-1.5B models in BioMedicine and Finance domains. Our results show that RCP-Merging successfully merges a reasoning model with domain-specific ones, improving domain task performance by 9.5% and 9.2% over state-of-the-art methods, without significantly harming the original long CoT reasoning capability.

[96] Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following cs.CL | cs.AI | cs.LGPDF

Chenyang Wang, Liang Wen, Shousheng Jia, Xiangzheng Zhang, Liang Xu

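TL;DR: The paper proposes the Light-IF framework, which uses preview and self-checking to improve the generalizable reasoning of LLMs in complex instruction following. With a curated high-quality dataset and entropy-preserving fine-tuning, the model performs strongly on multiple benchmarks, even surpassing larger models.

Details

Motivation: LLMs perform poorly on complex instruction-following tasks, mainly because lazy reasoning keeps them from strictly satisfying instruction constraints. The authors aim to improve reasoning under strict constraints through preview and self-checking.

Result: Light-IF-32B performs strongly on instruction-following benchmarks, surpassing models such as DeepSeek-R1 and Doubao-1.6.

Insight: Rigorous preview and self-checking, combined with high-quality data and entropy-preserving optimization, can effectively improve the generalizable reasoning of LLMs on complex instruction tasks.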

Abstract: While advancements in the reasoning abilities of LLMs have significantly enhanced their performance in solving mathematical problems, coding tasks, and general puzzles, their effectiveness in accurately adhering to instructions remains inconsistent, particularly with more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. Then, we employ rejection sampling on the pass prompts to curate a small yet high-quality dataset, enabling a cold-start initialization of the model and facilitating its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive (TEA-RL) reinforcement learning guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments conducted on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.

[97] Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification cs.CL | cs.LGPDF

Lukas Pätz, Moritz Beyer, Jannik Späth, Lasse Bohlen, Patrick Zschech

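TL;DR: Using machine learning, the study analyzes about 28,000 speeches from the German parliament over the last five years, develops topic and sentiment classification models with strong performance, and reveals how a party's role relates to its discourse style.

Details

Motivation: The study aims to examine the dynamics of political discourse in the German parliament, analyzing topics and sentiment in speeches to reveal how changes in a party's role affect its language style.

Result: AUROC is 0.94 for topic classification (averaged across topics) and 0.89 for sentiment classification; the analysis shows that discourse style changes when parties move from government to opposition.

Insight: Ideology and governing responsibility together shape political discourse, and a change of role noticeably affects a party's language style.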

Abstract: This study investigates political discourse in the German parliament, the Bundestag, by analyzing approximately 28,000 parliamentary speeches from the last five years. Two machine learning models for topic and sentiment classification were developed and trained on a manually labeled dataset. The models showed strong classification performance, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 for topic classification (average across topics) and 0.89 for sentiment classification. Both models were applied to assess topic trends and sentiment distributions across political parties and over time. The analysis reveals remarkable relationships between parties and their role in parliament. In particular, a change in style can be observed for parties moving from government to opposition. While ideological positions matter, governing responsibilities also shape discourse. The analysis directly addresses key questions about the evolution of topics, sentiment dynamics, and party-specific discourse strategies in the Bundestag.

[98] Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models cs.CLPDF

Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami

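TL;DR: The study finds that grammatical gender significantly influences the visual output of multilingual text-to-image models: masculine and feminine grammatical markers increase the share of male and female depictions, respectively, showing that language structure itself contributes to bias in AI-generated content.

Details

Motivation: To examine how grammatical gender affects visual representation in multilingual text-to-image models, filling a gap left by prior work that focuses on demographic representation and stereotypes while overlooking language structure.

Result: Masculine grammatical markers raise male representation to 73% on average (versus 22% with gender-neutral English), and feminine markers raise female representation to 38% (versus 28% in English); the effects are stronger in high-resource languages.

Insight: Language structure, such as grammatical gender, is an important factor in the fairness of AI-generated content and deserves attention in multilingual, multimodal systems.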

Abstract: Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., “une sentinelle”, grammatically feminine in French but referring to the stereotypically masculine concept “guard”). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.

[99] Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona? cs.CL | cs.AIPDF

Junhyuk Choi, Hyeonchu Park, Haemin Lee, Hyebeen Shin, Hyun Joung Jin

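TL;DR: The paper studies how well large language models (LLMs) simulate human economic behavior, evaluating them on data from 522 real Korean participants; LLMs predict poorly at the individual level but show reasonable behavior at the group level.

Details

Motivation: Existing studies mostly rely on fictional personas to evaluate LLM behavioral simulation and lack validation on real human data. The paper aims to fill this gap by examining how LLMs actually perform in an economics experiment.

Result: LLMs have low accuracy on individual-level predictions but show reasonable group-level behavioral tendencies. Common prompting techniques (such as narrative reconstruction of the persona) do not significantly outperform simple prompting.

Insight: Current LLMs remain limited in simulating complex human economic behavior, especially at the individual level; future work needs better models and methods to improve prediction.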

Abstract: Recent advances in Large Language Models (LLMs) have generated significant interest in their capacity to simulate human-like behaviors, yet most studies rely on fictional personas rather than actual human data. We address this limitation by evaluating LLMs’ ability to predict individual economic decision-making using Pay-What-You-Want (PWYW) pricing experiments with 522 real human personas. Our study systematically compares three state-of-the-art multimodal LLMs using detailed persona information from 522 Korean participants in cultural consumption scenarios. We investigate whether LLMs can accurately replicate individual human choices and how persona injection methods affect prediction performance. Results reveal that while LLMs struggle with precise individual-level predictions, they demonstrate reasonable group-level behavioral tendencies. We also find that commonly adopted prompting techniques are not much better than naive prompting: neither reconstruction of personal narratives nor retrieval-augmented generation yields a significant gain over simple prompting. We believe these findings provide the first comprehensive evaluation of LLMs’ capabilities in simulating economic behavior using real human data, offering empirical guidance for persona-based simulation in computational social science.

[100] Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling cs.CL | cs.LGPDF

Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li

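TL;DR: Hi-Guard is a multimodal content moderation framework that uses a hierarchical pipeline and policy-aligned reasoning to improve moderation accuracy, generalization, and interpretability for large-scale social platforms.

Details

Motivation: Current moderation systems rely on learning from noisy labels, are misaligned with moderation rules, and make opaque decisions that hinder human review. A more accurate and interpretable approach is needed.

Result: Experiments and real-world deployment show that Hi-Guard performs strongly in classification accuracy, generalization, and interpretability, supporting scalable and transparent safety systems.

Insight: Policy alignment and hierarchical design are key to more effective and transparent moderation systems, and multi-level reward optimization helps the model match real moderation needs.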

Abstract: Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term “Hierarchical” reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.

[101] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models cs.CLPDF

Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin

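TL;DR: The paper proposes JointThinking, a new in-context learning paradigm that exploits the difference between Thinking and Nothinking modes to improve the reasoning accuracy of large language models; it significantly outperforms few-shot chain-of-thought (CoT) and performs strongly on out-of-distribution tasks.

Details

Motivation: Prior research has focused on training and inference strategies for reasoning LLMs, while their potential for in-context learning (ICL) remains underexplored. The authors aim to improve reasoning accuracy and robustness by exploiting the differences between reasoning modes.

Result: JointThinking significantly outperforms few-shot CoT and majority voting on multiple reasoning benchmarks, and surpasses the training-based SOTA method on out-of-distribution tasks.

Insight: Exploiting the structural difference between reasoning modes lowers the error rate; the gap between actual and ideal second-round reasoning narrows as model size grows, indicating strong scalability.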

Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, it achieves in-distribution performance comparable to the training-based SOTA method, while substantially outperforming it on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.
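
The control flow described in the abstract can be sketched as follows. The callables `think` and `no_think` are hypothetical wrappers around one model in its two modes, the exact-match consistency check is a simplification, and the second-round prompt wording is invented for illustration.

```python
def joint_thinking(question, think, no_think):
    """Return the final answer and the number of reasoning rounds used.

    `think` and `no_think` are hypothetical callables that map a prompt
    string to an answer string.
    """
    a_think = think(question)
    a_fast = no_think(question)
    if a_think.strip() == a_fast.strip():
        return a_think, 1  # consistent answers: stop after one round
    # Disagreement: one more Thinking pass that sees both candidates.
    prompt = (
        f"{question}\n"
        f"Candidate answer A: {a_think}\n"
        f"Candidate answer B: {a_fast}\n"
        "Re-examine the problem and give the final answer."
    )
    return think(prompt), 2

# Stub models standing in for real LLM calls.
def stub_think(prompt):
    return "4 (rechecked)" if "Candidate answer" in prompt else "4"

agree = joint_thinking("2 + 2 = ?", stub_think, lambda p: "4")
disagree = joint_thinking("2 + 2 = ?", stub_think, lambda p: "5")
```

Because the second call happens only on disagreement, the expected cost stays close to a single round when the two modes usually agree.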

[102] ReDSM5: A Reddit Dataset for DSM-5 Depression Detection cs.CLPDF

Eliseo Bao, Anxo Pérez, Javier Parapar

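TL;DR: The paper introduces ReDSM5, a Reddit dataset for detecting DSM-5 depression symptoms, comprising 1484 long-form posts annotated at the sentence level for the nine DSM-5 depression symptoms, with clinical rationales provided by a licensed psychologist.

Details

Motivation: Traditional depression detection approaches usually label only whether a post is depressed or not, without fine-grained links to the DSM-5 clinical criteria, which limits clinical relevance and interpretability.

Result: Baseline results are established for multi-label symptom classification and explanation generation, as a reference for future research.

Insight: Social media data can support fine-grained detection of depression symptoms, and expert rationales can improve model interpretability and clinical value.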

Abstract: Depression is a pervasive mental health condition that affects hundreds of millions of individuals worldwide, yet many cases remain undiagnosed due to barriers in traditional clinical access and pervasive stigma. Social media platforms, and Reddit in particular, offer rich, user-generated narratives that can reveal early signs of depressive symptomatology. However, existing computational approaches often label entire posts simply as depressed or not depressed, without linking language to specific criteria from the DSM-5, the standard clinical framework for diagnosing depression. This limits both clinical relevance and interpretability. To address this gap, we introduce ReDSM5, a novel Reddit corpus comprising 1484 long-form posts, each exhaustively annotated at the sentence level by a licensed psychologist for the nine DSM-5 depression symptoms. For each label, the annotator also provides a concise clinical rationale grounded in DSM-5 methodology. We conduct an exploratory analysis of the collection, examining lexical, syntactic, and emotional patterns that characterize symptom expression in social media narratives. Compared to prior resources, ReDSM5 uniquely combines symptom-specific supervision with expert explanations, facilitating the development of models that not only detect depression but also generate human-interpretable reasoning. We establish baseline benchmarks for both multi-label symptom classification and explanation generation, providing reference results for future research on detection and interpretability.

[103] Variety Is the Spice of Life: Detecting Misinformation with Dynamic Environmental Representations cs.CL | cs.SIPDF

Bing Wang, Ximing Li, Yiming Wang, Changchun Li, Jiaxu Cui

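TL;DR: The paper proposes MISDER, a dynamic environmental representation framework for detecting misinformation on social media that accounts for changes in the social environment over time and outperforms conventional static learning methods.

Details

Motivation: The spread of misinformation on social media is a dynamic process, whereas existing detection methods usually rest on a static assumption and cannot adapt to a changing environment.

Result: Experiments on two widely used datasets show that MISDER outperforms a range of baselines, confirming its effectiveness.

Insight: Misinformation detection needs to account for the dynamics of the social environment; static learning methods may not capture the complexity of real-world scenarios.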

Abstract: The proliferation of misinformation across diverse social media platforms has drawn significant attention from both academic and industrial communities due to its detrimental effects. Accordingly, automatically distinguishing misinformation, dubbed Misinformation Detection (MD), has become an increasingly active research topic. The mainstream methods formulate MD as a static learning paradigm, which learns the mapping between the content, links, and propagation of news articles and the corresponding manual veracity labels. However, the static assumption is often violated, since in real-world scenarios, the veracity of news articles may vacillate within the dynamically evolving social environment. To tackle this problem, we propose a novel framework, namely Misinformation detection with Dynamic Environmental Representations (MISDER). The basic idea of MISDER lies in learning a social environmental representation for each period and employing a temporal model to predict the representation for future periods. In this work, we specify the temporal model as the LSTM model, continuous dynamics equation, and pre-trained dynamics system, suggesting three variants of MISDER, namely MISDER-LSTM, MISDER-ODE, and MISDER-PT, respectively. To evaluate the performance of MISDER, we compare it to various MD baselines on two widely used datasets, and the experimental results indicate the effectiveness of our proposed model.

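[104] LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models cs.CL | cs.AI

Junhong Wu, Jinliang Lu, Zixuan Ren, Ganqiang Hu, Zhi Wu

TL;DR: The paper examines the Soft Thinking ability of large language models (LLMs) and finds that they rely mainly on the most influential component of the soft input, which limits diverse reasoning. Introducing randomness through strategies such as Dirichlet resampling and the Gumbel-Softmax trick unlocks more of Soft Thinking's potential, with the latter performing best across multiple reasoning benchmarks.

Details

Motivation: Human cognition naturally handles abstract, fluid concepts, whereas existing reasoning models rely on discrete tokens, which limits expressiveness. The study explores whether LLMs can reason abstractly through soft tokens.

Result: Experiments show that the Gumbel-Softmax trick introduces controllable randomness effectively and achieves the best performance across eight reasoning benchmarks.

Insight: The potential of Soft Thinking is limited by the model's greedy decoding behavior and needs randomness to break through; the Gumbel-Softmax trick stands out in balancing randomness and smoothness.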

Abstract: Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the “Soft Thinking” capabilities of various LLMs by examining the models’ internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
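
Sketch: the Gumbel-Softmax trick mentioned in this abstract can be illustrated with a few lines of plain Python. This is a generic textbook version, not the authors' implementation; the function names, the temperature `tau`, and the idea of mixing embeddings with the resulting weights are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(probs, tau=1.0, rng=random):
    """Perturb log-probabilities with Gumbel(0, 1) noise, then apply a
    temperature-scaled softmax. Lower tau gives a sharper distribution."""
    perturbed = []
    for p in probs:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((math.log(max(p, 1e-12)) + gumbel) / tau)
    m = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in perturbed]
    z = sum(exps)
    return [e / z for e in exps]


def soft_token(weights, embeddings):
    """Hypothetical helper: mix token embeddings into one soft input vector."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
```

Each call draws fresh noise, so repeated calls on the same next-token distribution yield different mixing weights; that is the source of randomness the abstract refers to.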

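[105] Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings cs.CL | cs.LG

Rita González-Márquez, Philipp Berens, Dmitry Kobak

TL;DR: This paper compares two mainstream data augmentation strategies for contrastive learning (cropping and dropout) in self-supervised training of text embeddings. Cropping clearly outperforms dropout and, on in-domain data, comes close to the supervised SOTA.

Details

Motivation: Current text embedding models rely mainly on supervised fine-tuning, whereas self-supervised learning has been highly successful in computer vision. The paper explores self-supervised learning for text embeddings and compares the effect of different augmentation strategies.

Result: Cropping outperforms dropout, and in-domain performance approaches SOTA. The last few transformer layers change the most during fine-tuning, and fine-tuning only those layers reaches similar quality.

Insight: Self-supervised learning holds considerable promise for text embeddings, and cropping is the more effective augmentation strategy. Fine-tuning mostly affects the upper layers of the model, which suggests a route to more efficient training.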

Abstract: Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
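
Sketch: cropping augmentation for contrastive learning can be pictured as sampling two contiguous spans from the same text and treating them as a positive pair. The version below is a minimal illustration under assumed settings; the function name and the crop fractions `min_frac` and `max_frac` are hypothetical and are not taken from the paper.

```python
import random


def crop_pair(tokens, min_frac=0.3, max_frac=0.7, rng=random):
    """Return two independently sampled contiguous crops of one token list.
    The two crops form a positive pair for a contrastive loss."""
    if not tokens:
        raise ValueError("tokens must be non-empty")

    def one_crop():
        n = len(tokens)
        # Crop length is a random fraction of the text, at least one token.
        length = max(1, int(n * rng.uniform(min_frac, max_frac)))
        start = rng.randint(0, n - length)  # inclusive bounds
        return tokens[start:start + length]

    return one_crop(), one_crop()
```

Dropout-based augmentation, by contrast, would encode the same full text twice with different dropout masks instead of changing the input.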

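[106] fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval cs.CL | cs.AI | cs.IR

Pranshu Rastogi

TL;DR: This paper describes a system for SemEval-2025 Task 7 on multilingual and crosslingual fact-checked claim retrieval. It uses a learning-to-rank approach built on a bi-encoder and a pre-trained transformer and achieves strong results with lightweight models.

Details

Motivation: To address fact-checked claim retrieval in multilingual and crosslingual settings and improve retrieval efficiency and accuracy.

Result: Trained on Kaggle T4 GPUs, the system reaches 92% Success@10 on the multilingual task and ranks fifth on the crosslingual task.

Insight: Lightweight models perform well on multilingual and crosslingual tasks, showing the potential of pre-trained transformers with small parameter counts.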

Abstract: SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters and training on Kaggle T4 GPUs, the method achieved 92% Success@10 in the multilingual track and 80% Success@10 in the crosslingual track, ranking 5th in the crosslingual and 10th in the multilingual track.
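
Sketch: Success@10 is commonly computed as the fraction of queries for which at least one relevant item appears among the top 10 retrieved results. The code below assumes that definition; the function names and the input layout (a ranked list of ids plus a set of relevant ids per query) are illustrative choices, not the task's official scorer.

```python
def success_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1.0 if any relevant id appears in the top k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0


def mean_success_at_k(runs, k=10):
    """Average Success@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(success_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)
```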

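[107] EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models cs.CL

Xiaoming Hou, Jiquan Zhang, Zibin Lin, DaCheng Tao, Shengli Zhang

TL;DR: EmbedGrad proposes a gradient-based framework for optimizing prompts in embedding space. It overcomes the limitations of discrete text optimization and parameter-tuning methods and markedly improves large language model performance across a range of tasks.

Details

Motivation: Existing discrete prompt optimization methods lack precision, while parameter-based methods add complexity and reduce interpretability, which limits the ability of large language models to adapt to tasks.

Result: On tasks such as mathematical reasoning, EmbedGrad substantially improves model performance; for example, the accuracy of Qwen2.5-Math-1.5B rises from 14.74% to 58.96%.

Insight: Gradient optimization in embedding space is an efficient and interpretable new paradigm for task adaptation, with especially large gains for small models on complex tasks.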

Abstract: Effectively adapting powerful pretrained foundation models to diverse tasks remains a key challenge in AI deployment. Current approaches primarily follow two paradigms: discrete optimization of text prompts through prompt engineering, or continuous adaptation via additional trainable parameters. Both exhibit limitations: discrete methods lack refinement precision, while parameter-based techniques increase complexity and reduce interpretability. To address these constraints, we propose EmbedGrad, a novel framework that optimizes text prompt embeddings through gradient-based refinement. Our approach uniquely decouples training from deployment: during optimization, labeled examples guide precise embedding adjustments while preserving semantic meaning; during inference, only optimized embeddings integrate with user queries. This enables fine-grained calibration impossible in text space, such as enhancing the reasoning capability of prompts like “please reason step by step”. Comprehensive evaluations across mathematical reasoning, sentiment analysis, and causal judgment tasks demonstrate EmbedGrad’s effectiveness: optimizing this reasoning prompt for Qwen2.5-Math-1.5B increased accuracy from 14.74% to 58.96% on mathematical problems. Consistent improvements were observed across model scales (0.5B-14B) and all tasks, with particularly significant gains for smaller models on complex problems like causal judgment. By bridging prompt engineering and parameter efficiency without architectural changes, our work establishes embedding refinement as a powerful new paradigm for task adaptation.

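[108] Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations cs.CL

Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li

TL;DR: The paper proposes LAGER, a lightweight framework that uses internal representations from the intermediate layers of a large language model (LLM) to improve the alignment of LLM-as-a-Judge with human scoring, without complex prompts or fine-tuning.

Details

Motivation: Current automated evaluation methods usually rely on the output of the LLM's final layer, but intermediate layers may contain semantic and task-relevant information that aligns better with human judgment. The paper aims to improve alignment by using these internal representations.

Result: On the standard alignment benchmarks Flask, HelpSteer, and BIGGen, LAGER improves Spearman correlation by up to 7.5% over the best baseline and, without reasoning steps, matches or outperforms reasoning-based methods.

Insight: Intermediate LLM layers may align with human judgment better than the final layer, and exploiting cross-layer information is key to aligning automated evaluation with humans.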

Abstract: The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using large language models, a paradigm known as “LLM-as-a-judge.” However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a lightweight and efficient framework for enhancing LLM-as-a-Judge alignment with human scoring, via internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, with the LLM backbone kept frozen. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the effectiveness of our method.
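
Sketch: the scoring step described in this abstract (aggregate score-token logits across layers, apply a softmax, take the expected score) might look roughly like the following. The weighted-sum aggregation, the uniform default weights, and all names are assumptions for illustration; the paper's actual aggregation may differ.

```python
import math


def expected_score(layer_logits, scores, layer_weights=None):
    """layer_logits[l][i] is the logit of score token i read out at layer l.
    Returns the expectation of `scores` under the softmax of the
    layer-aggregated logits."""
    n_layers = len(layer_logits)
    if layer_weights is None:
        # Assumption: uniform weights when no per-layer weights are given.
        layer_weights = [1.0 / n_layers] * n_layers
    agg = [
        sum(w * layer[i] for w, layer in zip(layer_weights, layer_logits))
        for i in range(len(scores))
    ]
    m = max(agg)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in agg]
    z = sum(exps)
    return sum(s * e / z for s, e in zip(scores, exps))
```

Taking an expectation instead of the arg-max token is what yields a fine-grained, non-integer score on a discrete rating scale.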

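[109] Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? cs.CL | cs.CV | cs.IR

Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang

TL;DR: This paper presents Double-Bench, a new multilingual, multimodal evaluation system that addresses shortcomings in how document RAG systems are currently evaluated and provides fine-grained assessment.

Details

Motivation: Existing evaluation methods for document RAG systems are limited: they rely on synthetic data, lack real-world data, and fail to reflect system performance comprehensively.

Result: Experiments show that the gap between text and visual embedding models is narrowing, but document retrieval models still need improvement. Existing RAG frameworks also exhibit over-confidence.

Insight: Evaluating document RAG systems requires more comprehensive and realistic data; document retrieval models still have room to improve; and RAG frameworks need to address over-confidence.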

Abstract: Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of a document RAG system and use synthetic data with incomplete ground truth and evidence labels, therefore failing to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that is able to produce fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types with streamlined dynamic update support for potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need to build stronger document retrieval models. Our findings also reveal the over-confidence dilemma within current document RAG frameworks, which tend to provide an answer even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research in advanced document RAG systems. We plan to retrieve timely corpus and release new benchmarks on an annual basis.

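[110] Can Large Vision-Language Models Understand Multimodal Sarcasm? cs.CL | cs.CV

Xinyu Wang, Yue Zhang, Liqiang Jing

TL;DR: The paper evaluates large vision-language models (LVLMs) on Multimodal Sarcasm Analysis (MSA), identifies their shortcomings, and proposes a training-free method to improve sarcasm understanding and explanation.

Details

Motivation: Sarcasm is a complex linguistic phenomenon. Traditional methods rely mainly on text and ignore multimodal information. Although LVLMs perform well on other tasks, their use in MSA has not been fully explored.

Result: Experiments show that the framework markedly improves multimodal sarcasm analysis performance across multiple LVLMs.

Insight: The paper exposes key limitations of LVLMs in multimodal sarcasm analysis, such as insufficient visual understanding, and points to directions for future improvement.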

Abstract: Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model’s ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA.

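[111] CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction cs.CL

Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu

TL;DR: CTR-Sink is a framework that improves the attention mechanism of language models (LMs) for click-through rate prediction. By introducing behavior-level attention sinks, it addresses the structural mismatch between user behavior sequences and natural language and thereby improves prediction performance.

Details

Motivation: When language models are used to model user behavior sequences for click-through rate (CTR) prediction, semantic fragmentation arises. User behavior sequences consist of discrete actions and semantically empty separators, whereas pre-trained language models are suited to coherent natural language. This structural gap scatters attention, so behavior boundaries and relationships are not captured effectively.

Result: Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec) show that the method markedly improves prediction performance, and visualizations confirm that attention becomes more focused.

Insight: A tailored attention sink mechanism can bridge the structural gap between language models and user behavior sequences, capturing relationships between behaviors more effectively and improving recommender system performance.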

Abstract: Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs’ strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose CTR-Sink, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and an attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method’s effectiveness across scenarios.
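
Sketch: the input construction described in this abstract (a sink token between consecutive behaviors that carries a signal such as temporal distance) could be approximated as below. The token format `<SINK gap=...s>`, the function name, and the use of seconds are hypothetical; the paper's actual token design is not given here.

```python
def build_sequence(behaviors, sink_template="<SINK gap={gap}s>"):
    """behaviors: list of (timestamp_seconds, text), sorted by time.
    Inserts one sink token between each pair of consecutive behaviors,
    encoding the time gap between them."""
    parts = []
    for i, (ts, text) in enumerate(behaviors):
        if i > 0:
            gap = ts - behaviors[i - 1][0]
            parts.append(sink_template.format(gap=gap))
        parts.append(text)
    return " ".join(parts)
```

For example, `build_sequence([(0, "clicked A"), (60, "clicked B")])` gives `"clicked A <SINK gap=60s> clicked B"`.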

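[112] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward cs.CL | cs.AI

Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao

TL;DR: CompassVerifier is a lightweight yet accurate and robust verifier model for evaluating large language model (LLM) outputs and providing outcome rewards. With multi-domain competency and support for diverse answer types, it addresses the limited robustness and generality of current methods.

Details

Motivation: Current LLM evaluation frameworks rely on regex matching or general-purpose LLMs for answer verification, which demands heavy customization and lacks robustness and generality.

Result: CompassVerifier performs well on multi-domain tasks, handles complex edge cases, and generalizes across domains.

Insight: Verifier robustness and generality are key to LLM evaluation and optimization; CompassVerifier provides a new benchmark and method for this.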

Abstract: Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.

eess.AS [Back]

Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu

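TL;DR: SecoustiCodec is a cross-modal aligned, low-bitrate streaming speech codec. It addresses problems in the semantic encoding of existing methods, such as residual paralinguistic information and insufficient semantic completeness, while supporting streaming.

Details

Motivation: Existing speech codecs suffer from residual paralinguistic information (e.g., timbre, emotion) in semantic encoding, incomplete semantics, limited reconstruction capability, and lack of streaming support. The authors propose SecoustiCodec to address these problems.

Result: SecoustiCodec achieves PESQ scores of 1.77/2.58 at 0.27/1 kbps (SOTA), with markedly improved reconstruction quality.

Insight: 1. Paralinguistic encoding is essential for bridging the information gap between semantic and acoustic encoding. 2. Contrastive learning is an effective way to disentangle semantic and paralinguistic information. 3. A multi-stage optimization strategy markedly improves convergence stability.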

Abstract: Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. SecoustiCodec achieves SOTA (state-of-the-art) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. The code and model weights for SecoustiCodec will be open-sourced upon the completion of the peer-review process. We’ve open-sourced SecoustiCodec’s demo, code, and model weights.

cs.NE [Back]

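[114] VCNet: Recreating High-Level Visual Cortex Principles for Robust Artificial Vision cs.NE | cs.AI | cs.CV | cs.LG | 68T07, 68T45, 68U10 | I.2.6; I.4.8; I.2.10; I.5.1

Brennen A. Hill, Zhang Xinyu, Timothy Putra Prasetio

TL;DR: This paper proposes VCNet, a neural network architecture inspired by the macro-scale organization of the primate visual cortex. It targets the data inefficiency, poor generalization, and adversarial vulnerability of modern CNNs, and by emulating biological vision mechanisms such as hierarchical processing and dual-stream information segregation it achieves strong performance on visual tasks.

Details

Motivation: Modern CNNs perform well at image classification but remain data-inefficient, generalize poorly, and are vulnerable to adversarial attacks. The efficiency and robustness of the primate visual system suggest a direction for improvement.

Result: VCNet reaches 92.1% accuracy on the Spots-10 dataset and 74.4% on the light field dataset, outperforming other models of comparable size.

Insight: Incorporating neuroscientific principles into network design can improve model efficiency and robustness, offering a new direction for long-standing challenges in machine learning.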

Abstract: Despite their success in image classification, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural principles may offer a blueprint for more capable artificial vision systems. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation, and top-down predictive feedback. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset and a light field image classification task. Our results show that VCNet achieves a classification accuracy of 92.1% on Spots-10 and 74.4% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating neuroscientific principles into network design can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.

cs.LG [Back]

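[115] VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision cs.LG | cs.AI | cs.CL

Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang

TL;DR: The paper proposes VRPO, a framework that strengthens the role of the value model under noisy supervision. Combining an auxiliary loss with an information bottleneck, it markedly improves the robustness of PPO training.

Details

Motivation: Real-world RLHF is often hampered by noisy reward supervision, which destabilizes the policy and degrades generalization; conventional methods overlook the key role of the value model.

Result: On math reasoning, science QA, and multi-turn dialogue tasks, VRPO outperforms PPO and GRPO baselines under both rule-based and model-based noisy rewards.

Insight: In RLHF the value model is not just a predictor but a regulator of noise, and optimizing it is essential for policy training in noisy environments.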

Abstract: Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model’s ability to filter out noise and capture key words from the context during advantage estimation, transforming it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue, under both rule-based and model-based noisy rewards, show that VRPO consistently outperforms PPO and GRPO baselines. Our findings underscore the often-overlooked importance of the value model in RLHF and offer a principled and practical approach to robust policy optimization in noisy real-world environments.

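[116] Understanding the Embedding Models on Hyper-relational Knowledge Graph cs.LG | cs.CL | cs.SI

Yubo Wang, Shimin Di, Zhili Wang, Haoyang Li, Fei Teng

TL;DR: This paper decomposes hyper-relational knowledge graphs (HKGs) into conventional knowledge graph (KG) form and evaluates classical KGE models on HKGs, finding that some perform comparably to HKGE models. Further analysis shows that the decomposition methods damage the HKG topology and that current HKGE models fall short on long-range dependency or information compression. The authors propose the FormerGNN framework, which preserves the HKG topology and improves information integration, yielding better performance.

Details

Motivation: Hyper-relational knowledge graphs (HKGs) extend conventional knowledge graphs (KGs) and represent facts with additional qualifiers more faithfully. However, it is unclear whether the performance advantage of existing HKGE models comes from their base KGE model or from the specially designed extension module, which calls for closer study.

Result: Experiments show that some classical KGE models perform comparably to HKGE models, while FormerGNN outperforms existing HKGE models.

Insight: 1. Decomposing an HKG damages its topology. 2. Current HKGE models fall short in capturing long-range dependencies and in integrating main-triple and qualifier information. 3. The FormerGNN framework offers a potential direction for addressing these problems.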

Abstract: Recently, Hyper-relational Knowledge Graphs (HKGs) have been proposed as an extension of traditional Knowledge Graphs (KGs) to better represent real-world facts with additional qualifiers. As a result, researchers have attempted to adapt classical Knowledge Graph Embedding (KGE) models for HKGs by designing extra qualifier processing modules. However, it remains unclear whether the superior performance of Hyper-relational KGE (HKGE) models arises from their base KGE model or the specially designed extension module. Hence, in this paper, we data-wise convert HKGs to KG format using three decomposition methods and then evaluate the performance of several classical KGE models on HKGs. Our results show that some KGE models achieve performance comparable to that of HKGE models. Upon further analysis, we find that the decomposition methods alter the original HKG topology and fail to fully preserve HKG information. Moreover, we observe that current HKGE models are either insufficient in capturing the graph’s long-range dependency or struggle to integrate main-triple and qualifier information due to the information compression issue. To further justify our findings and offer a potential direction for future HKGE research, we propose the FormerGNN framework. This framework employs a qualifier integrator to preserve the original HKG topology, and a GNN-based graph encoder to capture the graph’s long-range dependencies, followed by an improved approach for integrating main-triple and qualifier information to mitigate compression issues. Our experimental results demonstrate that FormerGNN outperforms existing HKGE models.
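
Sketch: to make "converting an HKG to KG format" concrete, here is one common way to flatten a hyper-relational fact into plain triples, namely reification through an auxiliary statement node. The abstract does not name its three decomposition methods, so this is an illustrative stand-in rather than one of the paper's methods; the node naming scheme and the `subject`/`predicate`/`object` relation names are assumptions.

```python
def reify(fact, fact_id):
    """fact: (head, relation, tail, qualifiers), where qualifiers is a list of
    (qualifier_relation, qualifier_value) pairs.
    Returns plain triples that hang everything off one statement node."""
    head, relation, tail, qualifiers = fact
    node = f"stmt_{fact_id}"  # hypothetical auxiliary node for this fact
    triples = [
        (node, "subject", head),
        (node, "predicate", relation),
        (node, "object", tail),
    ]
    triples += [(node, q_rel, q_val) for q_rel, q_val in qualifiers]
    return triples
```

Note that the head and tail are no longer directly connected after this conversion, which illustrates how a decomposition can alter the original graph topology.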

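[117] Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning cs.LG | cs.CL | cs.SE

Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich

TL;DR: Using reinforcement learning, the paper trains a software engineering agent that handles multi-turn interaction, improving on existing methods for complex tasks.

Details

Motivation: Existing reinforcement learning work mostly targets single-turn tasks such as mathematical reasoning or single-shot code generation, whereas real-world tasks such as software engineering require multi-turn interaction with a stateful environment that provides feedback.

Result: On the SWE-bench Verified benchmark, the success rate rises from 20% to 39%, and the agent performs on par with other leading open-weight models.

Insight: The study demonstrates the potential of reinforcement learning for complex multi-turn interaction tasks and offers a viable path to more capable autonomous agents built on open models.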

Abstract: Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent’s success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.

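[118] MoKA: Mixture of Kronecker Adapters cs.LG | cs.AI | cs.CL

Mohammadreza Sadeghi, Mahsa Ghazvini Nejad, MirHamed Jafarzadeh Asl, Yu Gu, Yuanhao Yu

TL;DR: MoKA is a new parameter-efficient fine-tuning method that models weight updates as a mixture of Kronecker products. It addresses the limited expressiveness of low-rank adapters and achieves a state-of-the-art trade-off between performance and parameter efficiency.

Details

Motivation: Current low-rank adapters have limited expressiveness because of the rank constraint and perform poorly on complex tasks.

Result: On low-bit quantized LLaMA2-7B and LLaMA3-8B models, MoKA outperforms baselines on instruction-tuning and commonsense reasoning tasks while using up to 27x fewer trainable parameters.

Insight: Combining a mixture of Kronecker products with a gating mechanism offers a new approach to parameter-efficient fine-tuning that balances performance and efficiency.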

Abstract: Parameter-efficient fine-tuning (PEFT) is essential for reducing the computational overhead of large language models (LLMs). Low-rank family adapters are commonly used to control the parameter size efficiently while maintaining the generative power of LLMs. However, their limited expressiveness due to the rank constraint often restricts their performance on complex tasks. We propose Mixture of Kronecker Adapters (MoKA), a new generation of Kronecker adapters that addresses this limitation by modeling weight updates as a mixture of Kronecker products. Our proposed adapter leverages a gating mechanism that measures the importance of each Kronecker factor, enabling more expressive adaptation. Moreover, MoKA enables a rank flexibility that provides a better trade-off between parameter efficiency and accuracy. To ensure hardware efficiency, we reformulate Kronecker computations using standard matrix operations, allowing seamless deployment on GPU-optimized hardware. We conduct extensive experiments on instruction-tuning and commonsense reasoning tasks using low-bit quantized versions of LLaMA2-7B and LLaMA3-8B models. MoKA not only outperforms PEFT baselines, but also reduces the number of trainable parameters by up to 27x, achieving state-of-the-art trade-offs between performance and parameter efficiency.
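
Sketch: a weight update modeled as a gated mixture of Kronecker products can be written as a weighted sum of A_i ⊗ B_i terms. The pure-Python version below is only meant to show the shape of the computation; the softmax gating, the function names, and the assumption that all factor pairs produce the same output shape are illustrative choices, not details confirmed by the abstract.

```python
import math


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def moka_update(factor_pairs, gate_logits):
    """Return sum_i g_i * kron(A_i, B_i), with g = softmax(gate_logits).
    Assumes every (A_i, B_i) pair yields a product of the same shape."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    gates = [e / z for e in exps]
    products = [kron(a, b) for a, b in factor_pairs]
    rows, cols = len(products[0]), len(products[0][0])
    return [
        [sum(g * p[i][j] for g, p in zip(gates, products)) for j in range(cols)]
        for i in range(rows)
    ]
```

A single Kronecker product of an m×n and a p×q factor stores mn + pq numbers yet spans an mp×nq update, which is where the parameter savings come from.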

eess.IV [Back]

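[119] REFLECT: Rectified Flows for Efficient Brain Anomaly Correction Transport eess.IV | cs.CV

Farzad Beizaee, Sina Hajimiri, Ismail Ben Ayed, Gregory Lodygensky, Christian Desrosiers

TL;DR: REFLECT is a framework based on rectified flows for efficiently correcting anomalous brain images. It maps them directly to the normal distribution in a single inference step and significantly outperforms existing unsupervised anomaly detection methods.

Details

Motivation: In brain imaging, unsupervised anomaly detection (UAD) struggles to localize anomalies accurately because brain anatomy is complex and abnormal examples are scarce. Existing diffusion-based models require iterative sampling and are inefficient.

Result: On popular UAD brain segmentation benchmarks, REFLECT significantly outperforms existing unsupervised anomaly detection methods.

Insight: Rectified flows provide an efficient direct mapping for anomaly detection that avoids the cost of iterative sampling and suits rapid clinical diagnosis.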

Abstract: Unsupervised anomaly detection (UAD) in brain imaging is crucial for identifying pathologies without the need for labeled data. However, accurately localizing anomalies remains challenging due to the intricate structure of brain anatomy and the scarcity of abnormal examples. In this work, we introduce REFLECT, a novel framework that leverages rectified flows to establish a direct, linear trajectory for correcting abnormal MR images toward a normal distribution. By learning a straight, one-step correction transport map, our method efficiently corrects brain anomalies and can precisely localize anomalies by detecting discrepancies between anomalous input and corrected counterpart. In contrast to the diffusion-based UAD models, which require iterative stochastic sampling, rectified flows provide a direct transport map, enabling single-step inference. Extensive experiments on popular UAD brain segmentation benchmarks demonstrate that REFLECT significantly outperforms state-of-the-art unsupervised anomaly detection methods. The code is available at https://github.com/farzad-bz/REFLECT.
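
Sketch: the inference pattern described in this abstract (one straight-line correction step, then localization from the difference between input and corrected image) can be shown on a toy 1-D signal. Here `toy_velocity` is a hand-written stand-in for the learned velocity network, and `normal_level` is an invented parameter; none of this is the authors' code.

```python
def correct_one_step(x, velocity):
    """Single Euler step along a straight trajectory: x_corrected = x + v(x)."""
    v = velocity(x)
    return [xi + vi for xi, vi in zip(x, v)]


def anomaly_map(x, x_corrected):
    """Per-element absolute difference between input and corrected signal."""
    return [abs(a - b) for a, b in zip(x, x_corrected)]


def toy_velocity(x, normal_level=0.5):
    """Hypothetical stand-in for a trained network: pulls values above
    normal_level back down to it and leaves all other values unchanged."""
    return [min(0.0, normal_level - xi) for xi in x]
```

With `x = [0.2, 0.9, 0.4]`, only the second element is moved, so the anomaly map is non-zero only at that position.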

Puzhen Wu, Mingquan Lin, Qingyu Chen, Emily Y. Chew, Zhiyong Lu

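TL;DR: The paper proposes AMD-Mamba, a multimodal framework that combines fundus images, genetic variants, and socio-demographic variables and uses metric learning and Vision Mamba to improve AMD prognosis.

Details

Motivation: AMD is a leading cause of irreversible vision loss. Existing methods fall short in capturing disease progression patterns and global information, so a stronger prognostic framework is needed.

Result: Validated on the AREDS dataset, the new biomarker is significant for AMD progression, and combining it with existing variables helps identify high-risk patients early.

Insight: A multimodal framework with fused global and local information markedly improves AMD prognosis, and the new biomarker offers a new tool for clinical management.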

Abstract: Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss, making effective prognosis crucial for timely intervention. In this work, we propose AMD-Mamba, a novel multi-modal framework for AMD prognosis, and further develop a new AMD biomarker. This framework integrates color fundus images with genetic variants and socio-demographic variables. At its core, AMD-Mamba introduces an innovative metric learning strategy that leverages AMD severity scale score as prior knowledge. This strategy allows the model to learn richer feature representations by aligning learned features with clinical phenotypes, thereby improving the capability of conventional prognosis methods in capturing disease progression patterns. In addition, unlike existing models that use traditional CNN backbones and focus primarily on local information, such as the presence of drusen, AMD-Mamba applies Vision Mamba and simultaneously fuses local and long-range global information, such as vascular changes. Furthermore, we enhance prediction performance through multi-scale fusion, combining image information with clinical variables at different resolutions. We evaluate AMD-Mamba on the AREDS dataset, which includes 45,818 color fundus photographs, 52 genetic variants, and 3 socio-demographic variables from 2,741 subjects. Our experimental results demonstrate that our proposed biomarker is one of the most significant biomarkers for the progression of AMD. Notably, combining this biomarker with other existing variables yields promising improvements in detecting high-risk AMD patients at early stages. These findings highlight the potential of our multi-modal framework to facilitate more precise and proactive management of AMD.

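[121] ClinicalFMamba: Advancing Clinical Assessment using Mamba-based Multimodal Neuroimaging Fusion eess.IV | cs.AI | cs.CV

Meng Zhou, Farzad Khalvati

TL;DR: This paper proposes ClinicalFMamba, a new end-to-end CNN-Mamba hybrid architecture that combines local and global feature modeling for 2D and 3D medical image fusion. A tri-plane scanning strategy learns 3D volumetric dependencies, and the method shows strong fusion performance and real-time processing on multiple datasets. It also significantly outperforms baselines on brain tumor classification, making it suitable for real-time clinical deployment.

Details

Motivation: Existing deep learning methods for medical image fusion have limitations: CNNs struggle to model global context, while Transformers model it well but at a high computational cost that limits clinical use. State Space Models (SSMs) offer efficient long-range dependency modeling, but their extension to 3D volumetric data and their clinical validation remain underexplored.

Result: Across multiple datasets, ClinicalFMamba significantly outperforms baselines in fusion performance and real-time processing. On brain tumor classification it outperforms existing methods, demonstrating its potential for clinical use.

Insight: 1. Mamba models provide efficient long-range modeling for medical image fusion. 2. The tri-plane scanning strategy effectively handles dependency modeling in 3D volumetric data. 3. The method offers an efficient, high-performing new paradigm for real-time clinical deployment.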

Abstract: Multimodal medical image fusion integrates complementary information from different imaging modalities to enhance diagnostic accuracy and treatment planning. While deep learning methods have advanced performance, existing approaches face critical limitations: Convolutional Neural Networks (CNNs) excel at local feature extraction but struggle to model global context effectively, while Transformers achieve superior long-range modeling at the cost of quadratic computational complexity, limiting clinical deployment. Recent State Space Models (SSMs) offer a promising alternative, enabling efficient long-range dependency modeling in linear time through selective scan mechanisms. Despite these advances, the extension to 3D volumetric data and the clinical validation of fused images remains underexplored. In this work, we propose ClinicalFMamba, a novel end-to-end CNN-Mamba hybrid architecture that synergistically combines local and global feature modeling for 2D and 3D images. We further design a tri-plane scanning strategy for effectively learning volumetric dependencies in 3D images. Comprehensive evaluations on three datasets demonstrate the superior fusion performance across multiple quantitative metrics while achieving real-time fusion. We further validate the clinical utility of our approach on downstream 2D/3D brain tumor classification tasks, achieving superior performance over baseline methods. Our method establishes a new paradigm for efficient multimodal medical image fusion suitable for real-time clinical deployment.

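[122] Nexus-INR: Diverse Knowledge-guided Arbitrary-Scale Multimodal Medical Image Super-Resolution eess.IV | cs.CV

Bo Zhang, JianFei Huo, Zheng Zhang, Wufan Wang, Hui Gao

TL;DR: Nexus-INR is a diverse-knowledge-guided framework for multimodal medical image super-resolution. Using a dual-branch encoder, a knowledge distillation module, and a segmentation module, it achieves high-quality, adaptive-resolution medical image reconstruction and performs strongly on the BraTS2020 dataset.

Details

Motivation: Traditional CNN methods struggle with arbitrary-resolution super-resolution (ARSR), and existing methods based on implicit neural representations (INR) remain limited when handling multimodal images. Nexus-INR aims to combine diverse knowledge and downstream tasks to improve medical image super-resolution quality.

Result: On the BraTS2020 dataset, Nexus-INR outperforms existing methods on both super-resolution and segmentation.

Insight: Fusing multimodal knowledge with downstream-task semantics can substantially improve both medical image super-resolution and downstream analysis.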

Abstract: Arbitrary-resolution super-resolution (ARSR) provides crucial flexibility for medical image analysis by adapting to diverse spatial resolutions. However, traditional CNN-based methods are inherently ill-suited for ARSR, as they are typically designed for fixed upsampling factors. While INR-based methods overcome this limitation, they still struggle to effectively process and leverage multi-modal images with varying resolutions and details. In this paper, we propose Nexus-INR, a Diverse Knowledge-guided ARSR framework, which employs varied information and downstream tasks to achieve high-quality, adaptive-resolution medical image super-resolution. Specifically, Nexus-INR contains three key components: a dual-branch encoder with an auxiliary classification task to effectively disentangle shared anatomical structures and modality-specific features; a knowledge distillation module using cross-modal attention that guides low-resolution modality reconstruction with a high-resolution reference, enhanced by a self-supervised consistency loss; and an integrated segmentation module that embeds anatomical semantics to improve both reconstruction quality and downstream segmentation performance. Experiments on the BraTS2020 dataset for both super-resolution and downstream segmentation demonstrate that Nexus-INR outperforms state-of-the-art methods across various metrics.

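[123] GL-LCM: Global-Local Latent Consistency Models for Fast High-Resolution Bone Suppression in Chest X-Ray Images eess.IV | cs.CV

Yifei Sun, Zhanghao Chen, Hao Zheng, Yuqing Lu, Lixin Duan

TL;DR: The paper proposes GL-LCM, a global-local latent consistency model for fast, high-resolution bone suppression in chest X-ray images. It addresses the trade-off in existing methods between complete bone suppression and preservation of local detail, while substantially improving computational efficiency.

Details

Motivation: Bone structures in chest X-ray images obscure key diagnostic details. Existing diffusion-based methods balance bone suppression and detail preservation poorly, and their high computational cost makes clinical use difficult.

Result: On the SZCH-X-Rays and JSRT datasets, GL-LCM significantly outperforms competing methods in both bone suppression and computational efficiency.

Insight: 1. Dual-path sampling and global-local fusion are key to efficient high-resolution bone suppression. 2. Local-Enhanced Guidance is applied directly at inference time, avoiding additional training.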

Abstract: Chest X-Ray (CXR) imaging for pulmonary diagnosis raises significant challenges, primarily because bone structures can obscure critical details necessary for accurate diagnosis. Recent advances in deep learning, particularly with diffusion models, offer significant promise for effectively minimizing the visibility of bone structures in CXR images, thereby improving clarity and diagnostic accuracy. Nevertheless, existing diffusion-based methods for bone suppression in CXR imaging struggle to balance the complete suppression of bones with preserving local texture details. Additionally, their high computational demand and extended processing time hinder their practical use in clinical settings. To address these limitations, we introduce a Global-Local Latent Consistency Model (GL-LCM) architecture. This model combines lung segmentation, dual-path sampling, and global-local fusion, enabling fast high-resolution bone suppression in CXR images. To tackle potential boundary artifacts and detail blurring in local-path sampling, we further propose Local-Enhanced Guidance, which addresses these issues without additional training. Comprehensive experiments on a self-collected dataset SZCH-X-Rays, and the public dataset JSRT, reveal that our GL-LCM delivers superior bone suppression and remarkable computational efficiency, significantly outperforming several competitive methods. Our code is available at https://github.com/diaoquesang/GL-LCM.

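[124] Evaluating the Predictive Value of Preoperative MRI for Erectile Dysfunction Following Radical Prostatectomy eess.IV | cs.CV

Gideon N. L. Rouwendaal, Daniël Boeke, Inge L. Cox, Henk G. van der Poel, Margriet C. van Dijk-de Haan

TL;DR: Preoperative MRI has limited predictive value for erectile dysfunction after radical prostatectomy and does not outperform clinical features, though it may complement them in future multimodal approaches.

Details

Motivation: Accurately predicting erectile dysfunction after radical prostatectomy is essential for patient counseling, but the added predictive value of MRI is not yet clear.

Result: The MRI deep learning models (maximum AUC 0.569) slightly outperformed handcrafted features (AUC 0.554) but fell short of the clinical baseline (AUC 0.663). Multimodal fusion gave only a marginal gain (AUC 0.586).

Insight: Although the MRI models did not surpass clinical prediction, they focus on anatomically relevant regions (such as the prostate and neurovascular bundles) and may complement future multimodal approaches.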

Abstract: Accurate preoperative prediction of erectile dysfunction (ED) is important for counseling patients undergoing radical prostatectomy. While clinical features are established predictors, the added value of preoperative MRI remains underexplored. We investigate whether MRI provides additional predictive value for ED at 12 months post-surgery, evaluating four modeling strategies: (1) a clinical-only baseline, representing current state-of-the-art; (2) classical models using handcrafted anatomical features derived from MRI; (3) deep learning models trained directly on MRI slices; and (4) multimodal fusion of imaging and clinical inputs. Imaging-based models (maximum AUC 0.569) slightly outperformed handcrafted anatomical approaches (AUC 0.554) but fell short of the clinical baseline (AUC 0.663). Fusion models offered marginal gains (AUC 0.586) but did not exceed clinical-only performance. SHAP analysis confirmed that clinical features contributed most to predictive performance. Saliency maps from the best-performing imaging model suggested a predominant focus on anatomically plausible regions, such as the prostate and neurovascular bundles. While MRI-based models did not improve predictive performance over clinical features, our findings suggest that they try to capture patterns in relevant anatomical structures and may complement clinical predictors in future multimodal approaches.
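
The AUC values compared above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise (Mann-Whitney) formulation; the function name and the toy labels and scores are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = ED at 12 months, 0 = no ED.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

On this reading, an AUC of 0.5 corresponds to chance-level ranking, which is why values such as 0.569 and 0.586 indicate only weak discrimination.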

cs.AI

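[125] Efficient Agents: Building Effective Agents While Reducing Cost cs.AI | cs.CL | cs.MA

Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou

TL;DR: The paper systematically studies the efficiency-performance trade-off in modern agent systems and proposes the Efficient Agents framework, which substantially reduces cost while maintaining high performance.

Details

Motivation: LLM-driven agents perform well on complex multi-step tasks, but their high cost limits scalability and accessibility. How to reduce cost while preserving performance is therefore a key question.

Result: Efficient Agents reduces operational cost from $0.398 to $0.228, a 28.4% improvement in cost-of-pass, while retaining 96.7% of OWL's performance.

Insight: Complexity is not always proportional to performance; optimizing framework design can substantially improve efficiency and points toward more sustainable AI-driven solutions.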

Abstract: The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework whose complexity is matched to task requirements. Efficient Agents retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational costs from $0.398 to $0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.
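
The cost-of-pass metric mentioned in the abstract can be illustrated with a short sketch. It assumes the common definition of cost-of-pass as the expected cost of obtaining one correct solution, that is, cost per attempt divided by success rate. The function names and all numbers below are hypothetical and are not the paper's measurements.

```python
def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to obtain one correct answer (assumed definition)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional reduction of the candidate relative to the baseline."""
    return (baseline - candidate) / baseline


# Hypothetical agents: the cheaper one is slightly less accurate.
baseline_cop = cost_of_pass(cost_per_attempt=0.20, success_rate=0.50)
candidate_cop = cost_of_pass(cost_per_attempt=0.12, success_rate=0.48)
improvement = relative_improvement(baseline_cop, candidate_cop)
print(f"baseline={baseline_cop:.3f} candidate={candidate_cop:.3f} "
      f"improvement={improvement:.1%}")
```

Dividing by the success rate is what lets the metric penalize a cheap agent that rarely succeeds, so a lower per-attempt cost only helps if accuracy holds up.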

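[126] Defend LLMs Through Self-Consciousness cs.AI | cs.CL | cs.CR

Boshi Huang, Fabio Nonato de Paula

TL;DR: The paper proposes a novel self-consciousness defense mechanism that uses the inherent reasoning capabilities of large language models (LLMs) to resist prompt injection attacks, avoiding the traditional reliance on external classifiers.

Details

Motivation: Traditional defenses rely on external tools, which adds computational cost and complexity; a self-consciousness-based defense can address prompt injection more efficiently and with less overhead.

Result: Experiments show that the method significantly improves defense success rates across multiple LLMs, with some models achieving near-perfect defense in Enhanced Mode.

Insight: Self-consciousness defense gives LLMs a lightweight, low-cost way to strengthen ethical behavior, particularly for widely deployed generative AI use cases.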

Abstract: This paper introduces a novel self-consciousness defense mechanism for Large Language Models (LLMs) to combat prompt injection attacks. Unlike traditional approaches that rely on external classifiers, our method leverages the LLM’s inherent reasoning capabilities to perform self-protection. We propose a framework that incorporates Meta-Cognitive and Arbitration Modules, enabling LLMs to evaluate and regulate their own outputs autonomously. Our approach is evaluated on seven state-of-the-art LLMs using two datasets: AdvBench and Prompt-Injection-Mixed-Techniques-2024. Experiment results demonstrate significant improvements in defense success rates across models and datasets, with some achieving perfect and near-perfect defense in Enhanced Mode. We also analyze the trade-off between defense success rate improvement and computational overhead. This self-consciousness method offers a lightweight, cost-effective solution for enhancing LLM ethics, particularly beneficial for GenAI use cases across various platforms.

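[127] Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling cs.AI | cs.CL | cs.LG

Peng Ding, Rick Stevens

TL;DR: The paper proposes a protocol-agnostic, unified approach to tool integration that simplifies tool-augmented LLM development, substantially reducing development overhead and improving execution performance.

Details

Motivation: The current tool-augmented LLM ecosystem is fragmented: developers must handle multiple protocols and complex workflows, and a unified solution is needed.

Result: Experiments show a 60-80% reduction in code, performance improvements of up to 3.1x, and compatibility with existing function-calling standards.

Insight: Protocol-agnostic design and optimized concurrent execution are key to efficient LLM tool integration.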

Abstract: The proliferation of tool-augmented Large Language Models (LLMs) has created a fragmented ecosystem where developers must navigate multiple protocols, manual schema definitions, and complex execution workflows. We address this challenge by proposing a unified approach to tool integration that abstracts protocol differences while optimizing execution performance. Our solution demonstrates how protocol-agnostic design principles can significantly reduce development overhead through automated schema generation, dual-mode concurrent execution, and seamless multi-source tool management. Experimental results show 60-80% code reduction across integration scenarios, performance improvements up to 3.1x through optimized concurrency, and full compatibility with existing function calling standards. This work contributes both theoretical insights into tool integration architecture and practical solutions for real-world LLM application development.
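
To make "automated schema generation" concrete, here is a minimal sketch that derives an OpenAI-style function-calling schema from a Python function's signature and docstring. It is an illustration built only on the standard library, not the authors' implementation; `function_to_schema`, the type mapping, and the `add` example are all assumptions.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def function_to_schema(func: Callable[..., Any]) -> dict:
    """Build a tool schema from a function's signature and docstring."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unmapped parameters fall back to "string".
        json_type = _JSON_TYPES.get(hints.get(name), "string")
        properties[name] = {"type": json_type}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


add_schema = function_to_schema(add)
```

Deriving the schema from the signature keeps the tool definition and its description in one place, which is one plausible way to avoid the manual schema definitions the abstract describes.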

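[128] AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive, Domain-Specific LLM Chatbots cs.AI | cs.CL

Xinjie Zhao, Moritz Blum, Fan Gao, Yingjian Chen, Boming Yang

TL;DR: AGENTiGraph is a multi-agent knowledge graph framework that lets users interact with and manage domain-specific data in natural language, so non-technical users can build and refine knowledge bases.

Details

Motivation: Existing knowledge management systems usually require technical expertise or specialized query languages, which limits use by non-technical users. AGENTiGraph aims to provide an intuitive, visual solution that simplifies knowledge management.

Result: It outperforms zero-shot baselines in both classification accuracy (95.12%) and execution success rate (90.45%), showing potential in compliance-critical domains such as law and medicine.

Insight: AGENTiGraph offers a new paradigm for combining LLMs with structured knowledge graphs, especially for domains that need dynamic updates and multi-step queries.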

Abstract: AGENTiGraph is a user-friendly, agent-driven system that enables intuitive interaction and management of domain-specific data through the manipulation of knowledge graphs in natural language. It gives non-technical users a complete, visual solution to incrementally build and refine their knowledge bases, allowing multi-round dialogues and dynamic updates without specialized query languages. The flexible design of AGENTiGraph, including intent classification, task planning, and automatic knowledge integration, ensures seamless reasoning between diverse tasks. Evaluated on a 3,500-query benchmark within an educational scenario, the system outperforms strong zero-shot baselines (achieving 95.12% classification accuracy, 90.45% execution success), indicating potential scalability to compliance-critical or multi-step queries in legal and medical domains, e.g., incorporating new statutes or research on the fly. Our open-source demo offers a powerful new paradigm for multi-turn enterprise knowledge management that bridges LLMs and structured graphs.

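[129] Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework cs.AI | cs.CL

Zikun Cui, Tianyi Huang, Chia-En Chiang, Cuiqianhe Du

TL;DR: The paper proposes a verifiable misinformation-detection LLM agent framework that, through dynamic interaction and coordinated use of multiple tools, goes beyond traditional binary judgments.

Details

Motivation: With the spread of large language models (LLMs), misinformation detection has become more important and more complex, and traditional binary judgments cannot meet the need for transparency and credibility.

Result: Experiments show the framework outperforms baselines in detection accuracy, reasoning transparency, and robustness to rewritten content.

Insight: The work offers a new paradigm for trustworthy AI-assisted fact-checking and underscores the importance of evidence chains and transparent reasoning in misinformation detection.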

Abstract: With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi-step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI-assisted fact-checking.

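[130] A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning cs.AI | cs.CL | cs.LG | cs.SC

Michael K. Chen

TL;DR: The paper compares two neurosymbolic AI approaches (integrative and hybrid) on domain-agnostic logical reasoning, finds the hybrid approach more promising, and proposes a generalizable framework.

Details

Motivation: Large language models lack determinism and interpretability on logical reasoning tasks. Neurosymbolic AI may offer a solution, but a systematic comparison on domain-agnostic tasks has been missing.

Result: The hybrid approach produces more interpretable reasoning chains and retains the advantages of large language models, making it suitable for general logical reasoning.

Insight: The hybrid approach combines the strengths of symbolic reasoning and neural networks, offering a more viable path toward general logical reasoning AI.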

Abstract: General logical reasoning, defined as the ability to reason deductively on domain-agnostic tasks, continues to be a challenge for large language models (LLMs). Current LLMs fail to reason deterministically and are not interpretable. As such, there has been a recent surge in interest in neurosymbolic AI, which attempts to incorporate logic into neural networks. We first identify two main neurosymbolic approaches to improving logical reasoning: (i) the integrative approach comprising models where symbolic reasoning is contained within the neural network, and (ii) the hybrid approach comprising models where a symbolic solver, separate from the neural network, performs symbolic reasoning. Both contain AI systems with promising results on domain-specific logical reasoning benchmarks. However, their performance on domain-agnostic benchmarks is understudied. To the best of our knowledge, there has not been a comparison of the contrasting approaches that answers the following question: Which approach is more promising for developing general logical reasoning? To analyze their potential, the following best-in-class domain-agnostic models are introduced: Logic Neural Network (LNN), which uses the integrative approach, and LLM-Symbolic Solver (LLM-SS), which uses the hybrid approach. Using both models as case studies and representatives of each approach, our analysis demonstrates that the hybrid approach is more promising for developing general logical reasoning because (i) its reasoning chain is more interpretable, and (ii) it retains the capabilities and advantages of existing LLMs. To support future works using the hybrid approach, we propose a generalizable framework based on LLM-SS that is modular by design, model-agnostic, domain-agnostic, and requires little to no human input.
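
The hybrid approach described above hands symbolic reasoning to a solver that sits outside the neural network. As a rough illustration of why such a solver yields a deterministic, inspectable reasoning chain, the sketch below runs forward chaining over propositional Horn rules. It is not LLM-SS: the step where an LLM would translate natural language into facts and rules is replaced by hand-written inputs, and all names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Rule = Tuple[Sequence[str], str]  # (premises, conclusion)


def forward_chain(
    facts: Iterable[str], rules: Sequence[Rule]
) -> Tuple[Set[str], List[Tuple[Tuple[str, ...], str]]]:
    """Derive every reachable conclusion and record each inference step."""
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                trace.append((tuple(premises), conclusion))
                changed = True
    return known, trace


# Hand-written stand-in for what an LLM front end might emit.
demo_facts = ["human(socrates)"]
demo_rules = [
    (["human(socrates)"], "mortal(socrates)"),
    (["mortal(socrates)"], "will_die(socrates)"),
]
demo_known, demo_trace = forward_chain(demo_facts, demo_rules)
for premises, conclusion in demo_trace:
    print(" & ".join(premises), "=>", conclusion)
```

Because the same facts and rules always produce the same trace, each derived conclusion can be audited step by step, which is the interpretability property the abstract attributes to the hybrid approach.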

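[131] T2UE: Generating Unlearnable Examples from Text Descriptions cs.AI | cs.CR | cs.CV

Xingjun Ma, Hanxun Huang, Tianwei Song, Ye Sun, Yifeng Gao

TL;DR: T2UE is a new framework that generates unlearnable examples (UEs) from text descriptions alone, avoiding exposure of the original image data. It resolves the privacy paradox of data protection and is shown experimentally to be effective against downstream tasks.

Details

Motivation: Existing methods generate UEs by jointly optimizing over images and text, which is computationally expensive and forces reliance on third-party services. This creates a privacy paradox: data must be exposed before it can be protected. T2UE aims to resolve it.

Result: Experiments show that T2UE-protected data substantially degrades performance on downstream tasks such as cross-modal retrieval, and the protection generalizes across architectures and to supervised learning settings.

Insight: Data privacy can be protected from text descriptions alone, suggesting a path to scalable data protection that avoids the exposure risk of traditional methods.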

Abstract: Large-scale pre-training frameworks like CLIP have revolutionized multimodal learning, but their reliance on web-scraped datasets, frequently containing private user data, raises serious concerns about misuse. Unlearnable Examples (UEs) have emerged as a promising countermeasure against unauthorized model training, employing carefully crafted unlearnable noise to disrupt the learning of meaningful representations from protected data. Current approaches typically generate UEs by jointly optimizing unlearnable noise for both images and their associated text descriptions (or labels). However, this optimization process is often computationally prohibitive for on-device execution, forcing reliance on external third-party services. This creates a fundamental privacy paradox: users must initially expose their data to these very services to achieve protection, thereby compromising privacy in the process. Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce Text-to-Unlearnable Example (T2UE), a novel framework that enables users to generate UEs using only text descriptions. T2UE circumvents the need for original image data by employing a text-to-image (T2I) model to map text descriptions into the image (noise) space, combined with an error-minimization framework to produce effective unlearnable noise. Extensive experiments show that T2UE-protected data substantially degrades performance in downstream tasks (e.g., cross-modal retrieval) for state-of-the-art models. Notably, the protective effect generalizes across diverse architectures and even to supervised learning settings. Our work demonstrates the feasibility of “zero-contact data protection”, where personal data can be safeguarded based solely on their textual descriptions, eliminating the need for direct data exposure.

cs.GR

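[132] READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation cs.GR | cs.CV | cs.SD | eess.AS

Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu

TL;DR: The paper proposes READ, which uses a temporal VAE and SpeechAE to produce compressed latent spaces and combines A2V-DiT with an asynchronous noise scheduler (ANS) to achieve efficient, real-time audio-driven talking head generation.

Details

Motivation: Diffusion models perform well in audio-driven talking head generation, but inference is too slow. The work aims to make generation real-time and efficient.

Result: Experiments show READ outperforms existing methods in runtime and generation quality and remains stable in long-duration generation.

Insight: Compressed latent spaces and asynchronous noise scheduling are effective strategies for accelerating diffusion model inference while preserving generation quality.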

Abstract: The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference process of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.

cs.SE

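[133] ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs cs.SE | cs.AI | cs.CL | cs.LG

Peng Ding

TL;DR: ToolRegistry is a protocol-agnostic tool management library that simplifies tool integration for function-calling LLMs.

Details

Motivation: Current tool integration approaches for LLM applications suffer from fragmentation, protocol limitations, and implementation complexity, leading to high development costs.

Result: Experiments show ToolRegistry reduces integration code by 60-80%, improves performance by up to 3.1x, and is 100% compatible with OpenAI function calling standards.

Insight: ToolRegistry's open-source availability and protocol flexibility provide an efficient, maintainable solution for LLM tool integration.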

Abstract: Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents ToolRegistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management via a unified interface. Our evaluation demonstrates that ToolRegistry achieves 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and 100% compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. ToolRegistry is open-source and available at https://github.com/Oaklight/ToolRegistry, with comprehensive documentation at https://toolregistry.readthedocs.io/.
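
The speedup from concurrent execution can be pictured with a small sketch that runs several independent tool calls in a thread pool and returns the results in request order. This uses only Python's standard `concurrent.futures` module and is not ToolRegistry's API; `run_tool_calls`, the call format, and the demo tools are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, keyword arguments)


def run_tool_calls(
    tools: Dict[str, Callable[..., Any]],
    calls: List[ToolCall],
    max_workers: int = 4,
) -> List[Dict[str, Any]]:
    """Run tool calls concurrently; results keep the order of `calls`."""

    def run_one(call: ToolCall) -> Dict[str, Any]:
        name, kwargs = call
        try:
            return {"name": name, "ok": True, "result": tools[name](**kwargs)}
        except Exception as exc:  # unknown tool or tool failure
            return {"name": name, "ok": False, "error": repr(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order regardless of completion order.
        return list(pool.map(run_one, calls))


# Hypothetical tools and calls.
demo_tools = {
    "square": lambda x: x * x,
    "greet": lambda name: f"hello {name}",
}
demo_calls = [("square", {"x": 3}), ("greet", {"name": "ada"}), ("missing", {})]
demo_results = run_tool_calls(demo_tools, demo_calls)
```

Threads suit I/O-bound tools such as HTTP requests; CPU-bound tools would more likely need a process pool, a choice this sketch leaves out.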

astro-ph.IM

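[134] Investigation on deep learning-based galaxy image translation models astro-ph.IM | astro-ph.GA | cs.CV

Hengxin Ruan, Qiufan Lin, Shupei Chen, Yang Wang, Wei Zhang

TL;DR: The paper studies deep learning-based galaxy image translation models and examines how well they preserve high-order physical information such as redshift.

Details

Motivation: Although prior work has made progress at the pixel and morphology levels, preservation of high-order physical information has received little discussion, yet it is crucial for studies that rely on high-fidelity image translation.

Result: The models do well on global structure and morphology statistics but retain redshift information poorly. Cross-band peak fluxes are informative but subject to noticeable uncertainties.

Insight: Imperfect translated images may still contain valuable information and suit downstream tasks that do not require high fidelity; the study offers a new perspective on how complex physical information is manifested in images.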

Abstract: Galaxy image translation is an important application in galaxy physics and cosmology. With deep learning-based generative models, image translation has been performed for image generation, data quality enhancement, information extraction, and generalized for other tasks such as deblending and anomaly detection. However, most endeavors on image translation primarily focus on the pixel-level and morphology-level statistics of galaxy images. There is a lack of discussion on the preservation of complex high-order galaxy physical information, which would be more challenging but crucial for studies that rely on high-fidelity image translation. Therefore, we investigated the effectiveness of generative models in preserving high-order physical information (represented by spectroscopic redshift) along with pixel-level and morphology-level information. We tested four representative models, i.e. a Swin Transformer, an SRGAN, a capsule network, and a diffusion model, using the SDSS and CFHTLS galaxy images. We found that these models show different levels of incapabilities in retaining redshift information, even if the global structures of galaxies and morphology-level statistics can be roughly reproduced. In particular, the cross-band peak fluxes of galaxies were found to contain meaningful redshift information, whereas they are subject to noticeable uncertainties in the translation of images, which may substantially be due to the nature of many-to-many mapping. Nonetheless, imperfect translated images may still contain a considerable amount of information and thus hold promise for downstream applications for which high image fidelity is not strongly required. Our work can facilitate further research on how complex physical information is manifested on galaxy images, and it provides implications on the development of image translation models for scientific use.

cs.CY

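[135] Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education cs.CY | cs.AI | cs.CL

Jean-Francois Chamberland, Martin C. Carlisle, Arul Jayaraman, Krishna R. Narayanan, Sunay Palsole

TL;DR: The paper presents an AI-based framework for evaluating and improving engineering education at scale. It uses large language models to analyze student feedback, pairs them with visual analytics to provide actionable insights, and is validated through deployment at a college of engineering.

Details

Motivation: Evaluating teaching effectiveness at scale in engineering education is beyond what traditional manual methods can handle, and a scalable automated solution is needed.

Result: Preliminary validation indicates that the framework can reliably support formative evaluation and professional development, and it has been deployed across a large college of engineering.

Insight: AI systems designed with transparency and embedded in shared governance can support continuous improvement of teaching quality at scale within academic institutions.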

Abstract: Evaluating teaching effectiveness at scale remains a persistent challenge for large universities, particularly within engineering programs that enroll tens of thousands of students. Traditional methods, such as manual review of student evaluations, are often impractical, leading to overlooked insights and inconsistent data use. This article presents a scalable, AI-supported framework for synthesizing qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments while upholding ethical safeguards. Visual analytics contextualize numeric scores through percentile-based comparisons, historical trends, and instructional load. The approach supports meaningful evaluation and aligns with best practices in qualitative analysis and educational assessment, incorporating student, peer, and self-reflective inputs without automating personnel decisions. We report on its successful deployment across a large college of engineering. Preliminary validation through comparisons with human reviewers, faculty feedback, and longitudinal analysis suggests that LLM-generated summaries can reliably support formative evaluation and professional development. This work demonstrates how AI systems, when designed with transparency and shared governance, can promote teaching excellence and continuous improvement at scale within academic institutions.

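[136] The Architecture of Trust: A Framework for AI-Augmented Real Estate Valuation in the Era of Structured Data cs.CY | cs.AI | cs.CV | I.2.1; H.4.2; K.5.2; I.2.10; I.4.8; K.4.1; J.1

Petteri Teikari, Mike Jarrell, Maryam Azh, Harri Pesola

TL;DR: The paper analyzes the impact on real estate valuation of the Uniform Appraisal Dataset (UAD) 3.6, which becomes mandatory in 2026, and proposes a three-layer AI-augmented framework intended to improve valuation reliability and trust.

Details

Motivation: The mandatory adoption of UAD 3.6 shifts real estate valuation from traditional narrative reports to structured data formats and, together with advances in AI, brings fundamental change to the market. The paper aims to address institutional failures and trust requirements in valuation.

Result: The analysis indicates that successful transformation requires not only technological sophistication but also human-AI collaboration that augments professional judgment and addresses historical biases.

Insight: AI-augmented valuation can substantially improve efficiency and fairness, but the technology must be combined with professional practice to avoid systemic risk.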

Abstract: The Uniform Appraisal Dataset (UAD) 3.6’s mandatory 2026 implementation transforms residential property valuation from narrative reporting to structured, machine-readable formats. This paper provides the first comprehensive analysis of this regulatory shift alongside concurrent AI advances in computer vision, natural language processing, and autonomous systems. We develop a three-layer framework for AI-augmented valuation addressing technical implementation and institutional trust requirements. Our analysis reveals how regulatory standardization converging with AI capabilities enables fundamental market restructuring with profound implications for professional practice, efficiency, and systemic risk. We make four key contributions: (1) documenting institutional failures including inter-appraiser variability and systematic biases undermining valuation reliability; (2) developing an architectural framework spanning physical data acquisition, semantic understanding, and cognitive reasoning that integrates emerging technologies while maintaining professional oversight; (3) addressing trust requirements for high-stakes financial applications including regulatory compliance, algorithmic fairness, and uncertainty quantification; (4) proposing evaluation methodologies beyond generic AI benchmarks toward domain-specific protocols. Our findings indicate successful transformation requires not merely technological sophistication but careful human-AI collaboration, creating systems that augment rather than replace professional expertise while addressing historical biases and information asymmetries in real estate markets.

cs.RO

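[137] DiWA: Diffusion Policy Adaptation with World Models cs.RO | cs.CV | cs.LG

Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard

TL;DR: DiWA proposes a world-model-based offline reinforcement learning method for efficiently fine-tuning diffusion policies, substantially reducing the need for real-world interaction.

Details

Motivation: Diffusion policies are hard to fine-tune with reinforcement learning: conventional RL methods need large amounts of real-world interaction, which is inefficient and impractical. DiWA aims to use a world model to fine-tune offline.

Result: On the CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring far fewer physical interactions than model-free baselines.

Insight: A world model can effectively stand in for real-environment interaction, giving an efficient route to offline fine-tuning of diffusion policies with practical value for real-world robot learning.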

Abstract: Fine-tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Moreover, standard RL methods require millions of real-world interactions, posing a major bottleneck for practical fine-tuning. Although prior work frames the denoising process in diffusion policies as a Markov Decision Process to enable RL-based updates, its strong dependence on environment interaction remains highly inefficient. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine-tuning diffusion-based robotic skills entirely offline with reinforcement learning. Unlike model-free approaches that require millions of environment interactions to fine-tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real-world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model-free baselines. To our knowledge, this is the first demonstration of fine-tuning diffusion policies for real-world robotic skills using an offline world model. We make the code publicly available at https://diwa.cs.uni-freiburg.de.

cs.IR [Back]

[138] MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation cs.IR | cs.CLPDF

Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao

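TL;DR: MultiRAG is a knowledge-guided framework for mitigating hallucination in multi-source retrieval-augmented generation. It aggregates logical relationships with multi-source line graphs and reduces information conflicts with a multi-level confidence mechanism.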

Details

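Motivation: Multi-source retrieval-augmented generation can supply more information, but it also brings sparse data distribution and inconsistencies between sources, which exacerbate hallucination. MultiRAG aims to address both problems.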

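Result: Experiments on four multi-domain query datasets and two multi-hop QA datasets show that MultiRAG improves the reliability and efficiency of knowledge retrieval.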

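Insight: Multi-source line graphs and a multi-level confidence mechanism are effective ways to handle data sparsity and inconsistency in multi-source retrieval.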

Abstract: Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches. Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. Our code is available at https://github.com/wuwenlong123/MultiRAG.

[139] PyLate: Flexible Training and Retrieval for Late Interaction Models cs.IR | cs.CLPDF

Antoine Chaffin, Raphaël Sourty

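TL;DR: PyLate is a library built on Sentence Transformers that simplifies training and retrieval for multi-vector (late interaction) models, addressing the limitations of single-vector models and providing efficient tooling for research and real-world use.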

Details

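Motivation: Single-vector search is the dominant paradigm in modern information retrieval, but compressing all information into one vector degrades performance on out-of-domain, long-context, and complex retrieval tasks. Multi-vector models such as ColBERT have advantages, but the lack of accessible tooling has held back their adoption.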

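Result: PyLate has already enabled the development of state-of-the-art models such as GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its usefulness in both research and production settings.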

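Insight: Multi-vector models perform well on complex retrieval tasks; PyLate may accelerate their adoption and study, and so advance modern information retrieval systems.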

Abstract: Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.
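
The MaxSim operator mentioned in the abstract can be illustrated with a minimal sketch. This is not PyLate's API or implementation; it is a plain-Python illustration of the general late-interaction scoring idea, in which each query token embedding is matched to its most similar document token embedding and the per-token maxima are summed. The function name `maxsim` and the toy two-dimensional embeddings are assumptions made for illustration only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score (hypothetical helper, not a PyLate function).

    For every query token embedding, take the maximum similarity over all
    document token embeddings, then sum these maxima over the query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (illustrative values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.1]]              # covers only the first one

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)

# Rank documents by descending MaxSim score.
ranking = sorted(
    [("doc_a", score_a), ("doc_b", score_b)],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Because every query token keeps its own embedding, a document that matches only part of the query (here `doc_b`) scores lower than one that matches each token, which is the behavior a single pooled vector can blur.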

Table of Contents

cs.CV [Back]

[1] PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation cs.CV | cs.LG | I.2.10; I.4.8; I.5.4PDF

[2] DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework cs.CVPDF

[3] Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets cs.CV | cs.AI | cs.LGPDF

[4] VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction cs.CV | cs.CLPDF

[5] RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation cs.CV | 68T07 | I.4.9; I.2.10PDF

[6] How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes cs.CV | cs.SD | eess.ASPDF

[7] Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces cs.CV | cs.AI | cs.CL | cs.ROPDF

[8] X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio cs.CVPDF

[9] Towards Robust Image Denoising with Scale Equivariance cs.CVPDF

[10] Diffusion Models with Adaptive Negative Sampling Without External Resources cs.CVPDF

[11] Separating Shared and Domain-Specific LoRAs for Multi-Domain Learning cs.CVPDF

[12] MoExDA: Domain Adaptation for Edge-based Action Recognition cs.CVPDF

[13] Adversarial Attention Perturbations for Large Object Detection Transformers cs.CVPDF

[14] Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models cs.CVPDF

[15] Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation cs.CVPDF

[16] Enhancing Long Video Question Answering with Scene-Localized Frame Grouping cs.CV | cs.AIPDF

[17] SA-3DGS: A Self-Adaptive Compression Method for 3D Gaussian Splatting cs.CVPDF

[18] MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention cs.CVPDF

[19] VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering cs.CV | cs.MMPDF

[20] Multi-human Interactive Talking Dataset cs.CVPDF

[21] Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation cs.CV | cs.AI | I.4.8PDF

[22] CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation cs.CVPDF

[23] Exploring Fairness across Fine-Grained Attributes in Large Vision-Language Models cs.CVPDF

[24] Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts cs.CVPDF

[25] AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video cs.CVPDF

[26] Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning cs.CVPDF

[27] H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction cs.CVPDF

[28] Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery cs.CV | cs.AIPDF

[29] Uint: Building Uint Detection Dataset cs.CVPDF

[30] UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying cs.CVPDF

[31] ChartCap: Mitigating Hallucination of Dense Chart Captioning cs.CV | cs.AI | cs.CLPDF

[32] SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision cs.CVPDF

[33] Advancing Precision in Multi-Point Cloud Fusion Environments cs.CV | cs.GRPDF

[34] Neovascularization Segmentation via a Multilateral Interaction-Enhanced Graph Convolutional Network cs.CVPDF

[35] AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding cs.CVPDF

[36] Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration cs.CVPDF

[37] GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations cs.CV | cs.AIPDF

[38] ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow cs.CVPDF

[39] Zero-shot Shape Classification of Nanoparticles in SEM Images using Vision Foundation Models cs.CVPDF

[40] Ultralight Polarity-Split Neuromorphic SNN for Event-Stream Super-Resolution cs.CV | cs.LGPDF

[41] V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models cs.CV | cs.AIPDF

[42] Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation cs.CVPDF

[43] VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation cs.CV | cs.AI | cs.CLPDF

[44] Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification cs.CVPDF

[45] Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation cs.CVPDF

[46] Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation cs.CVPDF

[47] Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching cs.CV | cs.CLPDF

[48] LRDDv2: Enhanced Long-Range Drone Detection Dataset with Range Information and Comprehensive Real-World Challenges cs.CV | cs.ROPDF

[49] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation cs.CVPDF

[50] Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration cs.CVPDF

[51] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration cs.CVPDF

[52] CIVQLLIE: Causal Intervention with Vector Quantization for Low-Light Image Enhancement cs.CVPDF

[53] FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models cs.CV | cs.LGPDF

[54] GRASPing Anatomy to Improve Pathology Segmentation cs.CVPDF

[55] Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation cs.CVPDF

[56] SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models cs.CV | cs.AI | cs.LGPDF

[57] Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling cs.CV | cs.AIPDF

[58] SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation cs.CV | cs.AIPDF

[59] Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN cs.CV | cs.AI | cs.GRPDF

[60] R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation cs.CV | cs.AI | cs.LGPDF

[61] MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis cs.CVPDF

[62] RAAG: Ratio Aware Adaptive Guidance cs.CVPDF

[63] CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection cs.CVPDF

[64] Video Demoireing using Focused-Defocused Dual-Camera System cs.CVPDF

[65] AVPDN: Learning Motion-Robust and Scale-Adaptive Representations for Video-Based Polyp Detection cs.CVPDF

[66] IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models cs.CVPDF

[67] VideoGuard: Protecting Video Content from Unauthorized Editing cs.CV | cs.AIPDF

[68] When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models cs.CV | cs.AIPDF

[69] ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes cs.CVPDF

[70] Prototype-Enhanced Confidence Modeling for Cross-Modal Medical Image-Report Retrieval cs.CVPDF

[71] EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation cs.CVPDF

[72] Distribution-aware Knowledge Unification and Association for Non-exemplar Lifelong Person Re-identification cs.CVPDF

[73] CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation cs.CVPDF

[74] Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection cs.CVPDF

[75] Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences cs.CVPDF

[76] A Scalable Machine Learning Pipeline for Building Footprint Detection in Historical Maps cs.CV | I.4PDF

[77] RadProPoser: A Framework for Human Pose Estimation with Uncertainty Quantification from Raw Radar Data cs.CVPDF

[78] DyCAF-Net: Dynamic Class-Aware Fusion Network cs.CV | cs.LGPDF