cs.CV [Total: 86]
cs.CL [Total: 43]
astro-ph.EP [Total: 1]
cs.SE [Total: 1]
cs.CY [Total: 1]
cs.LG [Total: 9]
cs.RO [Total: 7]
cs.AI [Total: 5]
cs.GR [Total: 1]
cs.IR [Total: 1]
q-bio.QM [Total: 1]
physics.ins-det [Total: 1]
cs.CG [Total: 1]
eess.IV [Total: 10]
cs.SI [Total: 1]

cs.CV

[1] A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion cs.CV | eess.IV

Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada

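TL;DR: The paper proposes a view-free baseline for single-view image guided point cloud completion that takes only point clouds as input, substantially improves performance through a hierarchical self-fusion mechanism, and questions whether multimodal methods are necessary.

Details

Motivation: The study examines whether image guidance is fundamentally necessary in single-view image guided point cloud completion, and tests the question by designing a strong baseline that needs no image input.

Result: Experiments on the ShapeNet-ViPC dataset show that the method outperforms existing image-guided SVIPC methods.

Insight: Image guidance may not be essential for this multimodal task; a point-cloud-only solution can be more efficient.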

Abstract: The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong view-free baseline for SVIPC: an attention-based multi-branch encoder-decoder network that takes only partial point clouds as input. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.

[2] VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service cs.CV | cs.CL

Xiasi Wang, Tianliang Yao, Simin Chen, Runqi Wang, Lei YE

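TL;DR: VLMInferSlow is a new method for evaluating the efficiency robustness of vision-language models (VLMs) in ML-as-a-service settings. It generates adversarial examples in a black-box setting that increase computational cost.

Details

Motivation: Existing research focuses on VLM accuracy and neglects efficiency robustness, which matters most in real-time applications. Earlier methods require access to the model architecture and parameters, which is unrealistic in ML-as-a-service scenarios where models are deployed behind APIs.

Result: Experiments show that adversarial examples generated by VLMInferSlow increase computational cost by up to 128.47%.

Insight: Efficiency robustness of VLMs is an important problem that should be evaluated in a realistic black-box setting. Adversarial examples can significantly degrade the efficiency of deployed VLM services.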

Abstract: Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters – an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community’s awareness about the efficiency robustness of VLMs.
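The zero-order (gradient-free) search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: `query_cost` is a hypothetical stand-in for a cost measured through an inference API, and the two-point random-direction estimator, step size, and clipping range are all assumptions chosen for illustration.

```python
import random

def query_cost(x):
    # Hypothetical black-box objective standing in for a measured inference cost.
    # Only function values are observable, never gradients.
    return -sum((xi - 0.5) ** 2 for xi in x)

def zero_order_step(x, cost_fn, rng, sigma=0.05, lr=0.1, samples=20):
    """One ascent step using a two-point random-direction gradient estimate."""
    grad = [0.0] * len(x)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = cost_fn([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = cost_fn([xi - sigma * ui for xi, ui in zip(x, u)])
        scale = (plus - minus) / (2.0 * sigma * samples)
        grad = [g + scale * ui for g, ui in zip(grad, u)]
    # Clip to [0, 1] to mimic a bounded pixel range (an assumption).
    return [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]

def search(cost_fn, dim=4, steps=50, seed=0):
    """Run the search and return (initial cost, final cost, final input)."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(dim)]
    start = cost_fn(x)
    for _ in range(steps):
        x = zero_order_step(x, cost_fn, rng)
    return start, cost_fn(x), x
```

Calling `search(query_cost)` raises the toy cost using function evaluations alone, which is the property that makes this family of methods usable against an API that exposes no gradients.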

Ruoyu Wang, Tong Yu, Junda Wu, Yao Liu, Julian McAuley

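TL;DR: The paper proposes Weakly-supervised Partial Contrastive Learning (WPCL), which improves an agent's ability to recognize objects from dynamic viewpoints in vision-language navigation (VLN) by integrating pre-trained vision-language model (VLM) knowledge without fine-tuning.

Details

Motivation: Existing VLN methods rely on pre-trained models that struggle with dynamic viewpoints. Pre-trained LLMs and VLMs perform poorly without fine-tuning, while fine-tuning brings high computational cost.

Result: The method outperforms baselines on multiple benchmarks, validating its effectiveness, robustness, and generalizability.

Insight: VLM knowledge can be exploited efficiently without fine-tuning the VLM, which gives VLN a computationally efficient solution.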

Abstract: Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Despite the progress made by existing methods, they share some common limitations. First, they rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. Second, the performance is limited when using pre-trained LLMs or VLMs without fine-tuning, due to the absence of VLN domain knowledge. Third, while fine-tuning LLMs and VLMs can improve results, it incurs higher computational costs than using them without fine-tuning. To address these limitations, we propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent's ability to identify objects from dynamic viewpoints in VLN scenarios by effectively integrating pre-trained VLM knowledge into the perception process, without requiring VLM fine-tuning. Our method enhances the agent's ability to interpret and respond to environmental cues while ensuring computational efficiency. Experimental results show that our method outperforms the baseline methods on multiple benchmarks, which validates its effectiveness, robustness, and generalizability.

[4] ADAM-Dehaze: Adaptive Density-Aware Multi-Stage Dehazing for Improved Object Detection in Foggy Conditions cs.CV

Fatmah AlHindaassi, Mohammed Talha Alam, Fakhri Karray

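TL;DR: ADAM-Dehaze is an adaptive, density-aware, multi-stage dehazing framework that uses dynamic routing and an adaptive loss to optimize image restoration and object detection, markedly improving both dehazing quality and detection performance.

Details

Motivation: Adverse weather such as fog severely affects autonomous driving and surveillance systems, and existing methods adapt poorly to different fog densities.

Result: On the Cityscapes and RTTS datasets, PSNR improves by up to 2.1 dB, FADE falls by 30%, object detection mAP rises by up to 13 points, and inference time drops by 20%.

Insight: Density awareness and joint optimization with the downstream task are key to efficient dehazing and better downstream performance.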

Abstract: Adverse weather conditions, particularly fog, pose a significant challenge to autonomous vehicles, surveillance systems, and other safety-critical applications by severely degrading visual information. We introduce ADAM-Dehaze, an adaptive, density-aware dehazing framework that jointly optimizes image restoration and object detection under varying fog intensities. A lightweight Haze Density Estimation Network (HDEN) classifies each input as light, medium, or heavy fog. Based on this score, the system dynamically routes the image through one of three CORUN branches: Light, Medium, or Complex, each tailored to its haze regime. A novel adaptive loss balances physical-model coherence and perceptual fidelity, ensuring both accurate defogging and preservation of fine details. On Cityscapes and the real-world RTTS benchmark, ADAM-Dehaze improves PSNR by up to 2.1 dB, reduces FADE by 30 percent, and increases object detection mAP by up to 13 points, while cutting inference time by 20 percent. These results highlight the importance of intensity-specific processing and seamless integration with downstream vision tasks. Code available at: https://github.com/talha-alam/ADAM-Dehaze.
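The density-based routing described in the abstract can be sketched as a small dispatcher. The class names and branch names (Light, Medium, Complex) come from the abstract; the numeric thresholds, the scalar `density_score`, and the placeholder branch functions are assumptions made only for illustration, not the HDEN or CORUN implementations.

```python
# Mapping from haze class to branch name, as named in the abstract.
BRANCH_FOR_CLASS = {"light": "Light", "medium": "Medium", "heavy": "Complex"}

def classify_haze(density_score, light_max=0.33, medium_max=0.66):
    """Map a score in [0, 1] to a haze class. Thresholds are illustrative."""
    if density_score < light_max:
        return "light"
    if density_score < medium_max:
        return "medium"
    return "heavy"

def route(image, density_score, branches):
    """Send the image through the single branch matching its haze class."""
    branch_name = BRANCH_FOR_CLASS[classify_haze(density_score)]
    return branch_name, branches[branch_name](image)

# Placeholder branches that only tag the input; real branches would dehaze it.
PLACEHOLDER_BRANCHES = {
    "Light": lambda img: ("light-processed", img),
    "Medium": lambda img: ("medium-processed", img),
    "Complex": lambda img: ("complex-processed", img),
}
```

Because only one branch runs per image, the cost of an inference call depends on the selected branch rather than on the sum of all three.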

[5] EchoShot: Multi-Shot Portrait Video Generation cs.CV

Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan

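TL;DR: EchoShot is a multi-shot portrait video generation framework built on a video diffusion model. With a shot-aware position embedding mechanism and the high-quality PortraitGala dataset, it achieves identity consistency and content controllability.

Details

Motivation: Existing video diffusion models support only single-shot generation, whereas real applications need multi-shot video generation with a consistent identity.

Result: Experiments show that EchoShot achieves identity consistency and attribute-level controllability in multi-shot video generation.

Insight: EchoShot's design offers a foundational paradigm for general multi-shot video modeling.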

Abstract: Video diffusion models substantially boost the productivity of artistic workflows with their capacity for high-quality portrait video generation. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications call for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. To start with, we propose shot-aware position embedding mechanisms within the video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model training in multi-shot scenarios, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions such as facial attributes, outfits, and dynamic motions. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts. Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling.

[6] Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation cs.CV | cs.LG

Abdolazim Rezaei, Mehdi Sookhak, Ahmad Patooghy

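TL;DR: The paper proposes a privacy-preserving framework that converts images into semantically equivalent text descriptions to protect sensitive visual information captured by AI cameras in CAVs. Combining feedback-based reinforcement learning with vision-language models, it markedly improves privacy protection and text quality.

Details

Motivation: Roadside devices in CAVs, such as AI cameras, process privacy-sensitive data. Traditional techniques such as blurring still leave privacy at risk, for example by allowing individuals to be tracked through their clothing.

Result: Compared with existing methods, privacy protection and text quality improve significantly, with Unique Word Count increasing by about 77% and Detail Density by about 50%.

Insight: Vision-to-text conversion is an effective privacy-preserving approach, and reinforcement learning can balance semantic accuracy against privacy.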

Abstract: Connected and Autonomous Vehicles (CAVs) rely on a range of devices that often process privacy-sensitive data. Among these, roadside units play a critical role particularly through the use of AI-equipped (AIE) cameras for applications such as violation detection. However, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. While traditional techniques such as face blurring and obfuscation have been applied to mitigate privacy risks, individual privacy remains at risk, as individuals can still be tracked using other features such as their clothing. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The main idea is to convert images into semantically equivalent textual descriptions, ensuring that scene-relevant information is retained while visual privacy is preserved. A hierarchical RL strategy is employed to iteratively refine the generated text, enhancing both semantic accuracy and privacy. Evaluation results demonstrate significant improvements in both privacy protection and textual quality, with the Unique Word Count increasing by approximately 77% and Detail Density by around 50% compared to existing approaches.

[7] Visual symbolic mechanisms: Emergent symbol processing in vision language models cs.CV

Rim Assouel, Declan Campbell, Taylor Webb

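TL;DR: The paper studies symbolic mechanisms in vision language models (VLMs) that are implemented through a content-independent spatial indexing scheme, showing how these mechanisms solve the binding problem and why they fail.

Details

Motivation: VLMs perform poorly on tasks that require feature binding. The study asks whether VLMs, like language models, use symbol-like mechanisms to solve the binding problem.

Result: The study finds symbol-like indexing mechanisms in VLMs and shows that binding errors are directly tied to failures of these mechanisms.

Insight: The paper offers a new perspective on symbol processing in VLMs and suggests possible directions for improving performance on binding tasks.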

Abstract: To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this ‘binding problem’ via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.

[8] MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior cs.CV | cs.AI | eess.IV

Liangyan Li, Yimo Ning, Kevin Le, Wei Dong, Yunzhe Li

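TL;DR: The paper proposes an adaptive multi-scale demoiréing framework that combines MAP estimation with deep learning, using linear attention test-time training and a truncated flow matching prior to handle nonlinear degradation.

Details

Motivation: Existing demoiréing methods either fail to remove moiré patterns completely or produce overly smooth results, mainly because of limited model capacity and scarce training data. Generative models excel at restoring linear degradations but perform poorly on nonlinear tasks and tend to introduce artifacts.

Result: The framework performs well on demoiréing, restoring high-frequency details and suppressing artifacts.

Insight: A hybrid approach that combines the strengths of supervised learning and generative models has a clear advantage on nonlinear image degradation tasks and suggests a direction for similar problems.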

Abstract: This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques. Demoiréing addresses inherently nonlinear degradation processes, which pose significant challenges for existing methods. Traditional supervised learning approaches either fail to remove moiré patterns completely or produce overly smooth results. This stems from constrained model capacity and scarce training data, which inadequately represent the clean image distribution and hinder accurate reconstruction of ground-truth images. While generative models excel in image restoration for linear degradations, they struggle with nonlinear cases such as demoiréing and often introduce artifacts. To address these limitations, we propose a hybrid MAP-based framework that integrates two complementary components. The first is a supervised learning model enhanced with efficient linear attention Test-Time Training (TTT) modules, which directly learn nonlinear mappings for RAW-to-sRGB demoiréing. The second is a Truncated Flow Matching Prior (TFMP) that further refines the outputs by aligning them with the clean image distribution, effectively restoring high-frequency details and suppressing artifacts. These two components combine the computational efficiency of linear attention with the refinement abilities of generative models, resulting in improved restoration performance.
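For readers unfamiliar with MAP estimation, the generic objective is written out below. The symbols are assumptions introduced here and do not appear in the abstract: $y$ is the observed moiré-degraded input and $x$ the clean image. How the paper's two components map onto the two terms is not stated in the abstract, so this is only the textbook formulation.

```latex
\hat{x}_{\mathrm{MAP}}
  = \arg\max_{x} \, p(x \mid y)
  = \arg\max_{x} \, p(y \mid x)\, p(x)
  = \arg\min_{x} \, \bigl[ -\log p(y \mid x) - \log p(x) \bigr]
```

The first term measures fidelity to the observation and the second is the prior over clean images, which is the role a learned generative prior would typically play.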

[9] Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization cs.CV | cs.AI | cs.MM

Yosub Shin, Igor Molybog

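TL;DR: The paper proposes VideoSync, a video synchronization framework that does not depend on audio or specific visual features and is therefore more general. It also builds new datasets and corrects biases in existing methods, demonstrating the advantages of VideoSync.

Details

Motivation: Existing video synchronization methods rely heavily on audio or on specific visual features such as human pose, which limits their use where such signals are absent, and the field lacks general and reproducible benchmarks.

Result: Under fair experimental conditions, VideoSync outperforms existing methods such as SeSyn-Net and works in single-human, multi-human, and non-human scenarios.

Insight: Synchronization methods that depend on specific features can be biased, and a general-purpose framework can substantially improve robustness and applicability.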

Abstract: Video synchronization, the alignment of multiple video streams capturing the same event from different angles, is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings where such signals may be unreliable or absent. Additionally, existing benchmarks for video synchronization lack generality and reproducibility, restricting progress in the field. In this work, we introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods, such as human pose estimation, enabling broader applicability across different content types. We evaluate our system on newly composed datasets covering single-human, multi-human, and non-human scenarios, providing both the methodology and code for dataset creation to establish reproducible benchmarks. Our analysis reveals biases in prior SOTA work, particularly in SeSyn-Net's preprocessing pipeline, leading to inflated performance claims. We correct these biases and propose a more rigorous evaluation framework, demonstrating that VideoSync outperforms existing approaches, including SeSyn-Net, under fair experimental conditions. Additionally, we explore various synchronization offset prediction methods, identifying a convolutional neural network (CNN)-based model as the most effective. Our findings advance video synchronization beyond domain-specific constraints, making it more generalizable and robust for real-world applications.

[10] Polyline Path Masked Attention for Vision Transformer cs.CV

Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang

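TL;DR: The paper proposes Polyline Path Masked Attention (PPMA), a new attention mechanism that combines the self-attention of ViTs with the structured mask of Mamba2. A 2D polyline path scanning strategy improves the modeling of spatial adjacency, and the model outperforms existing methods on image classification, object detection, and segmentation.

Details

Motivation: Global dependency modeling and spatial position modeling are core issues in deep learning architecture design. ViTs perform well in computer vision but lack explicit spatial adjacency modeling, while Mamba2 has shown potential in natural language processing through its structured mask. The paper aims to combine the strengths of both to improve performance on vision tasks.

Result: On ADE20K semantic segmentation, the PPMA-T/S/B models reach 48.7%/51.1%/52.3% mIoU, outperforming RMT-T/S/B.

Insight: By explicitly modeling spatial adjacency, PPMA markedly improves performance on vision tasks and shows the potential of combining self-attention with structured masks.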

Abstract: Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
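To make the idea of a structured mask concrete, here is a minimal sketch of the one-dimensional causal decay mask used in Mamba2-style models, where entry (i, j) is the product of per-token decay factors between positions j and i. The paper's 2D polyline path mask is not reproduced here; the function names and the plain-list representation are assumptions for illustration.

```python
def decay_mask(a):
    """Build the 1D causal decay mask L with L[i][j] = a[j+1] * ... * a[i] for i >= j.

    Entries above the diagonal are 0 and the diagonal is 1 (empty product).
    """
    n = len(a)
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prod = 1.0
        mask[i][i] = 1.0
        # Walk left from the diagonal, accumulating decay factors.
        for j in range(i - 1, -1, -1):
            prod *= a[j + 1]
            mask[i][j] = prod
    return mask

def apply_mask(scores, mask):
    """Elementwise product of attention scores with the structured mask."""
    return [[s * m for s, m in zip(srow, mrow)] for srow, mrow in zip(scores, mask)]
```

With decay factors below 1, distant token pairs are down-weighted, which is how such a mask encodes an adjacency prior along the scan order.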

[11] LBMamba: Locally Bi-directional Mamba cs.CV

Jingwei Zhang, Xi Han, Hong Qin, Mahdi S. Hosseini, Dimitris Samaras

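TL;DR: LBMamba replaces the global bi-directional scan used with Mamba by a locally bi-directional scan, reducing the computational load and improving efficiency while performing well on several vision tasks.

Details

Motivation: Mamba's unidirectional nature limits the context available to each state. Existing methods restore performance with a global bi-directional scan, but this doubles the computational load. LBMamba addresses the problem with a locally bi-directional scan.

Result: On ImageNet and other datasets, LBVim performs markedly better at the same throughput. In pathology image classification, integrating LBMamba improves AUC and other metrics.

Insight: A locally bi-directional scan is an effective way to balance Mamba's efficiency and performance and applies to multiple vision tasks.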

Abstract: Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel selective scan, has recently emerged as a linearly-scaling, efficient alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to the states after it. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba's global forward scan with a global backward scan, forming a bi-directional scan that restores a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate these extra scans, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan and executes it entirely in per-thread registers. Building on LBMamba, we present LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate the versatility of our approach on both natural images and whole slide images (WSIs). We show that our LBVim consistently offers a superior performance-throughput trade-off. That is, under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher APb and 1.1% higher APm on the COCO detection dataset. We also integrate LBMamba into the SOTA pathology multiple instance learning (MIL) approach, MambaMIL, which uses a single-directional scan. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy.
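The difference between a forward scan and a locally backward scan can be sketched on a toy linear recurrence. This is an illustration only, under strong assumptions: a single constant decay `a` replaces Mamba's input-dependent parameters, the window size is arbitrary, and the way the two scans are combined (sum minus the doubly counted input) is a guess, not the rule used by LBMamba.

```python
def forward_scan(x, a):
    """Causal recurrence h_t = a * h_{t-1} + x_t over the whole sequence."""
    h, out = 0.0, []
    for xt in x:
        h = a * h + xt
        out.append(h)
    return out

def local_backward_scan(x, a, window):
    """Right-to-left recurrence restarted in every window of `window` tokens."""
    out = [0.0] * len(x)
    for start in range(0, len(x), window):
        b = 0.0
        end = min(start + window, len(x))
        for t in range(end - 1, start - 1, -1):
            b = a * b + x[t]
            out[t] = b
    return out

def locally_bidirectional(x, a, window):
    """Combine both scans; x_t is subtracted once because both scans include it."""
    fwd = forward_scan(x, a)
    bwd = local_backward_scan(x, a, window)
    return [f + b - xt for f, b, xt in zip(fwd, bwd, x)]
```

For example, `locally_bidirectional([1.0, 2.0, 3.0, 4.0], 1.0, 2)` gives each token the full past plus the remaining tokens of its own window, so the backward pass never crosses a window boundary.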

[12] Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization cs.CV | cs.AI

Cong Wang, Zexuan Deng, Zhiwei Jiang, Fei Shen, Yafeng Yin

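TL;DR: SignViP is a new sign language video generation framework that improves generation quality through a discrete multi-condition representation.

Details

Motivation: Existing methods rely on a single coarse condition, such as skeleton sequences, as the intermediary, which limits the naturalness and expressiveness of the generated videos.

Result: Experiments show that SignViP outperforms existing methods in video quality, temporal coherence, and semantic fidelity.

Insight: A discrete multi-condition representation is key to improving generation quality, and the FSQ Autoencoder performs well at compressing the conditions.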

Abstract: Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
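The quantization step in component (2) can be illustrated with a minimal sketch of finite scalar quantization: each channel is bounded with `tanh` and rounded to a small fixed set of integer levels, and the per-channel codes are combined into one token index. The level counts and function names are assumptions for illustration, only odd level counts are handled, and the training-time straight-through gradient is omitted. This is not the SignViP autoencoder.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each channel of z to an odd number of integer levels centred on 0."""
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) / 2
        bounded = half * math.tanh(value)  # squashes into (-half, half)
        codes.append(int(round(bounded)))
    return codes

def fsq_index(codes, levels):
    """Flatten per-channel codes into a single token index (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)
    return index
```

The implicit codebook size is the product of the level counts, so `levels = [5, 5, 3]` yields 75 possible tokens without any learned codebook.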

[13] PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models cs.CV | cs.GR

Tianchen Zhao, Ke Hong, Xinhao Yang, Xuefeng Xiao, Huixia Li

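TL;DR: PAROAttention is an efficient reordering technique for sparse and quantized attention in visual generation models. By unifying dispersed attention patterns into a hardware-friendly block-wise pattern, it markedly improves the efficiency of sparsification and quantization.

Details

Motivation: The quadratic complexity of attention in visual generation leads to high memory and computational costs, especially for high-resolution images or multi-frame video. Sparsification and quantization have been explored but face difficulties at low density and low bitwidths.

Result: In video and image generation, PAROAttention achieves results nearly identical to full-precision baselines while operating at a density as low as 20%-30% and at INT8/INT4 bitwidths, with a 1.9x to 2.7x end-to-end latency speedup.

Insight: Reordering the attention pattern is more effective than adapting sparsification and quantization to it directly, and the local aggregation property of visual features motivates hardware-friendly pattern design.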

Abstract: In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs, especially for longer token sequences required in high-resolution image or multi-frame video generation. To address this, prior research has explored techniques such as sparsification and quantization. However, these techniques face significant challenges under low density and reduced bitwidths. Through systematic analysis, we identify that the core difficulty stems from the dispersed and irregular characteristics of visual attention patterns. Therefore, instead of introducing specialized sparsification and quantization design to accommodate such patterns, we propose an alternative strategy: reorganizing the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel Pattern-Aware token ReOrdering (PARO) technique, which unifies the diverse attention patterns into a hardware-friendly block-wise pattern. This unification substantially simplifies and enhances both sparsification and quantization. We evaluate the performance-efficiency trade-offs of various design choices and finalize a methodology tailored for the unified pattern. Our approach, PAROAttention, achieves video and image generation with lossless metrics, and nearly identical results from full-precision (FP) baselines, while operating at notably lower density (~20%-30%) and bitwidth (INT8/INT4), achieving a 1.9x to 2.7x end-to-end latency speedup.
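As a rough illustration of what token reordering means, the sketch below permutes tokens of a video laid out on a (frame, height, width) grid into a different axis order, so that tokens that were far apart in the flattened sequence become contiguous. The axis names, the choice of target order, and the helper functions are assumptions; the abstract does not say how PARO selects its permutation.

```python
AXES = ("F", "H", "W")  # frame, height, width

def reorder_indices(shape, order):
    """Return perm where perm[new_pos] is the original flat index in (F, H, W) layout."""
    sizes = dict(zip(AXES, shape))
    strides = {"F": shape[1] * shape[2], "H": shape[2], "W": 1}
    a0, a1, a2 = order
    perm = []
    for i in range(sizes[a0]):
        for j in range(sizes[a1]):
            for k in range(sizes[a2]):
                perm.append(i * strides[a0] + j * strides[a1] + k * strides[a2])
    return perm

def inverse_permutation(perm):
    """Permutation that restores the original token order after attention."""
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv

def permute(tokens, perm):
    """Gather tokens into the order given by perm."""
    return [tokens[p] for p in perm]
```

Applying `permute` before attention and the inverse permutation afterwards leaves the result unchanged up to ordering, while the attention map seen by the kernel can take a more block-like shape.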

[14] Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation cs.CV

Yong Liu, SongLi Wu, Sule Bai, Jiahao Wang, Yitong Wang

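TL;DR: The paper introduces OpenBench, a new benchmark for evaluating open-vocabulary segmentation models more accurately, and proposes OVSNet, which achieves the best performance through heterogeneous feature fusion and expansion of the training space.

Details

Motivation: The semantic space of existing open-vocabulary segmentation benchmarks resembles that of the training set, so they cannot properly evaluate how well models understand diverse real-world concepts.

Result: OVSNet achieves SOTA results on both existing datasets and OpenBench.

Insight: Evaluating open-vocabulary segmentation models requires a broader and more differentiated test setting.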

Abstract: Open-vocabulary segmentation aims to achieve segmentation of arbitrary categories given unlimited text inputs as guidance. To achieve this, recent works have focused on developing various technical routes to exploit the potential of large-scale pre-trained vision-language models and have made significant progress on existing benchmarks. However, we find that existing test sets are limited in measuring the models' comprehension of "open-vocabulary" concepts, as their semantic space closely resembles the training space, even with many overlapping categories. To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. It is designed to better assess the model's ability to understand and segment a wide range of real-world concepts. When testing existing methods on OpenBench, we find that their performance diverges from the conclusions drawn on existing test sets. In addition, we propose a method named OVSNet to improve the segmentation performance for diverse and open scenarios. Through elaborate fusion of heterogeneous features and cost-free expansion of the training space, OVSNet achieves state-of-the-art results on both existing datasets and our proposed OpenBench. Corresponding analyses demonstrate the soundness and effectiveness of our proposed benchmark and method.

[15] STAR-Pose: Efficient Low-Resolution Video Human Pose Estimation via Spatial-Temporal Adaptive Super-Resolution cs.CV

Yucheng Jin, Jinyan Chen, Ziyue He, Baojun Han, Furan An

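TL;DR: STAR-Pose is an efficient framework for human pose estimation in low-resolution video. Its spatial-temporal adaptive super-resolution addresses the limits that existing methods face when deployed in resource-constrained environments.

Details

Motivation: Human pose estimation in low-resolution video is important but challenging. Existing methods usually assume high-quality input or use computationally expensive cascaded processing, which limits their practical use.

Result: On several mainstream video human pose estimation datasets, STAR-Pose improves mAP by up to 5.2% at extremely low resolution (64x48) and runs inference 2.8x to 4.4x faster than cascaded approaches.

Insight: Task-oriented super-resolution design, such as a pose-aware loss, can markedly improve pose estimation in low-resolution video and offers a reference for other low-resolution vision tasks.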

Abstract: Human pose estimation in low-resolution videos presents a fundamental challenge in computer vision. Conventional methods either assume high-quality inputs or employ computationally expensive cascaded processing, which limits their deployment in resource-constrained environments. We propose STAR-Pose, a spatial-temporal adaptive super-resolution framework specifically designed for video-based human pose estimation. Our method features a novel spatial-temporal Transformer with LeakyReLU-modified linear attention, which efficiently captures long-range temporal dependencies. It is complemented by an adaptive fusion module that integrates a parallel CNN branch for local texture enhancement. We also design a pose-aware compound loss to achieve task-oriented super-resolution. This loss guides the network to reconstruct structural features that are most beneficial for keypoint localization, rather than optimizing purely for visual quality. Extensive experiments on several mainstream video HPE datasets demonstrate that STAR-Pose outperforms existing approaches. It achieves up to 5.2% mAP improvement under extremely low-resolution (64x48) conditions while delivering 2.8x to 4.4x faster inference than cascaded approaches.

[16] PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning cs.CV

Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang

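TL;DR: PR-DETR is a new dense video captioning framework that injects explicit position and relation priors into the detection transformer, improving event localization accuracy and caption quality.

Details

Motivation: Current transformer-based dense video captioning methods learn event locations and semantics implicitly, which requires large amounts of training data and limits performance. PR-DETR introduces explicit position and relation priors to improve efficiency and performance.

Result: Experiments confirm the effectiveness of the position and relation priors, and PR-DETR performs competitively on the ActivityNet Captions and YouCook2 datasets.

Insight: Explicitly introducing prior knowledge can markedly improve dense video captioning, especially when data is limited.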

Abstract: Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locations and event semantics, which requires a large amount of training data and limits the model’s performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer to improve the localization accuracy and caption quality, simultaneously. On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information about potential events as position prior, which serves as the initial event search regions to eliminate the implausible event proposals. On the other hand, we further design an event relation encoder to explicitly calculate the relationship between event boundaries as relation prior to guide the event interaction to improve the semantic coherence of the captions. Extensive ablation studies are conducted to verify the effectiveness of the position and relation prior. Experimental results also show the competitive performance of our method on ActivityNet Captions and YouCook2 datasets.

[17] AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models cs.CV

Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu

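TL;DR: AutoV automatically selects the optimal visual prompt to strengthen the reasoning of large vision-language models. It ranks candidate prompts and trains a model to pick the best one, improving performance on multiple tasks.

Details

Motivation: Designing visual prompts by hand is laborious and of limited effectiveness. AutoV aims to select the optimal prompt automatically to improve model reasoning.

Result: Experiments show that AutoV improves the performance of several LVLMs, with accuracy gains of 1.7% and 1.9% on LLaVA-OV and Qwen2.5-VL respectively.

Insight: Learning to choose the optimal visual prompt automatically outperforms manual design, markedly improves model performance, and opens a new direction for visual prompt research.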

Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV, which learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 1.7% accuracy gain on LLaVA^Wild, and AutoV boosts Qwen2.5-VL by 1.9% on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.
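The labeling step described in the abstract, ranking candidate visual prompts by the prediction loss a pre-trained LVLM assigns to them, can be sketched as follows. The candidate names and loss values are made up for illustration, and the dictionary-based interface is an assumption; the authors' pipeline is not specified at this level in the abstract.

```python
def rank_prompts_by_loss(losses):
    """Return candidate names ordered from lowest to highest prediction loss."""
    return sorted(losses, key=losses.get)

def ranking_targets(losses):
    """Turn the ordering into rank labels (0 = best) usable as supervision."""
    order = rank_prompts_by_loss(losses)
    return {name: rank for rank, name in enumerate(order)}

# Hypothetical losses for three hypothetical candidate visual prompts.
EXAMPLE_LOSSES = {
    "heatmap_overlay": 0.42,
    "cropped_region": 0.31,
    "original_image": 0.58,
}
```

A selector trained on such rank labels only needs to reproduce the ordering, not the raw loss values, which is why ranking is a convenient supervision signal.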

[18] FastInit: Fast Noise Initialization for Temporally Consistent Video Generation cs.CVPDF

Chengyu Bai, Yuming Li, Zhongyu Zhao, Jintao Chen, Peidong Jia

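TL;DR: FastInit is a fast noise initialization method that produces refined initial noise in a single forward pass, significantly improving the efficiency and temporal consistency of video generation without iterative optimization.

Details

Motivation: Existing video generation methods such as FreeInit improve temporal consistency by iteratively refining the initial noise, but at a high computational cost. FastInit aims to solve this problem.

Result: Experiments show that FastInit significantly improves the quality and temporal consistency of generated videos and applies broadly to various text-to-video models.

Insight: A single forward pass is enough to produce refined noise, giving video generation an efficient and practical solution.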

Abstract: Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.

[19] Neurosymbolic Object-Centric Learning with Distant Supervision cs.CVPDF

Stefano Colamonaco, David Debot, Giuseppe Marra

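TL;DR: Proposes DeepObjectLog, a neurosymbolic method that learns object-centric representations directly from raw perceptual data using only distant supervision. Combining a perceptual module with a symbolic reasoning layer improves generalization.

Details

Motivation: Existing methods rely on object-level supervision or a predefined decomposition, which limits the flexibility of learning object-centric representations from raw data.

Result: Outperforms neural and neurosymbolic baselines on unseen object compositions, unseen tasks, and unseen numbers of objects.

Insight: The symbolic reasoning layer effectively complements the perceptual module and improves the model's generalization.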

Abstract: Relational learning enables models to generalize across structured domains by reasoning over objects and their interactions. While recent advances in neurosymbolic reasoning and object-centric learning bring us closer to this goal, existing systems rely either on object-level supervision or on a predefined decomposition of the input into objects. In this work, we propose a neurosymbolic formulation for learning object-centric representations directly from raw unstructured perceptual data and using only distant supervision. We instantiate this approach in DeepObjectLog, a neurosymbolic model that integrates a perceptual module, which extracts relevant object representations, with a symbolic reasoning layer based on probabilistic logic programming. By enabling sound probabilistic logical inference, the symbolic component introduces a novel learning signal that further guides the discovery of meaningful objects in the input. We evaluate our model across a diverse range of generalization settings, including unseen object compositions, unseen tasks, and unseen number of objects. Experimental results show that our method outperforms neural and neurosymbolic baselines across the tested settings.

[20] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning cs.CV | cs.AI | cs.CL | cs.LGPDF

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng

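TL;DR: The paper proposes GRPO-CARE, a reinforcement learning framework for multimodal large language models (MLLMs) that uses a consistency-aware reward to improve both the logical coherence of reasoning and answer accuracy.

Details

Motivation: Existing reinforcement learning methods such as GRPO perform poorly in multimodal settings: reasoning and answers are often inconsistent, and both adaptation to complex video data and rigorous evaluation standards are lacking.

Result: On SEED-Bench-R1, GRPO-CARE improves over standard GRPO by 6.7% in performance on the hardest evaluation level and by 24.5% in consistency, and it transfers well across benchmarks.

Insight: A consistency reward effectively addresses the logical inconsistency that arises when reinforcement learning optimizes only the final answer, offering a new approach to training multimodal models.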

Abstract: Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model’s reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
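
The two-tiered reward can be sketched roughly as follows. The abstract does not specify how the comparison against group peers is made, so this sketch assumes the bonus goes to correct samples whose reference-model likelihood exceeds the mean over correct peers in the same group; the bonus size of 0.5 and the function name are likewise hypothetical.

```python
def care_rewards(correct, ref_likelihoods, bonus=0.5):
    """Toy two-tier reward for one group of sampled responses.

    correct: list of bools, whether each sampled answer is right.
    ref_likelihoods: list of floats, a reference model's likelihood of the
        answer given that sample's reasoning trace.
    bonus: assumed size of the consistency bonus (not from the paper).
    """
    if len(correct) != len(ref_likelihoods):
        raise ValueError("correct and ref_likelihoods must have equal length")

    # Tier 1: base reward for answer correctness.
    rewards = [1.0 if is_right else 0.0 for is_right in correct]

    # Tier 2: consistency bonus, compared against correct peers only (assumption).
    peers = [p for is_right, p in zip(correct, ref_likelihoods) if is_right]
    if not peers:
        return rewards
    peer_mean = sum(peers) / len(peers)
    for i, (is_right, p) in enumerate(zip(correct, ref_likelihoods)):
        if is_right and p > peer_mean:
            rewards[i] += bonus
    return rewards


# Hypothetical group of four sampled responses.
group_rewards = care_rewards([True, True, False, True], [0.9, 0.4, 0.95, 0.6])
```

Note that the incorrect sample receives no bonus even though its likelihood is high, which mirrors the abstract's point that rewards are amplified only for paths that are both correct and consistent.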

[21] MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models cs.CVPDF

Xingbai Chen, Tingchao Fu, Renyang Liu, Wei Zhou, Chao Yi

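TL;DR: The paper proposes a multimodal bidirectional attack on Referring Expression Segmentation (RES) models that jointly optimizes over the image and text modalities to improve the cross-text transferability of adversarial examples.

Details

Motivation: RES models perform well at segmenting images from natural language descriptions, but their adversarial robustness has received little study. Existing adversarial attacks perform poorly on the multimodal structure and adapt badly to diverse text inputs.

Result: Experiments on multiple RES models and benchmark datasets show that the method attacks more effectively than existing methods.

Insight: Assessing the adversarial robustness of multimodal models requires joint optimization over the image and text modalities; a bidirectional attack exposes their vulnerabilities more fully.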

Abstract: Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions, offering high flexibility and broad applicability in real-world vision tasks. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. While prior adversarial attack methods have explored adversarial robustness on conventional segmentation models, they perform poorly when directly applied to RES, failing to expose vulnerabilities in its multimodal structure. Moreover, in practical open-world scenarios, users typically issue multiple, diverse referring expressions to interact with the same image, highlighting the need for adversarial examples that generalize across varied textual inputs. To address these multimodal challenges, we propose a novel adversarial attack strategy termed Multimodal Bidirectional Attack, tailored for RES models. Our method introduces learnable proxy textual embedding perturbation and jointly performs visual-aligned optimization on the image modality and textual-adversarial optimization on the textual modality during attack generation. This dual optimization framework encourages adversarial images to actively adapt to more challenging text embedding during optimization, thereby enhancing their cross-text transferability, which refers to the ability of adversarial examples to remain effective under a variety of unseen or semantically diverse textual inputs. Extensive experiments conducted on multiple RES models and benchmark datasets demonstrate the superior effectiveness of our method compared to existing methods.

[22] Co-Speech Gesture and Facial Expression Generation for Non-Photorealistic 3D Characters cs.CV | I.2.10PDF

Taisei Omine, Naoyuki Kawabata, Fuminori Homma

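TL;DR: The paper proposes a method for generating co-speech gestures and facial expressions for non-photorealistic 3D characters, using expression data from comics and dialogue-specific semantic gestures to achieve exaggerated emotional expression. A user study shows significant improvement over existing research.

Details

Motivation: As conversational AI advances, research on bodily expression such as gestures and facial expressions is progressing too. However, existing studies mostly target photorealistic avatars and do not suit non-photorealistic characters such as those in anime. This paper aims to fill that gap.

Result: A user study shows that the method significantly outperforms existing research in several respects, particularly in emotional expression for non-photorealistic characters.

Insight: Emotional expression for non-photorealistic characters calls for its own style of exaggeration, and comics are a rich source of such data. Combining dialogue semantics with gestures makes expression more accurate and more personalized.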

Abstract: With the advancement of conversational AI, research on bodily expressions, including gestures and facial expressions, has also progressed. However, many existing studies focus on photorealistic avatars, making them unsuitable for non-photorealistic characters, such as those found in anime. This study proposes methods for expressing emotions, including exaggerated expressions unique to non-photorealistic characters, by utilizing expression data extracted from comics and dialogue-specific semantic gestures. A user study demonstrated significant improvements across multiple aspects when compared to existing research.

[23] Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization cs.CVPDF

Jiyao Wang, Xiao Yang, Hao Lu, Dengbo He, Kaishun Wu

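TL;DR: The paper proposes GAP, a prior-based unified framework for multi-task remote physiological measurement that bridges the gap between multi-source synsemantic domain generalization (MSSDG) and test-time personalized adaptation (TTPA), and validates it on multiple datasets.

Details

Motivation: A significant gap separates multi-source domain generalization from personalized adaptation, and existing methods are hard to fuse. In addition, partial labels and environmental noise can harm task-specific accuracy. A unified framework is needed to address these problems.

Result: Extensive experiments on six public datasets and a newly introduced driving dataset validate the framework.

Insight: By disentangling information and introducing priors, GAP handles generalization and personalization in one framework, giving remote physiological measurement an efficient and flexible solution.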

Abstract: Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these metrics and attracts increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, the test-time personalized adaptation (TTPA) after MSSDG is also worth exploring, while the gap between previous generalization and personalization methods is significant and hard to fuse. Thus, we proposed a unified framework for MSSDG and TTPA employing Priors (GAP) in biometrics and remote photoplethysmography (rPPG). We first disentangled information from face videos into invariant semantics, individual bias, and noise. Then, multiple modules incorporating priors and our observations were applied in different stages and for different facial information. Then, based on the different principles of achieving generalization and personalization, our framework could simultaneously address MSSDG and TTPA under multi-task remote physiological estimation with minimal adjustments. We expanded the MSSDG benchmark to the TTPA protocol on six publicly available datasets and introduced a new real-world driving dataset with complete labeling. Extensive experiments validated our approach, and the code along with the new dataset will be released.

[24] Integrating Generative Adversarial Networks and Convolutional Neural Networks for Enhanced Traffic Accidents Detection and Analysis cs.CVPDF

Zhenghao Xi, Xiang Liu, Yaqi Liu, Yitong Cai, Yangyu Zheng

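TL;DR: The paper combines Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs) to address data scarcity and monitoring issues in traffic accident detection, proposing a high-accuracy real-time detection framework that reaches up to 95% accuracy.

Details

Motivation: The rising number of traffic accidents worldwide calls for an intelligent, efficient, and automated way to detect accidents in order to improve road safety and save lives.

Result: The FTCNN and VIT models reached 94% and 95% accuracy respectively, clearly above the 88% of the plain CNN model.

Insight: The framework lays a foundation for intelligent surveillance systems and suits real-time traffic monitoring and integration into smart-city emergency management systems.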

Abstract: Accident detection using Closed Circuit Television (CCTV) footage is one of the most imperative features for enhancing transport safety and efficient traffic control. To this end, this research addresses the issues of supervised monitoring and data deficiency in accident detection systems by adapting excellent deep learning technologies. The motivation arises from rising statistics in the number of car accidents worldwide; this calls for innovation and the establishment of a smart, efficient and automated way of identifying accidents and calling for help to save lives. Addressing the problem of the scarcity of data, the presented framework joins Generative Adversarial Networks (GANs) for synthesizing data and Convolutional Neural Networks (CNN) for model training. Video frames for accidents and non-accidents are collected from YouTube videos, and we perform resizing, image enhancement, and pixel-range normalisation. Three models are used: CNN, Fine-tuned Convolutional Neural Network (FTCNN), and Vision Transformer (VIT). FTCNN and VIT worked best for detecting accidents from CCTV, obtaining accuracy rates of 94% and 95%, respectively, while the CNN model obtained 88%. Such results show that the proposed framework suits traffic safety applications due to its high real-time accident detection capabilities and broad-scale applicability. This work lays the foundation for intelligent surveillance systems in the future for real-time traffic monitoring, smart city framework, and integration of intelligent surveillance systems into emergency management systems.

[25] VideoGAN-based Trajectory Proposal for Automated Vehicles cs.CV | cs.LGPDF

Annajoyce Mariani, Kira Maag, Hanno Gottschalk

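TL;DR: The paper proposes a video-GAN-based trajectory generation method for trajectory planning in automated vehicles. A GAN is trained on bird's-eye-view videos, abstract trajectory data is extracted from the generated videos, and both training and inference are fast.

Details

Motivation: Traditional trajectory generation methods struggle to capture the complex multimodal distribution of future trajectories, so the authors explore using a GAN to generate accurate trajectory data from bird's-eye-view videos.

Result: Training completes within 100 GPU hours, inference takes under 20 ms, and the generated trajectories align with the real data distribution in their spatial and dynamic parameters.

Insight: GANs are efficient for trajectory generation and can capture complex multimodal distributions, while bird's-eye-view videos provide a rich source of information for the task.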

Abstract: Being able to generate realistic trajectory options is at the core of increasing the degree of automation of road vehicles. While model-driven, rule-based, and classical learning-based methods are widely used to tackle these tasks at present, they can struggle to effectively capture the complex, multimodal distributions of future trajectories. In this paper we investigate whether a generative adversarial network (GAN) trained on videos of bird’s-eye view (BEV) traffic scenarios can generate statistically accurate trajectories that correctly capture spatial relationships between the agents. To this end, we propose a pipeline that uses low-resolution BEV occupancy grid videos as training data for a video generative model. From the generated videos of traffic scenarios we extract abstract trajectory data using single-frame object detection and frame-to-frame object matching. We particularly choose a GAN architecture for the fast training and inference times with respect to diffusion models. We obtain our best results within 100 GPU hours of training, with inference times under 20 ms. We demonstrate the physical realism of the proposed trajectories in terms of distribution alignment of spatial and dynamic parameters with respect to the ground truth videos from the Waymo Open Motion Dataset.
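
The frame-to-frame object matching step can be illustrated with a minimal sketch. The abstract does not say which matching rule the authors use, so this assumes greedy nearest-centroid matching with a distance cutoff; the function names, the `max_dist` parameter, and the sample detections are all hypothetical.

```python
import math


def match_detections(prev, curr, max_dist=2.0):
    """Greedily pair detections in consecutive frames by centroid distance.

    prev, curr: lists of (x, y) centroids. Returns sorted (prev_idx, curr_idx) pairs.
    """
    candidates = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
    )
    used_prev, used_curr, matches = set(), set(), []
    for dist, i, j in candidates:
        if dist > max_dist:
            break  # all remaining candidates are even farther apart
        if i in used_prev or j in used_curr:
            continue
        used_prev.add(i)
        used_curr.add(j)
        matches.append((i, j))
    return sorted(matches)


def build_trajectories(frames, max_dist=2.0):
    """Link per-frame detections into trajectories (lists of (x, y) points)."""
    if not frames:
        return []
    trajectories = [[p] for p in frames[0]]
    # Maps a detection index in the latest frame to its trajectory index.
    active = {i: i for i in range(len(frames[0]))}
    for prev, curr in zip(frames, frames[1:]):
        new_active = {}
        for i, j in match_detections(prev, curr, max_dist):
            track = active[i]
            trajectories[track].append(curr[j])
            new_active[j] = track
        # Unmatched detections start new trajectories.
        for j, point in enumerate(curr):
            if j not in new_active:
                trajectories.append([point])
                new_active[j] = len(trajectories) - 1
        active = new_active
    return trajectories


# Hypothetical detections: two agents over three frames.
frames = [[(0, 0), (10, 0)], [(1, 0), (10, 1)], [(2, 0), (10, 2)]]
tracks = build_trajectories(frames)
```

Greedy matching is simple but not optimal; an assignment solver would be a natural replacement when agents are close together.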

[26] FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models cs.CVPDF

Xinting Liao, Weiming Liu, Jiaming Qian, Pengyang Zhou, Jiahe Xu

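TL;DR: The paper proposes the FOCoOp framework, which uses ID global prompts, local prompts, and OOD prompts to make federated prompt learning (FPL) more robust in out-of-distribution (OOD) scenarios, addressing the trade-off between performance and robustness in existing methods.

Details

Motivation: Existing federated prompt learning methods perform poorly in out-of-distribution (OOD) scenarios, and heterogeneous client data makes it hard to balance performance and robustness. FOCoOp aims to solve this problem.

Result: Experiments on real-world datasets show that FOCoOp effectively captures distribution heterogeneity and significantly improves robustness to OOD shifts.

Insight: Combining global and local prompts and calibrating OOD prompts lets federated learning improve generalization in out-of-distribution scenarios while preserving privacy.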

Abstract: Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly in out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distributionally robust optimization. Additionally, FOCoOp improves the discrimination consistency among clients, i.e., calibrating global prompts, seemingly OOD prompts, and OOD prompts by semi-unbalanced optimal transport. Extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts. The project is available at GitHub.

[27] R3eVision: A Survey on Robust Rendering, Restoration, and Enhancement for 3D Low-Level Vision cs.CVPDF

Weeyoung Kwon, Jeahun Sung, Minkyu Jeon, Chanho Eom, Jihyong Oh

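TL;DR: The survey R3eVision reviews robust rendering, restoration, and enhancement techniques for 3D low-level vision (3D LLV), focusing on 3D reconstruction and view synthesis under degradations such as noise and blur.

Details

Motivation: Existing neural rendering methods such as NeRF and 3DGS typically assume clean, high-resolution multi-view inputs, but real scenes often contain noise, blur, and other degradations, which limits robustness. Extending classical 2D low-level vision tasks into 3D space is therefore increasingly pressing.

Result: The survey shows that 3D LLV techniques matter for applications such as autonomous driving and AR/VR, where they enable reliable 3D perception from degraded inputs.

Insight: 3D LLV is a key direction for making 3D content generation and reconstruction robust in real scenes; future work needs to address challenges such as spatio-temporal consistency.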

Abstract: Neural rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved significant progress in photorealistic 3D scene reconstruction and novel view synthesis. However, most existing models assume clean and high-resolution (HR) multi-view inputs, which limits their robustness under real-world degradations such as noise, blur, low-resolution (LR), and weather-induced artifacts. To address these limitations, the emerging field of 3D Low-Level Vision (3D LLV) extends classical 2D Low-Level Vision tasks including super-resolution (SR), deblurring, weather degradation removal, restoration, and enhancement into the 3D spatial domain. This survey, referred to as R3eVision, provides a comprehensive overview of robust rendering, restoration, and enhancement for 3D LLV by formalizing the degradation-aware rendering problem and identifying key challenges related to spatio-temporal consistency and ill-posed optimization. Recent methods that integrate LLV into neural rendering frameworks are categorized to illustrate how they enable high-fidelity 3D reconstruction under adverse conditions. Application domains such as autonomous driving, AR/VR, and robotics are also discussed, where reliable 3D perception from degraded inputs is critical. By reviewing representative methods, datasets, and evaluation protocols, this work positions 3D LLV as a fundamental direction for robust 3D content generation and scene-level reconstruction in real-world environments.

[28] Fine-grained Image Retrieval via Dual-Vision Adaptation cs.CV | cs.MMPDF

Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li

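TL;DR: The paper proposes Dual-Vision Adaptation (DVA) for fine-grained image retrieval (FGIR), which guides a pre-trained model to perform the task through collaborative sample and feature adaptation while avoiding overfitting and loss of generalization.

Details

Motivation: Existing fine-grained image retrieval methods tend to overfit the training data and neglect the knowledge gained from large-scale pre-training, which weakens generalization.

Result: DVA performs well on three in-distribution and three out-of-distribution fine-grained datasets with few learnable parameters.

Insight: Freezing the pre-trained parameters and adding adaptation mechanisms improves fine-grained task performance while preserving generalization.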

Abstract: Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, these two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.

[29] SycnMapV2: Robust and Adaptive Unsupervised Segmentation cs.CV | cs.AI | cs.LGPDF

Heng Zhang, Zikang Wan, Danilo Vasconcellos Vargas

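TL;DR: SyncMapV2 is an unsupervised segmentation method with strong robustness and adaptability; it maintains high performance under noise and corruption and needs no re-initialization.

Details

Motivation: Human vision segments visual signals efficiently without supervision and stays robust as noise increases, whereas existing AI algorithms perform poorly under such conditions.

Result: Under digital corruption, mIoU drops by only 0.01%, far better than other SOTA methods, which drop by 23.8%. The method also performs well under noise, weather, and blur corruption.

Insight: SyncMapV2's online adaptability offers a new design approach for the next generation of robust and adaptive intelligent systems.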

Abstract: Human vision excels at segmenting visual cues without the need for explicit training, and it remains remarkably robust even as noise severity increases. In contrast, existing AI algorithms struggle to maintain accuracy under similar conditions. Here, we present SyncMapV2, the first to solve unsupervised segmentation with state-of-the-art robustness. SyncMapV2 exhibits a minimal drop in mIoU, only 0.01%, under digital corruption, compared to a 23.8% drop observed in SOTA methods. This superior performance extends across various types of corruption: noise (7.3% vs. 37.7%), weather (7.5% vs. 33.8%), and blur (7.0% vs. 29.5%). Notably, SyncMapV2 accomplishes this without any robust training, supervision, or loss functions. It is based on a learning paradigm that uses self-organizing dynamical equations combined with concepts from random networks. Moreover, unlike conventional methods that require re-initialization for each new input, SyncMapV2 adapts online, mimicking the continuous adaptability of human vision. Thus, we go beyond the accurate and robust results, and present the first algorithm that can do all the above online, adapting to input rather than re-initializing. In adaptability tests, SyncMapV2 demonstrates near-zero performance degradation, which motivates and fosters a new generation of robust and adaptive intelligence in the near future.

[30] Segment Anything for Satellite Imagery: A Strong Baseline and a Regional Dataset for Automatic Field Delineation cs.CV | cs.AIPDF

Carmelo Scribano, Elena Govi, Paolo Bertellini, Simone Parisi, Giorgia Franchini

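TL;DR: The paper presents a pipeline for automatic agricultural field delineation based on the Segment Anything Model (SAM), with a fine-tuning strategy that adapts SAM to the task. It also releases a new regional dataset, ERAS, that covers areas beyond existing sources. Experiments verify the method's accuracy and generalization.

Details

Motivation: Automatic delineation of field boundaries from high-resolution satellite imagery is essential for efficient agricultural operations, while existing approaches depend on costly ground surveys. Computer vision offers an alternative, but it needs models and datasets adapted to the task.

Result: Experiments show that the method delineates fields accurately and generalizes well. The release of the ERAS dataset fills gaps in existing datasets.

Insight: Fine-tuning a pre-trained general-purpose segmentation model such as SAM can markedly improve performance on a specific task such as field delineation, and building complementary regional datasets is essential for generalization.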

Abstract: Accurate mapping of agricultural field boundaries is essential for the efficient operation of agriculture. Automatic extraction from high-resolution satellite imagery, supported by computer vision techniques, can avoid costly ground surveys. In this paper, we present a pipeline for field delineation based on the Segment Anything Model (SAM), introducing a fine-tuning strategy to adapt SAM to this task. In addition to using published datasets, we describe a method for acquiring a complementary regional dataset that covers areas beyond current sources. Extensive experiments assess segmentation accuracy and evaluate the generalization capabilities. Our approach provides a robust baseline for automated field delineation. The new regional dataset, known as ERAS, is now publicly available.

Arpit Jadon, Haoran Wang, Phillip Thomas, Michael Stanley, S. Nathaniel Cibik

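TL;DR: RealDriveSim is a multi-modal, multi-task synthetic dataset for autonomous driving with high realism and fine-grained annotations (up to 64 classes). It supports 2D computer vision and LiDAR applications and outperforms existing synthetic datasets.

Details

Motivation: Existing synthetic datasets lack realism and task diversity, which limits their use in training perception models for autonomous driving. RealDriveSim aims to solve this by providing realistic data with multi-task support.

Result: Evaluated across multiple applications and domains, RealDriveSim outperforms existing synthetic datasets and reaches state-of-the-art results.

Insight: High realism and multi-modal support are key to making synthetic datasets more useful, offering a more efficient way to train perception models for autonomous driving.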

Abstract: As perception models continue to develop, the need for large-scale datasets increases. However, data annotation remains far too expensive to effectively scale and meet the demand. Synthetic datasets provide a solution to boost model performance with substantially reduced costs. However, current synthetic datasets remain limited in scope and realism, and are designed for specific tasks and applications. In this work, we present RealDriveSim, a realistic multi-modal synthetic dataset for autonomous driving that not only supports popular 2D computer vision applications but also their LiDAR counterparts, providing fine-grained annotations for up to 64 classes. We extensively evaluate our dataset for a wide range of applications and domains, demonstrating state-of-the-art results compared to existing synthetic benchmarks. The dataset is publicly available at https://realdrivesim.github.io/.

[32] Reliable Few-shot Learning under Dual Noises cs.CV | cs.AIPDF

Ji Zhang, Jingkuan Song, Lianli Gao, Nicu Sebe, Heng Tao Shen

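TL;DR: This paper proposes DETA++, which tackles ID and OOD noise in few-shot learning through denoised task adaptation and noise-robust design.

Details

Motivation: Few-shot learning in open-world tasks is often affected by ID and OOD noise, and with limited samples traditional methods struggle to adapt and predict reliably.

Result: Experiments validate the effectiveness and flexibility of DETA++ in noisy settings.

Insight: Combining denoising with adaptive strategies can markedly improve the robustness of few-shot learning.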

Abstract: Recent advances in model pre-training give rise to task adaptation-based few-shot learning (FSL), where the goal is to adapt a pre-trained task-agnostic model for capturing task-specific knowledge with a few labeled support samples of the target task. Nevertheless, existing approaches may still fail in the open world due to the inevitable in-distribution (ID) and out-of-distribution (OOD) noise from both support and query samples of the target task. With limited support samples available, i) the adverse effect of the dual noises can be severely amplified during task adaptation, and ii) the adapted model can produce unreliable predictions on query samples in the presence of the dual noises. In this work, we propose DEnoised Task Adaptation (DETA++) for reliable FSL. DETA++ uses a Contrastive Relevance Aggregation (CoRA) module to calculate image and region weights for support samples, based on which a clean prototype loss and a noise entropy maximization loss are proposed to achieve noise-robust task adaptation. Additionally, DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearest Centroid Classifier (LocalNCC) is devised to yield noise-robust predictions on query samples. Moreover, DETA++ utilizes an Intra-class Region Swapping (IntraSwap) strategy to rectify ID class prototypes during task adaptation, enhancing the model’s robustness to the dual noises. Extensive experiments demonstrate the effectiveness and flexibility of DETA++.

[33] Transparency Techniques for Neural Networks trained on Writer Identification and Writer Verification cs.CVPDF

Viktoria Pundy, Marco Peer, Florian Kleber

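TL;DR: The paper applies two transparency techniques to neural networks for writer identification and verification for the first time, using pixel-level and point-specific saliency maps to improve interpretability and support forensic experts in analyzing similarities in handwritten text.

Details

Motivation: Neural networks perform well at writer identification and verification, but their “black box” nature limits improvements in performance and reliability. The paper uses transparency techniques to reveal what the models base their decisions on, in support of forensic experts' work.

Result: Pixel-level saliency maps outperform point-specific saliency maps in quantitative evaluation, and their highlighted regions agree with the areas forensic experts attend to, making them suitable for supporting forensic analysis.

Insight: Transparency techniques not only improve interpretability but also help experts understand the model's decision process, which in turn can improve model performance and practical use.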

Abstract: Neural Networks are the state of the art for many tasks in the computer vision domain, including Writer Identification (WI) and Writer Verification (WV). The transparency of these “black box” systems is important for improvements of performance and reliability. For this work, two transparency techniques are applied to neural networks trained on WI and WV for the first time in this domain. The first technique provides pixel-level saliency maps, while the point-specific saliency maps of the second technique provide information on similarities between two images. The transparency techniques are evaluated using deletion and insertion score metrics. The goal is to support forensic experts with information on similarities in handwritten text and to explore the characteristics selected by a neural network for the identification process. For the qualitative evaluation, the highlights of the maps are compared to the areas forensic experts consider during the identification process. The evaluation results show that the pixel-wise saliency maps outperform the point-specific saliency maps and are suitable for the support of forensic experts.

[34] MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval cs.CVPDF

Chao He, Hongxi Wei

TL;DR: MambaHash is a visual state space deep hashing model designed for large-scale image retrieval, leveraging Mamba’s linear time complexity and advanced modules to achieve superior performance.

Details

Motivation: The paper explores the suitability of Mamba-based models for large-scale image retrieval, aiming to improve efficiency and performance over existing deep hashing methods.

Result: MambaHash outperforms state-of-the-art deep hashing methods on CIFAR-10, NUS-WIDE, and IMAGENET datasets, demonstrating efficiency and superior retrieval performance.

Insight: Mamba’s linear complexity and structured approach to channel interaction and feature enhancement can effectively scale to large retrieval tasks, offering a promising direction for future research.

Abstract: Deep image hashing aims to enable effective large-scale image retrieval by mapping the input images into simple binary hash codes through deep neural networks. More recently, Vision Mamba with linear time complexity has attracted extensive attention from researchers by achieving outstanding performance on various computer vision tasks. Nevertheless, the suitability of Mamba for large-scale image retrieval tasks still needs to be explored. Towards this end, we propose a visual state space hashing model, called MambaHash. Concretely, we propose a backbone network with stage-wise architecture, in which grouped Mamba operation is introduced to model local and global information by utilizing Mamba to perform multi-directional scanning along different groups of the channel. Subsequently, the proposed channel interaction attention module is used to enhance information communication across channels. Finally, we meticulously design an adaptive feature enhancement module to increase feature diversity and enhance the visual representation capability of the model. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that, compared with the state-of-the-art deep hashing methods, our proposed MambaHash achieves good efficiency and superior performance on large-scale image retrieval tasks. Source code is available at https://github.com/shuaichaochao/MambaHash.git
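
The retrieval setting the abstract describes, mapping images to binary hash codes and comparing them, can be illustrated with a generic sketch. This is not MambaHash itself: the real-valued features stand in for the output of any hashing network, sign thresholding is an assumed binarization rule, and the item names and values are hypothetical.

```python
def binarize(features):
    """Turn real-valued features into a binary hash code by sign thresholding."""
    return tuple(1 if value > 0 else 0 for value in features)


def hamming(code_a, code_b):
    """Count the positions where two equal-length codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def retrieve(query_code, database, top_k=2):
    """Return the ids of the top_k database items closest in Hamming distance."""
    ranked = sorted(database, key=lambda item_id: hamming(query_code, database[item_id]))
    return ranked[:top_k]


# Hypothetical 4-bit codes derived from made-up network outputs.
database = {
    "cat_1": binarize([0.9, -0.2, 0.4, -0.7]),
    "cat_2": binarize([0.8, -0.1, -0.3, -0.5]),
    "truck_1": binarize([-0.6, 0.7, -0.2, 0.9]),
}
query = binarize([0.7, -0.3, 0.5, -0.4])
nearest = retrieve(query, database)
```

Because codes are compared bit by bit, lookup cost depends on code length rather than on the dimensionality of the original features, which is what makes hashing attractive for large-scale retrieval.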

[35] Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation cs.CVPDF

Pallabi Dutta, Anubhab Maity, Sushmita Mitra

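TL;DR: The paper proposes a prompt-based dynamic token pruning method that reduces the processing of irrelevant tokens in ViTs, improving the efficiency and accuracy of medical image segmentation.

Details

Motivation: Vision Transformers (ViTs) are constrained in medical image analysis by their high computational demands, so a method is needed that dynamically prunes irrelevant tokens to improve computational efficiency.

Result: Experiments show a 35-55% reduction in tokens and lower computational cost while preserving segmentation accuracy.

Insight: Dynamic token pruning can markedly improve ViT efficiency and has practical value in resource-constrained medical image analysis.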

Abstract: The high computational demands of Vision Transformers (ViTs), in processing a huge number of tokens, often constrain their practical application in analyzing medical images. This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low-relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy facilitates end-to-end training, maintains gradient flow, and improves segmentation accuracy by focusing computational resources on essential regions. The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens, thereby enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of approximately 35-55% of tokens, thus reducing the computational costs relative to the baselines. Cost-effective medical image processing, using our framework, facilitates real-time diagnosis by expanding its applicability in resource-constrained environments.
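
The rank-then-down-weight idea can be sketched in a few lines. The abstract does not give the weighting rule, so this assumes a hard top-fraction cut in which retained tokens keep weight 1.0 and the rest receive a fixed low weight; `keep_ratio`, `low_weight`, and the sample values are hypothetical.

```python
def prune_weights(relevance, keep_ratio=0.5, low_weight=0.0):
    """Return one weight per token: 1.0 for the most relevant, low_weight otherwise."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(relevance) * keep_ratio))
    # Rank token indices from most to least relevant.
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    keep = set(order[:n_keep])
    return [1.0 if i in keep else low_weight for i in range(len(relevance))]


def apply_weights(tokens, weights):
    """Scale each token embedding (a list of floats) by its weight."""
    return [[weight * value for value in token] for token, weight in zip(tokens, weights)]


# Hypothetical relevance scores derived from a prompt-based spatial prior.
relevance = [0.9, 0.1, 0.6, 0.2]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = prune_weights(relevance)
pruned = apply_weights(tokens, weights)
```

Scaling tokens rather than deleting them keeps tensor shapes fixed, which is one plausible way to preserve gradient flow as the abstract mentions; actually skipping the down-weighted tokens is where the compute savings would come from.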

[36] AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios cs.CVPDF

Yunhao Hou, Bochao Zou, Min Zhang, Ran Chen, Shangdong Yang

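TL;DR: AGC-Drive is the first large-scale real-world dataset for aerial-ground collaborative 3D perception, filling a gap in collaborative perception between UAVs and vehicles.

Details

Motivation: Most collaborative perception research focuses on vehicle-to-vehicle or vehicle-to-infrastructure settings and overlooks the dynamic top-down views that UAVs provide. The lack of high-quality aerial-ground collaboration datasets has limited progress in this area.

Result: The dataset contains 400 scenes of approximately 100 frames each, annotated with 3D bounding boxes for 13 object categories. Benchmarks are provided for vehicle-to-vehicle and vehicle-to-UAV collaborative perception.

Insight: The top-down views provided by UAVs offer a unique advantage for mitigating occlusion and monitoring large-scale interactive environments; future work can further explore their use in complex traffic scenarios.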

Abstract: By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. Most previous work focuses on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to the aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 120K LiDAR frames and 440K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 19.5% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 400 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.

[37] CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset cs.CV | cs.AI | cs.LGPDF

Santosh Patapati, Trisanth Srinivasan, Amith Adiraju

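TL;DR: The paper proposes CLIP-MG, a modified CLIP-based model that fuses skeletal pose features with RGB data to recognize micro-gestures, achieving 61.82% Top-1 accuracy on the iMiGUE dataset.

Details

Motivation: Micro-gesture recognition is highly challenging because the gestures are subtle and involuntary, and existing methods struggle to capture their characteristics. The paper aims to exploit the capabilities of CLIP combined with pose information to improve recognition.

Result: 61.82% Top-1 accuracy on the iMiGUE dataset, indicating that the approach has potential but leaves room for improvement.

Insight: Combining vision-language models such as CLIP with pose information can improve micro-gesture recognition, but fully adapting them to this task requires further research.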

Abstract: Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.

[38] HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis cs.CVPDF

Peixiang Huang, Yanyan Huang, Weiqin Zhao, Junjun He, Lequan Yu

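TL;DR: HyperPath is a novel method that models the semantic hierarchy of WSIs in hyperbolic space and combines visual and textual features to improve WSI classification.

Details

Motivation: Conventional methods rely mainly on Euclidean embeddings, which struggle to fully capture the semantic hierarchy of WSIs and therefore limit classification performance.

Result: The method outperforms existing methods across multiple tasks, supporting the potential of hyperbolic embeddings for WSI analysis.

Insight: Hyperbolic space is better suited to modeling the hierarchical structure of WSIs, and combining cross-modal information can significantly improve the accuracy of semantic associations.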

Abstract: Pathology is essential for cancer diagnosis, with multiple instance learning (MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural hierarchy – patches, regions, and slides – with distinct semantic associations. While some methods attempt to leverage this hierarchy for improved representation, they predominantly rely on Euclidean embeddings, which struggle to fully capture semantic hierarchies. To address this limitation, we propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies of WSIs in hyperbolic space, thereby enhancing WSI classification. Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space. We design an Angular Modality Alignment Loss to ensure robust cross-modal alignment, while a Semantic Hierarchy Consistency Loss further refines feature hierarchies through entailment and contradiction relationships, thereby enhancing semantic coherence. The classification is performed with geodesic distance, which measures the similarity between entities in the hyperbolic semantic hierarchy. This eliminates the need for linear classifiers and enables a geometry-aware approach to WSI analysis. Extensive experiments show that our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.
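Classification by geodesic distance, as mentioned in this abstract, can be sketched as nearest-prototype lookup in hyperbolic space. The abstract does not say which hyperbolic model HyperPath uses, so the sketch below assumes the Poincaré ball with curvature -1; the prototype names and coordinates are made up for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def classify(embedding, prototypes):
    """Return the label whose prototype is closest in geodesic distance."""
    return min(prototypes, key=lambda name: poincare_distance(embedding, prototypes[name]))


# Hypothetical class prototypes (e.g. embedded text descriptions).
prototypes = {"tumor": [0.6, 0.0], "normal": [-0.6, 0.0]}
label = classify([0.3, 0.1], prototypes)
```

Because the decision is an argmin over distances, no separate linear classification layer is needed, which matches the motivation stated in the abstract.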

Dong Nguyen Tien, Dung D. Le

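TL;DR: The paper proposes the first multi-modal adversarial attack framework for evaluating the robustness of OCR-based document understanding models, and finds that compound attacks and line-level attacks degrade model performance the most.

Details

Motivation: Visual Document Understanding (VDU) systems perform strongly, but their robustness under adversarial perturbations is under-studied and calls for systematic evaluation.

Result: Across four datasets and six model families, line-level attacks and compound perturbations (BBox + Pixel + Text) cause the largest performance drops; PGD-based BBox perturbations outperform random perturbations.

Insight: Models are more sensitive to layout perturbations, compound attacks are markedly effective, and multi-modal defenses for VDU systems deserve attention.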

Abstract: Visual Document Understanding (VDU) systems have achieved strong performance in information extraction by integrating textual, layout, and visual signals. However, their robustness under realistic adversarial perturbations remains insufficiently explored. We introduce the first unified framework for generating and evaluating multi-modal adversarial attacks on OCR-based VDU models. Our method covers six gradient-based layout attack scenarios, incorporating manipulations of OCR bounding boxes, pixels, and texts across both word and line granularities, with constraints on layout perturbation budget (e.g., IoU >= 0.6) to preserve plausibility. Experimental results across four datasets (FUNSD, CORD, SROIE, DocVQA) and six model families demonstrate that line-level attacks and compound perturbations (BBox + Pixel + Text) yield the most severe performance degradation. Projected Gradient Descent (PGD)-based BBox perturbations outperform random-shift baselines in all investigated models. Ablation studies further validate the impact of layout budget, text modification, and adversarial transferability.
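The layout perturbation budget in this abstract (e.g., IoU >= 0.6) can be checked with a few lines of code. This is a minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the helper names `shift_box` and `within_budget` are hypothetical, and the gradient-based search for the shift itself is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shift_box(box, dx, dy):
    """Translate a box without changing its size."""
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def within_budget(original, perturbed, min_iou=0.6):
    """True if the perturbed box still overlaps the original enough."""
    return iou(original, perturbed) >= min_iou


box = (0, 0, 10, 10)
small_shift_ok = within_budget(box, shift_box(box, 2, 0))
large_shift_ok = within_budget(box, shift_box(box, 3, 0))
```

For a 10 x 10 box, a horizontal shift of 2 gives an IoU of 80/120 and stays within the budget, while a shift of 3 gives 70/130 and would be rejected.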

[40] Efficient Transformations in Deep Learning Convolutional Neural Networks cs.CV | cs.AI | eess.IV | eess.SP | 68T07, 68T10, 94A08, 42C10PDF

Berk Yilmaz, Daniel Fidel Harvey, Prajit Dhuri

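TL;DR: The paper studies integrating signal processing transforms (FFT, WHT, DCT) into the ResNet50 CNN to trade off computational efficiency, energy consumption, and classification accuracy. Experiments show that WHT significantly reduces energy consumption while improving accuracy.

Details

Motivation: To explore how signal processing transforms such as WHT can improve computational efficiency and reduce energy consumption in deep learning while maintaining or improving model performance.

Result: The modified model using WHT raises accuracy from 66% to 79% and lowers energy consumption from 25,606 kJ to 39 kJ.

Insight: Signal processing transforms, WHT in particular, have practical potential in CNNs and can strike a good balance between energy consumption and performance.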

Abstract: This study investigates the integration of signal processing transformations – Fast Fourier Transform (FFT), Walsh-Hadamard Transform (WHT), and Discrete Cosine Transform (DCT) – within the ResNet50 convolutional neural network (CNN) model for image classification. The primary objective is to assess the trade-offs between computational efficiency, energy consumption, and classification accuracy during training and inference. Using the CIFAR-100 dataset (100 classes, 60,000 images), experiments demonstrated that incorporating WHT significantly reduced energy consumption while improving accuracy. Specifically, a baseline ResNet50 model achieved a testing accuracy of 66%, consuming an average of 25,606 kJ per model. In contrast, a modified ResNet50 incorporating WHT in the early convolutional layers achieved 74% accuracy, and an enhanced version with WHT applied to both early and late layers achieved 79% accuracy, with an average energy consumption of only 39 kJ per model. These results demonstrate the potential of WHT as a highly efficient and effective approach for energy-constrained CNN applications.
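For readers unfamiliar with the Walsh-Hadamard Transform named in this abstract, the following is a minimal sketch of the standard fast WHT (unnormalized, natural Hadamard ordering). It only illustrates the transform itself; how the authors wire it into ResNet50 layers is not described in the abstract and is not shown here.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform of a sequence whose length is a power of two.

    Uses only additions and subtractions, in O(n log n) operations.
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = data[i], data[i + h]
                data[i], data[i + h] = a + b, a - b
        h *= 2
    return data


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
# Applying the unnormalized transform twice scales the input by its length.
round_trip = fwht(spectrum)
```

The absence of multiplications is the usual argument for why the WHT is cheap relative to the FFT or DCT.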

[41] How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? cs.CVPDF

Giuseppe Lando, Rosario Forte, Giovanni Maria Farinella, Antonino Furnari

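TL;DR: The paper investigates whether off-the-shelf Multimodal Large Language Models (MLLMs) can handle online episodic-memory video question answering without additional training. It converts streaming egocentric video into a lightweight textual memory and performs strongly on a QA benchmark.

Details

Motivation: To examine whether off-the-shelf MLLMs can solve online episodic-memory video question answering and avoid the high cost of additional training.

Result: 56% accuracy on the QAEgo4D-Closed benchmark, with storage efficiency 10^4/10^5 times that of existing systems.

Insight: With a lightweight design, off-the-shelf MLLMs can handle episodic-memory tasks efficiently; future directions for improvement include component design and memory management.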

Abstract: We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10^4/10^5 times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice, and highlight directions of improvement for future research.

[42] Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors cs.CV | cs.AI | cs.CRPDF

Riccardo Ziglio, Cecilia Pasquini, Silvio Ranise

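TL;DR: The paper studies how CNN-based detectors perform at identifying the visual artifacts introduced by face swapping in videos, finding excellent performance on same-source data but limited generalization across datasets and swapping algorithms.

Details

Motivation: With advances in automated, real-time face swapping tools, face swapping manipulation of video streams is a growing threat in remote video communication. The paper explores how visual artifacts can be used to detect these manipulations.

Result: CNN architectures perform excellently on same-source data but generalize poorly across datasets and swapping algorithms, and in particular struggle to robustly capture occlusion-based visual cues.

Insight: Specialized detection strategies are needed to handle the specific visual artifacts introduced by face swapping, in order to improve robustness and generalization.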

Abstract: Face swapping manipulations in video streams represent an increasing threat in remote video communications, due to advances in automated and real-time tools. Recent literature proposes to characterize and exploit visual artifacts introduced in video frames by swapping algorithms when dealing with challenging physical scenes, such as face occlusions. This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora (including a newly collected one) and analyzing generalization capabilities with respect to different acquisition sources and swapping algorithms. The results confirm excellent performance of general-purpose CNN architectures when operating within the same data source, but a significant difficulty in robustly characterizing occlusion-based visual cues across datasets. This highlights the need for specialized detection strategies to deal with such artifacts.

[43] SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage cs.CVPDF

Tongan Cai, Haomiao Ni, Wenchao Ma, Yuan Xue, Qian Ma

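TL;DR: SafeTriage is a novel de-identification method that protects patient privacy while preserving the facial motion cues needed for stroke diagnosis, built on video motion transfer and a conditional generative model.

Details

Motivation: Facial video analysis for emergency stroke triage requires patient data, but using that data directly raises privacy and ethical concerns. SafeTriage aims to resolve this tension.

Result: Experiments show that SafeTriage-generated videos effectively preserve stroke-relevant facial patterns while providing strong privacy protection, without compromising AI classification accuracy.

Insight: Combining video motion transfer with conditional generation can balance privacy protection and medical diagnosis, offering a new approach to cross-institution data sharing and AI analysis.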

Abstract: Effective stroke triage in emergency settings often relies on clinicians’ ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges – especially when training robust and generalizable models across institutions. To address these concerns, we propose SafeTriage, a novel method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. SafeTriage leverages a pretrained video motion transfer (VMT) model to map the motion characteristics of real patient faces onto synthetic identities. This approach retains diagnostically relevant facial dynamics without revealing the patients’ identities. To mitigate the distribution shift between normal population pre-training videos and patient population test videos, we introduce a conditional generative model for visual prompt tuning, which adapts the input space of the VMT model to ensure accurate motion transfer without needing to fine-tune the VMT model backbone. Comprehensive evaluation, including quantitative metrics and clinical expert assessments, demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage. Our evaluations also show that SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders.

[44] Spatially-Aware Evaluation of Segmentation Uncertainty cs.CV | cs.AI | cs.PF | stat.MLPDF

Tal Zeevi, Eléonore V. Lieffrig, Lawrence H. Staib, John A. Onofrey

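TL;DR: The paper proposes an uncertainty evaluation approach for medical image segmentation that accounts for spatial context and anatomical structure and, compared with conventional metrics, better distinguishes important regions from meaningless ones.

Details

Motivation: Conventional uncertainty evaluation metrics ignore spatial context and anatomical structure, so they cannot distinguish different uncertainty patterns (e.g., scattered vs. boundary-aligned).

Result: Experiments show that the new metrics align better with clinically important factors and can distinguish meaningful from spurious uncertainty patterns.

Insight: Spatial context and structural information are crucial for evaluating uncertainty in medical image segmentation and help improve clinical utility.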

Abstract: Uncertainty maps highlight unreliable regions in segmentation predictions. However, most uncertainty evaluation metrics treat voxels independently, ignoring spatial context and anatomical structure. As a result, they may assign identical scores to qualitatively distinct patterns (e.g., scattered vs. boundary-aligned uncertainty). We propose three spatially aware metrics that incorporate structural and boundary information and conduct a thorough validation on medical imaging data from the prostate zonal segmentation challenge within the Medical Segmentation Decathlon. Our results demonstrate improved alignment with clinically important factors and better discrimination between meaningful and spurious uncertainty patterns.

[45] Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge cs.CVPDF

Ruiming Chen, Junming Yang, Shiyu Xia, Xu Yang, Jing Wang

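TL;DR: The paper proposes the MM-LG framework, which extracts multimodal generalizable knowledge from CLIP and initializes descendant models of different scales in a weighted-sum manner. Compared with existing methods, MM-LG improves both performance and efficiency.

Details

Motivation: CLIP's multimodal generalizable knowledge is important for downstream tasks, but large-scale pre-training is computationally expensive. Conventional Learngene methods cannot handle multimodal scenarios, so a new method is needed to extract and exploit multimodal generalizable knowledge.

Result: On multiple datasets (e.g., Oxford-IIIT PET and Flickr30k), the method outperforms existing Learngene approaches while substantially reducing storage and pre-training costs.

Insight: MM-LG demonstrates the potential of efficiently extracting and using multimodal generalizable knowledge, offering a new approach to lightweight, efficient deployment of multimodal models.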

Abstract: CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of a large number of parameters and large-scale pre-training makes it challenging to pre-train CLIP at different scales. Learngene extracts the generalizable components, termed learngene, from an ancestry model and initializes diverse descendant models with them. Previous Learngene paradigms fail to handle the generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness, which achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage while reducing pre-training costs by around 2.8 times for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.

[46] LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation cs.CVPDF

Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu

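TL;DR: LaVi is a new large vision-language model that achieves efficient vision-language fusion through internal feature modulation, avoiding the computational burden of long-context expansion and substantially improving efficiency and performance.

Details

Motivation: Existing large vision-language models integrate vision and language inefficiently: they either disrupt the model's inherent structure or introduce a long-context computational burden, which limits scalability and efficiency.

Result: State-of-the-art performance across 15 image and video benchmarks; compared with LLaVA-OV-7B, FLOPs are reduced by 94.0%, inference is 3.1 times faster, and memory usage is halved.

Insight: Internal feature modulation is an efficient multimodal fusion mechanism that achieves precise vision-language alignment without disrupting the structure of the language model.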

Abstract: Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model’s inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM’s linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half - establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
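The idea of injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization can be sketched as follows. This is an illustrative reading of the abstract, not LaVi's code: the function name is hypothetical, and the deltas are passed in directly because the abstract does not describe the module that computes them from visual features.

```python
import math


def modulated_layer_norm(hidden, gamma, beta, delta_gamma, delta_beta, eps=1e-5):
    """Layer norm whose scale and shift are adjusted per token.

    hidden:      list of token vectors, each of length d
    gamma, beta: shared affine parameters of length d
    delta_gamma, delta_beta: per-token adjustments, same shape as hidden
                 (assumed here to come from some vision-conditioned module)
    """
    out = []
    for tok, dg, db in zip(hidden, delta_gamma, delta_beta):
        mean = sum(tok) / len(tok)
        var = sum((x - mean) ** 2 for x in tok) / len(tok)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([
            (g + dgi) * (x - mean) * inv_std + (b + dbi)
            for x, g, b, dgi, dbi in zip(tok, gamma, beta, dg, db)
        ])
    return out


hidden = [[1.0, 3.0]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]
zeros = [[0.0, 0.0]]
plain = modulated_layer_norm(hidden, gamma, beta, zeros, zeros)
shifted = modulated_layer_norm(hidden, gamma, beta, zeros, [[1.0, 1.0]])
```

With zero deltas the function reduces to ordinary layer normalization, so the sequence length seen by the language model is unchanged; the visual signal enters only through the scale and shift.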

[47] Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition cs.CVPDF

Xiaodan Hu, Chuhang Zou, Suchen Wang, Jaechul Kim, Narendra Ahuja

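TL;DR: The paper proposes a framework that incorporates language-driven common sense knowledge for video action recognition, generating scene descriptions, inferring subsequent activities, and combining visual and textual cues to improve recognition.

Details

Motivation: Existing video action recognition methods do not fully exploit the rich common sense priors contained in language models, even though humans rely on such context to understand scenes. The paper aims to fill this gap.

Result: The method's effectiveness is validated on the Action Genome and Charades datasets.

Insight: Language-driven common sense reasoning can significantly improve action recognition in complex scenes, such as heavily occluded videos.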

Abstract: Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors - the scene contexts that humans use to constitute an understanding of objects, human-object interactions, and activities - that have not been fully exploited. In this paper, we introduce a framework incorporating language-driven common sense priors to identify cluttered video action sequences from monocular views that are often heavily occluded. We propose: (1) A video context summary component that generates candidate objects, activities, and the interactions between objects and activities; (2) A description generation module that describes the current scene given the context and infers subsequent activities, through auxiliary prompts and common sense reasoning; (3) A multi-modal activity recognition head that combines visual and textual cues to recognize video actions. We demonstrate the effectiveness of our approach on the challenging Action Genome and Charades datasets.

[48] Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement cs.CVPDF

Yunhan Ren, Feng Luo, Siyu Huang

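TL;DR: The paper introduces the Few-shot Generalized Category Discovery (FSGCD) task, which aims for competitive GCD performance when known information is scarce. The authors propose a retrieval-based decision boundary enhancement framework that uses a two-stage optimization strategy to strengthen the boundaries of known categories and transfer them to unknown categories.

Details

Motivation: Although existing GCD models have achieved notable success, their performance with limited labeled samples and few known categories remains underexplored. This work addresses that challenge.

Result: Experiments show that the method outperforms existing methods under the FSGCD setting, achieving the best performance on six public GCD benchmarks.

Insight: The study shows how retrieval and pseudo-labeling can be used effectively to improve GCD performance with limited labeled data, offering a new perspective for few-shot learning.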

Abstract: While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: https://github.com/Ryh1218/FSGCD

[49] TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion cs.CVPDF

Mingrui Zhu, Xiru Chen, Xin Wei, Nannan Wang, Xinbo Gao

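TL;DR: The paper proposes TeSG, a textual-semantic-guided method for infrared and visible image fusion that guides the fusion process with multi-level textual semantic information and significantly improves the performance of fused images on downstream tasks such as detection and segmentation.

Details

Motivation: Among existing infrared and visible image fusion methods, text-guided fusion has shown potential, but the effective integration and use of textual semantic information remains insufficiently studied. The paper therefore explores multi-level use of textual semantics in image fusion.

Result: Experiments show that TeSG outperforms existing methods in both fusion quality and downstream task performance, particularly on detection and segmentation.

Insight: Using textual semantic information at multiple levels (mask semantics and text semantics) can markedly improve how targeted and adaptable image fusion is, providing better information for downstream tasks.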

Abstract: Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, we introduce textual semantics at two levels: the mask semantic level and the text semantic level, both derived from textual descriptions extracted by large Vision-Language Models (VLMs). Building on this, we propose Textual Semantic Guidance for infrared and visible image fusion, termed TeSG, which guides the image synthesis process in a way that is optimized for downstream tasks such as detection and segmentation. Specifically, TeSG consists of three core components: a Semantic Information Generator (SIG), a Mask-Guided Cross-Attention (MGCA) module, and a Text-Driven Attentional Fusion (TDAF) module. The SIG generates mask and text semantics based on textual descriptions. The MGCA module performs initial attention-based fusion of visual features from both infrared and visible images, guided by mask semantics. Finally, the TDAF module refines the fusion process with gated attention driven by text semantics. Extensive experiments demonstrate the competitiveness of our approach, particularly in terms of performance on downstream tasks, compared to existing state-of-the-art methods.

[50] 3DeepRep: 3D Deep Low-rank Tensor Representation for Hyperspectral Image Inpainting cs.CV | eess.IVPDF

Yunshan Li, Wenwu Gong, Qianqian Wang, Chao Wang, Lili Yang

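TL;DR: The paper proposes 3DeepRep, a new model that performs deep nonlinear transforms along all three modes of the HSI tensor, combined with 3-directional TNN regularization, to significantly improve hyperspectral image inpainting.

Details

Motivation: Existing hyperspectral image inpainting methods usually apply the deep transform only along the spectral mode, ignoring low-rank properties along the other tensor modes and limiting performance.

Result: Experiments show that 3DeepRep outperforms existing techniques on multiple real-world HSI datasets, with clear gains both qualitatively and quantitatively.

Insight: Fully exploiting low-rank properties across all tensor modes is key to improving hyperspectral image inpainting, and multi-directional deep transforms with a fusion strategy are an effective way to do so.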

Abstract: Recent approaches based on transform-based tensor nuclear norm (TNN) have demonstrated notable effectiveness in hyperspectral image (HSI) inpainting by leveraging low-rank structures in latent representations. Recent developments incorporate deep transforms to improve low-rank tensor representation; however, existing approaches typically restrict the transform to the spectral mode, neglecting low-rank properties along other tensor modes. In this paper, we propose a novel 3-directional deep low-rank tensor representation (3DeepRep) model, which performs deep nonlinear transforms along all three modes of the HSI tensor. To enforce low-rankness, the model minimizes the nuclear norms of mode-i frontal slices in the corresponding latent space for each direction (i=1,2,3), forming a 3-directional TNN regularization. The outputs from the three directional branches are subsequently fused via a learnable aggregation module to produce the final result. An efficient gradient-based optimization algorithm is developed to solve the model in a self-supervised manner. Extensive experiments on real-world HSI datasets demonstrate that the proposed method achieves superior inpainting performance compared to existing state-of-the-art techniques, both qualitatively and quantitatively.

Liu Zongzhen, Luo Hui, Wang Zhixing, Wei Yuxing, Zuo Haorui

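TL;DR: The paper proposes CoDAF, a unified framework that addresses weak alignment in multimodal object detection for unmanned aerial vehicles (UAVs), improving detection performance through dynamic alignment and fusion.

Details

Motivation: In UAV multimodal (RGB and infrared) object detection, spatial misalignment caused by platform motion and asynchronous imaging severely degrades detection performance. Existing methods usually handle alignment or fusion in isolation, with limited effect.

Result: On the DroneVehicle dataset, CoDAF achieves an mAP of 78.6%, validating its effectiveness.

Insight: Handling alignment and fusion jointly is more effective than handling them separately, and dynamic attention can effectively balance the contributions of the modalities.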

Abstract: Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. To improve robustness, recent studies have explored multimodal detection by fusing visible (RGB) and infrared (IR) imagery. However, due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. This introduces two major challenges: semantic inconsistency at corresponding spatial locations and modality conflict during feature fusion. Existing methods often address these issues in isolation, limiting their effectiveness. In this paper, we propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF), a unified framework that jointly tackles both challenges in weakly aligned UAV-based object detection. CoDAF comprises two novel modules: the Offset-guided Semantic Alignment (OSA), which estimates attention-based spatial offsets and uses deformable convolution guided by a shared semantic space to align features more precisely; and the Dynamic Attention-guided Fusion Module (DAFM), which adaptively balances modality contributions through gating and refines fused features via spatial-channel dual attention. By integrating alignment and fusion in a unified design, CoDAF enables robust UAV object detection. Experiments on standard benchmarks validate the effectiveness of our approach, with CoDAF achieving a mAP of 78.6% on the DroneVehicle dataset.

[52] Uncertainty-Aware Variational Information Pursuit for Interpretable Medical Image Analysis cs.CVPDF

Md Nahiduzzaman, Ruwan Tennakoon, Steven Korevaar, Zongyuan Ge, Alireza Bab-Hadiashar

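TL;DR: The paper proposes Uncertainty-Aware Variational Information Pursuit (UAV-IP), which balances accuracy and interpretability in medical image analysis and improves model robustness and conciseness by quantifying uncertainty.

Details

Motivation: Medical imaging AI systems need both accuracy and interpretability to earn user trust. Existing V-IP methods overlook uncertainty in query-answer generation, which limits model reliability.

Result: On four medical imaging datasets (PH2, Derm7pt, BrEaST, SkinCon), UAV-IP improves average AUC by 3.2% over the baseline while producing explanations that are 20% more concise.

Insight: In interpretable medical AI, uncertainty awareness is a key factor in improving model robustness and clinical utility.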

Abstract: In medical imaging, AI decision-support systems must balance accuracy and interpretability to build user trust and support effective clinical decision-making. Recently, Variational Information Pursuit (V-IP) and its variants have emerged as interpretable-by-design modeling techniques, aiming to explain AI decisions in terms of human-understandable, clinically relevant concepts. However, existing V-IP methods overlook instance-level uncertainties in query-answer generation, which can arise from model limitations (epistemic uncertainty) or variability in expert responses (aleatoric uncertainty). This paper introduces Uncertainty-Aware V-IP (UAV-IP), a novel framework that integrates uncertainty quantification into the V-IP process. We evaluate UAV-IP across four medical imaging datasets, PH2, Derm7pt, BrEaST, and SkinCon, demonstrating an average AUC improvement of approximately 3.2% while generating 20% more concise explanations compared to baseline V-IP, without sacrificing informativeness. These findings highlight the importance of uncertainty-aware reasoning in interpretable-by-design models for robust and reliable medical decision-making.

[53] Noise-Informed Diffusion-Generated Image Detection with Anomaly Attention cs.CVPDF

Weinan Guan, Wei Wang, Bo Peng, Ziwen He, Jing Dong

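TL;DR: The paper proposes a detection method based on a Noise-Aware Self-Attention (NASA) module for identifying images generated by diffusion models, performing especially well on unseen diffusion models.

Details

Motivation: As the quality of diffusion-generated images keeps improving, information security risks are becoming more prominent. Addressing them requires effective detection of images generated by unseen diffusion models.

Result: Experiments show that NASA-Swin performs strongly when detecting images from unseen diffusion models, reaching state-of-the-art performance.

Insight: The key insight is that images generated by different diffusion models share similar noise patterns, which can be used to distinguish real from synthetic images.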

Abstract: With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be an effective countermeasure. However, a key challenge for forgery detection is generalising to diffusion models not seen during training. In this paper, we address this problem by focusing on image noise. We observe that images from different diffusion models share similar noise patterns, distinct from genuine images. Building upon this insight, we introduce a novel Noise-Aware Self-Attention (NASA) module that focuses on noise regions to capture anomalous patterns. To implement a SOTA detection model, we incorporate NASA into Swin Transformer, forming a novel detection architecture, NASA-Swin. Additionally, we employ a cross-modality fusion embedding to combine RGB and noise images, along with a channel mask strategy to enhance feature learning from both modalities. Extensive experiments demonstrate the effectiveness of our approach in enhancing detection capabilities for diffusion-generated images. When encountering unseen generation methods, our approach achieves state-of-the-art performance. Our code is available at https://github.com/WeinanGuan/NASA-Swin.

[54] Infrared and Visible Image Fusion Based on Implicit Neural Representations cs.CVPDF

Shuchen Sun, Ligen Shi, Chang Liu, Lina Wu, Jun Qiu

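TL;DR: The paper proposes INRFuse, an infrared and visible image fusion method based on implicit neural representations (INR). It represents multimodal information through a continuous function parameterized by a neural network, moving beyond the traditional reliance on discrete pixels or explicit features. The method achieves resolution-independent image fusion and outperforms existing methods in subjective visual quality and objective evaluation metrics.

Details

Motivation: Infrared and visible image fusion aims to combine the strengths of both modalities to produce images that are rich in information and meet visual or computational requirements. Conventional methods, however, usually rely on discrete pixels or explicit features, which limits fusion quality. The paper uses implicit neural representations (INR) to overcome these limitations.

Result: Experiments show that INRFuse outperforms existing methods in both subjective visual quality and objective evaluation metrics, producing fused images with clear structure, natural detail, and rich information, without requiring a training dataset.

Insight: 1. Implicit neural representations (INR) show potential for image fusion, especially because of their resolution independence. 2. Parameterizing information as a continuous function enables more flexible, higher-quality fusion. 3. Jointly optimizing multiple loss functions is an effective way to preserve the key information in multimodal images.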

Abstract: Infrared and visible light image fusion aims to combine the strengths of both modalities to generate images that are rich in information and fulfill visual or computational requirements. This paper proposes an image fusion method based on Implicit Neural Representations (INR), referred to as INRFuse. This method parameterizes a continuous function through a neural network to implicitly represent the multimodal information of the image, breaking through the traditional reliance on discrete pixels or explicit features. The normalized spatial coordinates of the infrared and visible light images serve as inputs, and multi-layer perceptrons are utilized to adaptively fuse the features of both modalities, resulting in the output of the fused image. By designing multiple loss functions, the method jointly optimizes the similarity between the fused image and the original images, effectively preserving the thermal radiation information of the infrared image while maintaining the texture details of the visible light image. Furthermore, the resolution-independent characteristic of INR allows for the direct fusion of images with varying resolutions and achieves super-resolution reconstruction through high-density coordinate queries. Experimental results indicate that INRFuse outperforms existing methods in both subjective visual quality and objective evaluation metrics, producing fused images with clear structures, natural details, and rich information without the necessity for a training dataset.
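The resolution-independent property mentioned in this abstract follows from querying a continuous function on normalized coordinates. The sketch below illustrates only that querying step; `toy_field` is a hand-written stand-in for a trained MLP, and the function names and the [-1, 1] coordinate range are assumptions, not details taken from INRFuse.

```python
import math


def coordinate_grid(height, width):
    """Normalized (y, x) coordinates in [-1, 1] for a height x width image."""
    def axis(n):
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    xs = axis(width)
    return [[(y, x) for x in xs] for y in axis(height)]


def render(field, height, width):
    """Sample a continuous field at every grid coordinate."""
    return [[field(y, x) for (y, x) in row] for row in coordinate_grid(height, width)]


def toy_field(y, x):
    # Stand-in for a fitted network: any function of continuous coordinates.
    return 0.5 + 0.25 * math.sin(math.pi * x) + 0.25 * math.cos(math.pi * y)


low_res = render(toy_field, 4, 4)
# Denser coordinate queries give a higher-resolution image of the same field.
high_res = render(toy_field, 8, 8)
```

Because the field is defined on coordinates rather than pixels, inputs of different resolutions map into the same coordinate space, and a denser grid at query time yields the upsampled output.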

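[55] TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration cs.CV | cs.MM

Xiaoyu Shi, Rahul Kumar Jain, Yinhao Li, Ruibo Hou, Jingliang Cheng

TL;DR: This paper introduces TextBraTS, the first publicly available volume-level multimodal dataset pairing MRI volumes with text annotations, and proposes a text-guided volumetric medical image segmentation method that significantly improves brain tumor segmentation accuracy.

Details

Motivation: Brain tumor analysis currently lacks a comprehensive dataset that combines imaging with text, which has limited the exploration of multimodal methods. This paper fills that gap with the TextBraTS dataset and a text-guided segmentation method aimed at higher segmentation accuracy.

Result: The proposed method significantly improves brain tumor segmentation accuracy, and the experiments confirm the benefit of multimodal fusion.

Insight: Combining textual information with imaging data can effectively improve medical image segmentation, and the TextBraTS dataset provides an important foundation for future multimodal research.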

Abstract: Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data. To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this novel dataset, we propose a novel baseline framework and sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques. Our dataset, implementation code, and pre-trained models are publicly available at https://github.com/Jupitern52/TextBraTS.

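[56] RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought cs.CV

Junbo Qiao, Miaomiao Cai, Wei Li, Yutong Liu, Xudong Huang

TL;DR: The paper proposes RealSR-R1, which introduces a vision-language chain-of-thought (VLCoT) framework and the GRPO algorithm to improve real-world image super-resolution, enabling more accurate understanding of degradation and higher-fidelity image generation.

Details

Motivation: Existing methods for real-world image super-resolution have an insufficient understanding of degraded image content, which leads to low-fidelity, unnatural reconstructions.

Result: Experiments show that RealSR-R1 generates more realistic details and understands content accurately, particularly in semantically rich scenes or severely degraded images.

Insight: Bringing chain-of-thought reasoning from language models into vision tasks, combined with reinforcement learning, can significantly improve real-world image super-resolution.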

Abstract: Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
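
As a rough illustration of how four rewards could feed a group-relative update, the sketch below combines them and normalizes each sample against its sampling group, which is the core idea of GRPO. The equal-weight sum, the example reward values, and the use of the population standard deviation are assumptions; the abstract does not specify them.

```python
import statistics

def total_reward(format_r, degradation_r, understanding_r, generation_r):
    """Equal-weight sum of the four rewards (the weighting is an assumption)."""
    return format_r + degradation_r + understanding_r + generation_r

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three hypothetical responses sampled for the same degraded input.
group = [
    total_reward(1.0, 0.6, 0.7, 0.5),
    total_reward(1.0, 0.2, 0.4, 0.3),
    total_reward(0.0, 0.5, 0.5, 0.4),
]
advantages = group_relative_advantages(group)
```

Responses that score above their group's mean receive a positive advantage and are reinforced; no separate value network is needed to form the baseline.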

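[57] Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation cs.CV

Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano

TL;DR: Forensic-oriented data augmentation improves the generalization of AI-generated video detection by steering the detector toward low-level artifacts rather than semantic flaws.

Details

Motivation: Existing AI-generated video detectors generalize poorly because they rely on high-level semantic flaws specific to one model instead of intrinsic low-level artifacts.

Result: Trained on data from a single generative model, the detector significantly outperforms state-of-the-art methods when tested on many other models, including very recent ones such as NOVA and FLUX.

Insight: Low-level artifacts are discriminative features shared across models, and forensic-oriented augmentation effectively improves generalization.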

Abstract: Synthetic video generation is progressing very rapidly. The latest models can produce very realistic high-resolution videos that are virtually indistinguishable from real ones. Although several video forensic detectors have been recently proposed, they often exhibit poor generalization, which limits their applicability in a real-world scenario. Our key insight to overcome this issue is to guide the detector towards seeing what really matters. In fact, a well-designed forensic classifier should focus on identifying intrinsic low-level artifacts introduced by a generative architecture rather than relying on high-level semantic flaws that characterize a specific model. In this work, first, we study different generative architectures, searching and identifying discriminative features that are unbiased, robust to impairments, and shared across models. Then, we introduce a novel forensic-oriented data augmentation strategy that is based on wavelet decomposition and replaces specific frequency-related bands to drive the model to exploit more relevant forensic cues. Our novel training paradigm improves the generalizability of AI-generated video detectors, without the need for complex algorithms and large datasets that include multiple synthetic generators. To evaluate our approach, we train the detector using data from a single generative model and test it against videos produced by a wide range of other models. Despite its simplicity, our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models, such as NOVA and FLUX. Code and data will be made publicly available.
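
The band-replacement idea can be sketched on a one-dimensional signal with a single-level Haar transform. This is only a toy: the abstract does not say which wavelet, which bands, or which donor signal the authors use, so the choice to keep the approximation band and swap in another signal's detail band is an assumption.

```python
import math

def haar_decompose(signal):
    """One-level Haar transform of an even-length sequence."""
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_reconstruct(approx, detail):
    """Invert haar_decompose."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def swap_detail_band(target, donor):
    """Keep the target's approximation band, take the detail band from the donor."""
    approx, _ = haar_decompose(target)
    _, donor_detail = haar_decompose(donor)
    return haar_reconstruct(approx, donor_detail)

target = [4.0, 2.0, 5.0, 7.0]
donor = [1.0, 3.0, 2.0, 2.0]
augmented = swap_detail_band(target, donor)
```

The augmented signal keeps the coarse content of `target` while its fine-scale band comes from `donor`, so a classifier trained on such samples cannot rely on that band's original content.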

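[58] Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes cs.CV

Chao Chen, Nobel Dang, Juexiao Zhang, Wenkai Sun, Pengfei Zheng

TL;DR: The paper introduces the Co-VisiON benchmark for evaluating co-visibility reasoning on sparse image sets, finds that existing vision models fall short on this task, and proposes a new method, Covis.

Details

Motivation: The human ability to recognize co-visibility across sparse images is essential to 3D vision and robotic perception, but it is unclear whether current vision models reach human-level performance on this task.

Result: Experiments show that existing models perform poorly at co-visibility reasoning under sparse conditions, and that Covis performs best among pure vision models.

Insight: Co-visibility requires comprehensive spatial reasoning beyond low-level feature matching, and existing models still need improvement.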

Abstract: Humans exhibit a remarkable ability to recognize co-visibility (the overlapping regions visible in multiple images) even when these images are sparsely distributed across a complex scene. This capability is foundational in 3D vision and robotic perception. Despite significant progress in vision learning, it remains unclear whether current vision models have reached human-level proficiency in co-visibility analysis. In this work, we introduce the Co-Visibility reasONing (Co-VisiON) benchmark, designed to directly evaluate co-visibility reasoning on sparse image sets across over 1000 indoor scenarios. Our experiments reveal that while co-visibility is typically treated as a low-level feature matching task, it poses a significant challenge for existing vision models under sparse conditions. Notably, a proprietary vision-language model outperforms all purely vision-based approaches, with all models lagging substantially behind human performance. This gap underscores the need for more than basic pairwise vision processing: it calls for a comprehensive spatial understanding through high-level reasoning across multiple views. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, high-level reasoning in challenging, sparse environments. Our dataset and source code can be found at: https://ai4ce.github.io/CoVISION

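[59] FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation cs.CV

Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao

TL;DR: FOCUS is a unified vision-language model that integrates segmentation-aware visual understanding and controllable object-centric generation in an end-to-end framework, unifying multimodal understanding, referring segmentation, and controllable image generation.

Details

Motivation: Current large vision-language models (LVLMs) usually treat visual understanding and generation separately and rely on multiple independent models, which limits the flexibility and accuracy of interactive editing. FOCUS addresses this with a unified framework.

Result: FOCUS performs strongly across three tasks (multimodal understanding, referring segmentation, and controllable generation), validating the unified framework.

Insight: By integrating segmentation and generation, FOCUS demonstrates the potential of end-to-end vision-language modeling and offers a new direction for more flexible interactive editing.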

Abstract: Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat “what to see” and “how to edit” separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.

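[60] Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection cs.CV | cs.AI

Yuchu Jiang, Jiaming Chu, Jian Zhao, Xin Zhang, Xu Yang

TL;DR: Loupe is a lightweight framework that combines a patch-level classifier with a segmentation module to perform both global classification and fine-grained localization of image forgeries, and uses a pseudo-label-guided test-time adaptation mechanism to improve generalization. It performs strongly on the DDL dataset.

Details

Motivation: The spread of generative models has made visual content forgery an increasingly serious problem, while existing methods either generalize poorly or depend on complex architectures. Loupe aims to address both issues.

Result: Loupe achieves state-of-the-art performance on the DDL dataset and took first place in the IJCAI 2025 challenge with an overall score of 0.846.

Insight: Patch-level fusion and the conditional query design improve both classification and localization, and test-time adaptation strengthens model robustness.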

Abstract: The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts in the test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism that leverages patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.

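[61] Self-supervised Feature Extraction for Enhanced Ball Detection on Soccer Robots cs.CV

Can Lin, Daniele Affinita, Marco E. P. Zimmatore, Daniele Nardi, Domenico D. Bloisi

TL;DR: This paper proposes a self-supervised learning framework that improves ball detection for soccer robots in dynamic environments, reducing the reliance on manual annotation through pseudo-label generation and multi-task learning, and introduces a new dataset to validate the method.

Details

Motivation: On dynamic, challenging RoboCup outdoor fields, traditional supervised ball detection requires extensive manual annotation, which is costly and time-consuming. This paper uses self-supervised learning to reduce the dependence on labeled data and improve detection performance.

Result: Experiments show that the method outperforms baseline models in accuracy, F1 score, and IoU, and converges faster.

Insight: Self-supervised learning can effectively reduce the reliance on manual annotation, and combining it with multi-task learning and meta-learning further improves adaptation to new scenarios.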

Abstract: Robust and accurate ball detection is a critical component for autonomous humanoid soccer robots, particularly in dynamic and challenging environments such as RoboCup outdoor fields. However, traditional supervised approaches require extensive manual annotation, which is costly and time-intensive. To overcome this problem, we present a self-supervised learning framework for domain-adaptive feature extraction to enhance ball detection performance. The proposed approach leverages a general-purpose pretrained model to generate pseudo-labels, which are then used in a suite of self-supervised pretext tasks – including colorization, edge detection, and triplet loss – to learn robust visual features without relying on manual annotations. Additionally, a model-agnostic meta-learning (MAML) strategy is incorporated to ensure rapid adaptation to new deployment scenarios with minimal supervision. A new dataset comprising 10,000 labeled images from outdoor RoboCup SPL matches is introduced, used to validate the method, and made available to the community. Experimental results demonstrate that the proposed pipeline outperforms baseline models in terms of accuracy, F1 score, and IoU, while also exhibiting faster convergence.

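[62] AnyTraverse: An off-road traversability framework with VLM and human operator in the loop cs.CV | cs.AI | cs.RO

Sattwik Sahu, Agamdeep Singh, Karthik Nambiar, Srikanth Saripalli, P. B. Sujit

TL;DR: The paper proposes AnyTraverse, a framework that combines natural language prompts with human-operator assistance to address the poor adaptability of autonomous navigation in unstructured environments.

Details

Motivation: Current methods perform poorly in unstructured environments and are hard to adapt to different robot types. A solution is needed that combines language prompts with human supervision.

Result: Validated on multiple datasets and robot platforms, AnyTraverse outperforms GA-NAV and Off-seg while reducing the burden of human supervision.

Insight: Combining natural language prompts with selective human supervision effectively improves navigation adaptability in unstructured environments while reducing the need for manual intervention.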

Abstract: Off-road traversability segmentation enables autonomous navigation with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Current frameworks struggle due to significant variations in unstructured environments and uncertain scene changes, and do not adapt to different robot types. We present AnyTraverse, a framework combining natural language-based prompts with human-operator assistance to determine navigable regions for diverse robotic vehicles. The system segments scenes for a given set of prompts and calls the operator only when it encounters previously unexplored scenery or an unknown class that is not part of the prompt in its region-of-interest, thus reducing active supervision load while adapting to varying outdoor scenes. Our zero-shot learning approach eliminates the need for extensive data collection or retraining. Our experimental validation includes testing on the RELLIS-3D, Freiburg Forest, and RUGD datasets and demonstrates real-world deployment on multiple robot platforms. The results show that AnyTraverse performs better than GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision.

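[63] Camera Calibration via Circular Patterns: A Comprehensive Framework with Measurement Uncertainty and Unbiased Projection Model cs.CV | cs.RO

Chaehyeon Song, Dongjae Lee, Jongwoo Lim, Ayoung Kim

TL;DR: This paper proposes a camera calibration framework based on circular patterns, comprising an unbiased projection model and the introduction of measurement uncertainty, which markedly improves calibration accuracy and robustness.

Details

Motivation: The conventional projection model for circular calibration patterns is biased under lens distortion, which degrades calibration accuracy. The authors propose improvements to raise performance.

Result: Experiments show that the method clearly outperforms conventional approaches such as checkerboard calibration in accuracy and robustness.

Insight: Introducing measurement uncertainty improves the overall performance of the calibration components and offers a new way to refine the centroid projection model.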

Abstract: Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroids of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on Green's theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at https://github.com/chaehyeonsong/discocal.
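
To show why Green's theorem is a natural shape representation here, the sketch below computes a region's area and centroid purely from its ordered boundary points, using the standard polygon formulas that follow from the theorem. It covers only the deterministic centroid; the Markov random field model and the uncertainty propagation described in the abstract are not reproduced, and the function name is illustrative.

```python
def polygon_area_and_centroid(points):
    """Area and centroid of a simple polygon from its ordered boundary points."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon")
    return area2 / 2, (cx / (3 * area2), cy / (3 * area2))

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
area, centroid = polygon_area_and_centroid(square)
```

Since the centroid is an explicit function of the boundary points, a distribution over those points can in principle be pushed through to a distribution over the centroid.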

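[64] Controllable and Expressive One-Shot Video Head Swapping cs.CV

Chaonan Ji, Jinwei Qi, Peng Zhang, Bang Zhang, Liefeng Bo

TL;DR: The paper proposes a diffusion-based, multi-condition controllable framework for video head swapping that seamlessly transplants a head from a static image into a dynamic video while supporting adjustment of head expressions and movements.

Details

Motivation: Existing face-swapping methods focus on localized facial replacement and neglect overall head morphology, while head-swapping methods struggle with hairstyle diversity and complex backgrounds, and none allow expressions to be adjusted after swapping.

Result: Experiments show that the method excels at seamless background integration and identity preservation, and demonstrates superior expression transfer for both real and virtual characters.

Insight: Disentangling identity from expression and using a scale-aware retargeting strategy are key to high-quality head swapping and expression editing.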

Abstract: In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video while preserving the original body and background of the target video, and further allows head expressions and movements to be adjusted during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement and neglect holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds, and none of these methods allow users to modify the transplanted head expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, and shows superior expression transfer capabilities applicable to both real and virtual characters.

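[65] ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware Control cs.CV | cs.AI

Jun Fu, Bin Tian, Haonan Chen, Shi Meng, Tingting Yao

TL;DR: This paper proposes ParkFormer, a Transformer-based end-to-end autonomous parking framework that combines goal-point embedding with pedestrian-aware control to achieve high-precision parking in complex environments.

Details

Motivation: Traditional rule-based parking systems adapt poorly to dynamic or crowded scenes, whereas human drivers park intuitively. The authors therefore imitate expert behavior to design a more robust autonomous parking system.

Result: In CARLA simulation the method reaches a 96.57% success rate, with an average positional error of 0.21 meters and an orientation error of 0.41 degrees. Ablation studies confirm the contribution of the key modules.

Insight: A purely data-driven approach (imitation learning) combined with modeling of pedestrian dynamics can significantly improve the robustness and safety of autonomous parking systems.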

Abstract: Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: https://github.com/little-snail-f/ParkFormer.

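[66] With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You cs.CV | cs.AI | cs.LG

Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbić

TL;DR: This paper proposes STRUCTURE, a regularization technique for multimodal alignment with limited data that substantially improves performance in the limited-sample regime.

Details

Motivation: Existing multimodal models depend on massive amounts of paired data, which many domains cannot provide at that scale. This paper explores how to achieve high-quality multimodal alignment with limited paired data.

Result: Average relative improvements of 51.6% in zero-shot classification and 91.8% in retrieval.

Insight: 1. Multimodal learning with limited samples is feasible. 2. Preserving geometric structure and selecting the right layers are key to efficient alignment. 3. The method applies broadly to resource-constrained domains.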

Abstract: Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than 1% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of 51.6% in classification and 91.8% in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
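
One simple way to picture a regularizer that preserves neighborhood geometry is to penalize changes in pairwise similarities between the original encoder embeddings and the aligned ones. The sketch below does exactly that with cosine similarity; it is a stand-in chosen for illustration, not the STRUCTURE formulation, which the abstract does not spell out.

```python
import math

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between all rows of `vectors`."""
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vectors]
    return [[sum(a * b for a, b in zip(u, w)) / (nu * nw)
             for w, nw in zip(vectors, norms)]
            for u, nu in zip(vectors, norms)]

def geometry_penalty(original, aligned):
    """Mean squared gap between pairwise similarities before and after alignment."""
    s0 = cosine_similarity_matrix(original)
    s1 = cosine_similarity_matrix(aligned)
    n = len(original)
    return sum((s0[i][j] - s1[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]    # 90-degree rotation
collapsed = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.0]]    # neighbours squashed together
```

A rotation leaves every pairwise similarity unchanged and incurs no penalty, while a mapping that collapses distinct points is penalized, which is the behavior one wants from a term added to an alignment loss.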

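[67] LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models cs.CV | cs.LG

Fanfei Li, Thomas Klein, Wieland Brendel, Robert Geirhos, Roland S. Zimmermann

TL;DR: LAION-C is an out-of-distribution (OOD) benchmark designed for web-scale vision models. It contains six novel distortion types that remain OOD even for large datasets such as LAION, and it challenges current state-of-the-art models.

Details

Motivation: Existing OOD benchmarks such as ImageNet-C are no longer suitable in the era of web-scale datasets: their distortion types already appear in the training data, so model scores saturate and no longer reflect true OOD generalization.

Result: LAION-C poses a significant challenge to contemporary models, including multimodal ones. The best models now match or exceed human observers in OOD generalization.

Insight: The paper identifies a paradigm shift in OOD generalization, from humans outperforming models to the best models matching or exceeding human robustness, highlighting the progress of modern models on OOD tasks.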

Abstract: Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today’s large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

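[68] Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs cs.CV | cs.AI | cs.CL

Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li

TL;DR: The paper proposes MICS, a new reasoning-path search scheme for generating rigorous and effective medical chain-of-thought (CoT) training data. By building the difficulty-ranked MMRP dataset and the curriculum-trained Chiron-o1 model, it substantially improves the reasoning ability of medical multimodal large language models.

Details

Motivation: Although multimodal large language models show strong reasoning on general tasks, their application in medicine is still at an early stage. Current approaches lack a comprehensive framework for searching and evaluating effective reasoning paths toward critical diagnoses.

Result: Chiron-o1 achieves state-of-the-art performance on multiple medical visual question answering and reasoning benchmarks.

Insight: 1. A collaborative search mechanism can effectively improve the quality of reasoning paths. 2. A difficulty-ranked dataset and a curriculum learning strategy are essential to the model's generalization.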

Abstract: Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches lack a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Code is available on GitHub at manglu097/Chiron-o1.

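[69] Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments cs.CV | cs.LG

Yasir Ali Farrukh, Syed Wali, Irfan Khan, Nathaniel D. Bastian

TL;DR: Prmpt2Adpt is a lightweight zero-shot domain adaptation framework that uses prompt-driven feature alignment and a teacher-student paradigm to achieve efficient domain adaptation in resource-constrained environments.

Details

Motivation: In resource-constrained settings such as drones, existing unsupervised domain adaptation methods usually depend on large vision-language models and require access to source-domain data, which limits their applicability.

Result: On the MDS-A dataset, Prmpt2Adpt is competitive with state-of-the-art methods while delivering up to 7x faster adaptation and 5x faster inference.

Insight: With prompt-driven semantic alignment and a lightweight design, Prmpt2Adpt offers a practical and scalable solution for real-time domain adaptation in resource-constrained environments.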

Abstract: Unsupervised Domain Adaptation (UDA) is a critical challenge in real-world vision systems, especially in resource-constrained environments like drones, where memory and computation are limited. Existing prompt-driven UDA methods typically rely on large vision-language models and require full access to source-domain data during adaptation, limiting their applicability. In this work, we propose Prmpt2Adpt, a lightweight and efficient zero-shot domain adaptation framework built around a teacher-student paradigm guided by prompt-based feature alignment. At the core of our method is a distilled and fine-tuned CLIP model, used as the frozen backbone of a Faster R-CNN teacher. A small set of low-level source features is aligned to the target domain semantics, specified only through a natural language prompt, via Prompt-driven Instance Normalization (PIN). These semantically steered features are used to briefly fine-tune the detection head of the teacher model. The adapted teacher then generates high-quality pseudo-labels, which guide the on-the-fly adaptation of a compact student model. Experiments on the MDS-A dataset demonstrate that Prmpt2Adpt achieves competitive detection performance compared to state-of-the-art methods, while delivering up to 7x faster adaptation and 5x faster inference speed using few source images, making it a practical and scalable solution for real-time adaptation in low-resource domains.
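
The feature-restyling step that an instance-normalization approach relies on can be sketched as follows: normalize a feature channel, then impose target-domain statistics on it. This is a generic AdaIN-style operation offered only as an illustration. In the sketch the target mean and standard deviation are supplied by hand; how PIN derives them from the natural language prompt is not described in the abstract, and the function name is hypothetical.

```python
import statistics

def restyle_channel(features, target_mean, target_std, eps=1e-5):
    """AdaIN-style step: normalise one channel, then apply target statistics."""
    mean = statistics.fmean(features)
    std = statistics.pstdev(features)
    return [target_std * (f - mean) / (std + eps) + target_mean for f in features]

source_channel = [1.0, 2.0, 3.0, 4.0]          # hypothetical low-level source features
restyled = restyle_channel(source_channel, target_mean=10.0, target_std=0.5)
```

The relative ordering of activations within the channel is kept, so content is preserved while the channel's statistics, which carry much of the domain "style", are replaced.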

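[70] A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving cs.CV

Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian

TL;DR: This paper presents a synthetic benchmark for collaborative 3D semantic occupancy prediction, built by augmenting an existing dataset with dense annotations, to address the limitations of single-vehicle perception.

Details

Motivation: Single-vehicle perception is limited by occlusion, restricted sensor range, and narrow viewpoints, so collaborative perception is needed to exchange complementary information and improve completeness and accuracy.

Result: The baseline model consistently outperforms single-agent models across the extended prediction ranges, with gains that grow as the range expands.

Insight: Collaborative perception shows strong potential for 3D semantic occupancy prediction, especially in long-range, large-extent scenes, where it significantly improves perception performance.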

Abstract: 3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.
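
The attention-aggregation step of inter-agent fusion can be sketched for a single voxel as a softmax-weighted average of per-agent features, scored against the ego vehicle's feature. The sketch assumes the features have already been spatially aligned to the ego frame and uses plain dot-product scoring; both are assumptions rather than details taken from the paper.

```python
import math

def attention_fuse(ego_feature, agent_features):
    """Softmax-weighted average of agent features, scored against the ego feature."""
    scale = math.sqrt(len(ego_feature))
    scores = [sum(q * k for q, k in zip(ego_feature, feat)) / scale
              for feat in agent_features]
    peak = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    fused = [sum(w * feat[d] for w, feat in zip(weights, agent_features))
             for d in range(len(ego_feature))]
    return fused, weights

ego = [1.0, 0.0]
agents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # ego's own feature plus two neighbours
fused, weights = attention_fuse(ego, agents)
```

Agents whose features agree with the ego view at that voxel receive larger weights, while the weighted sum still lets other agents contribute information the ego vehicle cannot observe.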

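[71] Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns cs.CV | eess.IV

Yiyang Tie, Hong Zhu, Yunyun Luo, Jing Shi

TL;DR: The paper proposes an unsupervised image super-resolution reconstruction method based on real-world degradation patterns, using a TripleGAN framework to address the challenge of modeling complex degradations in real low-resolution images.

Details

Motivation: Training real-world super-resolution models depends on datasets that reflect real degradation patterns, but extracting those patterns from low-resolution images alone is difficult, particularly for complex degradations such as blur, noise, and color shifts.

Result: Experiments show that the method has an advantage in quantitative metrics and maintains sharp reconstructions without over-smoothing artifacts.

Insight: The TripleGAN framework can effectively learn real-world degradation patterns and generate aligned datasets, improving super-resolution performance in real scenes.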

Abstract: The training of real-world super-resolution reconstruction models heavily relies on datasets that reflect real-world degradation patterns. Extracting and modeling degradation patterns for super-resolution reconstruction using only real-world low-resolution (LR) images remains a challenging task. When synthesizing datasets to simulate real-world degradation, relying solely on degradation extraction methods fails to capture both blur and diverse noise characteristics across varying LR distributions, as well as more implicit degradations such as color gamut shifts. Conversely, domain translation alone cannot accurately approximate real-world blur characteristics due to the significant degradation domain gap between synthetic and real data. To address these challenges, we propose a novel TripleGAN framework comprising two strategically designed components: The FirstGAN primarily focuses on narrowing the domain gap in blur characteristics, while the SecondGAN performs domain-specific translation to approximate target-domain blur properties and learn additional degradation patterns. The ThirdGAN is trained on pseudo-real data generated by the FirstGAN and SecondGAN to reconstruct real-world LR images. Extensive experiments on the RealSR and DRealSR datasets demonstrate that our method exhibits clear advantages in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts. The proposed framework effectively learns real-world degradation patterns from LR observations and synthesizes aligned datasets with corresponding degradation characteristics, thereby enabling the trained network to achieve superior performance in reconstructing high-quality SR images from real-world LR inputs.

[72] Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance cs.CV | cs.NEPDF

Lorenzo Tausani, Paolo Muratore, Morgan B. Talbot, Giacomo Amerio, Gabriel Kreiman

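TL;DR: This paper proposes Stretch-and-Squeeze (SnS), a gradient-free framework for systematically characterizing the invariances of visual units and their vulnerability to adversarial perturbations.

Details

Motivation: Existing methods reveal only a visual unit’s most exciting images and cannot capture how it responds under invariance-preserving transformations, which are key to the generalization ability of visual systems.

Result: In convolutional neural networks (CNNs), SnS found image variations that lie farther from the reference in pixel space than affine transformations do, yet preserve the target unit’s response more strongly. The invariant images differed markedly depending on which level of image representation was optimized.

Insight: Images generated from robust networks are more recognizable to humans, supporting robust networks as higher-fidelity models of the visual system.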

Abstract: Uncovering which features’ combinations high-level visual units encode is critical to understand how images are transformed into representations that support recognition. While existing feature visualization approaches typically infer a unit’s most exciting images, this is insufficient to reveal the manifold of transformations under which responses remain invariant, which is key to generalization in vision. Here we introduce Stretch-and-Squeeze (SnS), an unbiased, model-agnostic, and gradient-free framework to systematically characterize a unit’s invariance landscape and its vulnerability to adversarial perturbations in both biological and artificial visual systems. SnS frames these transformations as bi-objective optimization problems. To probe invariance, SnS seeks image perturbations that maximally alter the representation of a reference stimulus in a given processing stage while preserving unit activation. To probe adversarial sensitivity, SnS seeks perturbations that minimally alter the stimulus while suppressing unit activation. Applied to convolutional neural networks (CNNs), SnS revealed image variations that were further from a reference image in pixel-space than those produced by affine transformations, while more strongly preserving the target unit’s response. The discovered invariant images differed dramatically depending on the choice of image representation used for optimization: pixel-level changes primarily affected luminance and contrast, while stretching mid- and late-layer CNN representations altered texture and pose respectively. Notably, the invariant images from robust networks were more recognizable by human subjects than those from standard networks, supporting the higher fidelity of robust CNNs as models of the visual system.

[73] Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion cs.CVPDF

Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan

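TL;DR: Assembler is a scalable 3D part assembly framework that handles diverse assembly problems with a generative diffusion model over sparse anchor points, and achieves state-of-the-art performance on PartNet.

Details

Motivation: Existing 3D part assembly methods mostly rely on deterministic pose prediction and category-specific training, so they struggle with the ambiguity and diversity of varied, complex objects.

Result: It achieves SOTA performance on PartNet and is the first to demonstrate high-quality assembly of complex real-world objects.

Insight: A generative formulation captures the ambiguity of part assembly, and the anchor point cloud representation sidesteps direct pose prediction, making the approach suitable for interactive design.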

Abstract: We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresses the core challenges of scaling to general 3D part assembly through innovations in task formulation, representation, and data. First, Assembler casts part assembly as a generative problem and employs diffusion models to sample plausible configurations, effectively capturing ambiguities arising from symmetry, repeated parts, and multiple valid assemblies. Second, we introduce a novel shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space rather than SE(3) pose prediction. Third, we construct a large-scale dataset of over 320K diverse part-object assemblies using a synthesis and filtering pipeline built on existing 3D shape repositories. Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects. Based on Assembler, we further introduce an interesting part-aware 3D modeling system that generates high-resolution, editable objects from images, demonstrating potential for interactive and compositional design. Project page: https://assembler3d.github.io

[74] MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation cs.CV | cs.AI | cs.CLPDF

Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal

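TL;DR: The paper proposes MEXA, a framework that tackles multimodal reasoning through dynamic aggregation of multiple expert models and adapts to different tasks and modalities without training.

Details

Motivation: The complexity of multimodal tasks and the diversity of modalities make a unified framework hard to build, and most existing methods cannot adapt flexibly to many tasks.

Result: MEXA outperforms baseline models on a range of multimodal benchmarks, demonstrating broad applicability.

Insight: Modular design and dynamic expert selection provide a transparent and extensible solution for multimodal reasoning.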

Abstract: Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
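The select-then-aggregate flow described in the abstract can be sketched roughly as follows. This is a toy illustration only: the registry, the lambda "experts", `select_experts`, `answer`, and `toy_reasoner` are all hypothetical names, and a plain dictionary lookup stands in for whatever selection logic and Large Reasoning Model MEXA actually uses.

```python
from typing import Callable, Dict, List, Set, Tuple

Expert = Callable[[str], str]          # takes an input reference, returns textual reasoning
ExpertKey = Tuple[str, str]            # (modality, skill) pair


def build_registry() -> Dict[ExpertKey, Expert]:
    """Hypothetical experts, each tied to one modality-skill pair."""
    return {
        ("video", "temporal"): lambda x: f"[video/temporal] events in {x}",
        ("audio", "speech"): lambda x: f"[audio/speech] transcript of {x}",
        ("table", "numeric"): lambda x: f"[table/numeric] figures in {x}",
    }


def select_experts(registry: Dict[ExpertKey, Expert],
                   modalities: Set[str], skills: Set[str]) -> List[ExpertKey]:
    """Keep only experts whose modality and skill both match the request."""
    return [key for key in registry if key[0] in modalities and key[1] in skills]


def answer(query: str, source: str, modalities: Set[str], skills: Set[str],
           registry: Dict[ExpertKey, Expert],
           reasoner: Callable[[str, List[str]], str]) -> str:
    """Run the selected experts, then let a reasoner aggregate their text outputs."""
    reports = [registry[key](source) for key in select_experts(registry, modalities, skills)]
    return reasoner(query, reports)


def toy_reasoner(query: str, reports: List[str]) -> str:
    """Stand-in for the aggregating reasoning model (assumption, not an LRM)."""
    return f"{query} <- {len(reports)} expert report(s)"


result = answer("What happened?", "clip.mp4", {"video", "audio"},
                {"temporal", "speech"}, build_registry(), toy_reasoner)
```

Because every expert returns text, the aggregation step only ever sees strings, which is what keeps a design like this modular: adding a modality means adding a registry entry, not retraining anything.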

[75] RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking cs.CV | cs.ROPDF

Teng Guo, Jingjin Yu

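TL;DR: RGBTrack is a real-time 6D pose estimation and tracking framework that needs no depth input; it efficiently recovers depth with a binary search and render-and-compare mechanism and tracks stably in dynamic scenes.

Details

Motivation: Traditional 6D pose estimation relies on depth input, whereas RGBTrack aims for efficient, robust pose estimation and tracking from RGB data alone, including in dynamic scenes.

Result: RGBTrack shows competitive accuracy and real-time performance on benchmarks, making it applicable to robotics, augmented reality, and related areas.

Insight: RGB data alone can support accurate 6D pose tracking, offering a lightweight and efficient solution for practical applications.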

Abstract: We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack’s scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack’s novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution candidate for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git.
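The idea of a binary search over depth driven by render-and-compare can be illustrated with a toy pinhole-camera model. This is a sketch under stated assumptions, not RGBTrack's implementation: `render_width_px` is a hypothetical stand-in for a real renderer, the comparison uses projected width rather than whatever image-space score the paper uses, and all numbers are made up.

```python
def render_width_px(depth_m: float, object_width_m: float, focal_px: float) -> float:
    """Toy 'renderer': projected width of a true-scale object under a pinhole camera."""
    return focal_px * object_width_m / depth_m


def estimate_depth(observed_width_px: float, object_width_m: float, focal_px: float,
                   near_m: float = 0.1, far_m: float = 10.0, tol_m: float = 1e-4) -> float:
    """Binary-search the depth whose rendering matches the observed width.

    Projected width shrinks monotonically with depth, so each comparison
    tells us which half of the search interval to keep.
    """
    lo, hi = near_m, far_m
    while hi - lo > tol_m:
        mid = 0.5 * (lo + hi)
        if render_width_px(mid, object_width_m, focal_px) > observed_width_px:
            lo = mid  # rendering is too large, so the object must be farther away
        else:
            hi = mid  # rendering is too small (or equal), so the object is closer
    return 0.5 * (lo + hi)


# Hypothetical example: a 0.2 m wide object observed 80 px wide with f = 600 px.
depth = estimate_depth(observed_width_px=80.0, object_width_m=0.2, focal_px=600.0)
```

The search relies on the CAD model being at true scale: with a known physical width, apparent size in the image determines depth, which is why no depth sensor is needed in this toy setting.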

[76] Dynamic Watermark Generation for Digital Images using Perimeter Gated SPAD Imager PUFs cs.CVPDF

Md Sakibur Sajal, Marc Dandin

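TL;DR: This paper proposes a dynamic watermark generation method based on pgSPAD imagers, using the dark signal non-uniformity (DSNU) that arises from manufacturing variations as a security feature for digital images, and validates it for source identification and tamper detection.

Details

Motivation: Although CMOS image sensors (CIS) and active pixel sensors (APS) have been used for digital watermarking, single-photon avalanche diode (SPAD) imagers have not yet been explored. The paper aims to fill this gap by studying watermark generation with SPAD imagers.

Result: Experiments show that the method achieves both source identification and tamper detection, with a controllable trade-off between sensitivity and robustness.

Insight: Manufacturing-induced DSNU can be used to generate unique, hard-to-replicate dynamic watermarks, giving image security a new hardware basis.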

Abstract: Digital image watermarks as a security feature can be derived from the imager’s physically unclonable functions (PUFs) by utilizing the manufacturing variations, i.e., the dark signal non-uniformity (DSNU). While a few demonstrations focused on the CMOS image sensors (CIS) and active pixel sensors (APS), single photon avalanche diode (SPAD) imagers have never been investigated for this purpose. In this work, we propose a novel watermarking technique using perimeter gated SPAD (pgSPAD) imagers. We utilized the DSNU of three 64 x 64 pgSPAD imager chips, fabricated in a 0.35 μm standard CMOS process, and analyzed the simulated watermarks for standard test images from a publicly available database. Our observations show that both source identification and tamper detection can be achieved using the proposed source-scene-specific dynamic watermarks with a controllable sensitivity-robustness trade-off.

Dongdong Meng, Sheng Li, Hao Wu, Guoping Wang, Xueqing Yan

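TL;DR: This paper proposes a novel semi-supervised multi-modal medical image segmentation method that uses multi-stage fusion and contrastive mutual learning to exploit unlabeled data and improve segmentation against complex backgrounds.

Details

Motivation: Medical image annotation is expensive, and traditional semi-supervised methods fall short on complex backgrounds. Multi-modal fusion supplies complementary information, but exploiting unlabeled data under semi-supervised conditions remains difficult, so an effective semi-supervised multi-modal learning strategy is needed.

Result: The framework’s superior performance and robustness are validated on two multi-modal datasets.

Insight: Combining complementary multi-modal information with semi-supervised learning can markedly improve segmentation on complex tasks, especially when labeled data is limited.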

Abstract: Semi-supervised learning addresses the issue of limited annotations in medical images effectively, but its performance is often inadequate for complex backgrounds and challenging tasks. Multi-modal fusion methods can significantly improve the accuracy of medical image segmentation by providing complementary information. However, they face challenges in achieving significant improvements under semi-supervised conditions due to the challenge of effectively leveraging unlabeled data. There is a significant need to create an effective and reliable multi-modal learning strategy for leveraging unlabeled data in semi-supervised segmentation. To address these issues, we propose a novel semi-supervised multi-modal medical image segmentation approach, which leverages complementary multi-modal information to enhance performance with limited labeled data. Our approach employs a multi-stage multi-modal fusion and enhancement strategy to fully utilize complementary multi-modal information, while reducing feature discrepancies and enhancing feature sharing and alignment. Furthermore, we effectively introduce contrastive mutual learning to constrain prediction consistency across modalities, thereby facilitating the robustness of segmentation results in semi-supervised tasks. Experimental results on two multi-modal datasets demonstrate the superior performance and robustness of the proposed framework, establishing its valuable potential for solving medical image segmentation tasks in complex scenarios.

[78] On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting cs.CVPDF

Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu

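TL;DR: The paper proposes a theoretical framework of conditional feature alignment for unsupervised domain-adaptive counting, improving cross-domain counting performance by aligning distributions conditionally.

Details

Motivation: Object counting models degrade when deployed across domains because density shifts are task-relevant, which violates standard domain adaptation assumptions.

Result: On multiple counting datasets with varying density distributions, the method outperforms existing unsupervised domain adaptation methods.

Insight: Conditionally aligning distributions yields a tighter bound on the joint error, which improves cross-domain generalization for counting.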

Abstract: Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment. We first formalize the notion of conditional divergence by partitioning each domain into subsets (e.g., object vs. background) and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. These insights motivate a general conditional adaptation principle: by preserving task-relevant variations while filtering out nuisance shifts, one can achieve superior cross-domain generalization for counting. We provide both a theoretical analysis, which defines conditional divergence and proves its benefit in lowering joint error, and a practical adaptation strategy that preserves task-relevant information in unsupervised domain-adaptive counting. We demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
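A small numeric toy can show why measuring divergence per condition behaves differently from measuring it on pooled features. The sketch below is an assumption-laden illustration: it uses a gap between feature means as a crude stand-in for a divergence (the abstract does not say which divergence the paper uses), and the one-dimensional features and function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

Domain = Dict[str, List[float]]  # condition name -> 1-D feature values


def mean_gap(a: List[float], b: List[float]) -> float:
    """Crude stand-in for a divergence: absolute difference of feature means."""
    return abs(mean(a) - mean(b))


def conditional_gap(source: Domain, target: Domain) -> float:
    """Average the gap per condition (e.g. object vs. background)."""
    return mean(mean_gap(source[c], target[c]) for c in source)


def unconditional_gap(source: Domain, target: Domain) -> float:
    """Pool all features first, ignoring which condition they came from."""
    pooled_s = [x for feats in source.values() for x in feats]
    pooled_t = [x for feats in target.values() for x in feats]
    return mean_gap(pooled_s, pooled_t)


# Sparse source scene vs. dense target scene. Per-condition features are
# identical; only the object/background proportions (the density) differ.
source = {"object": [1.0] * 2, "background": [0.0] * 8}
target = {"object": [1.0] * 8, "background": [0.0] * 2}

cond = conditional_gap(source, target)
uncond = unconditional_gap(source, target)
```

Here the pooled gap is large purely because of the density shift, while the per-condition gap is zero. Forcing the pooled distributions to match would therefore distort exactly the variation a counting model needs, which is the intuition behind aligning conditionally.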

[79] Do We Need Large VLMs for Spotting Soccer Actions? cs.CV | cs.AI | cs.LGPDF

Ritabrata Chakraborty, Rajatsubhra Chakraborty, Avijit Dasgupta, Sandeep Chaurasia

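TL;DR: This paper proposes a lightweight text-based method that uses large language models (LLMs) in place of vision-language models (VLMs) to spot key actions in soccer matches, and experiments show it is effective.

Details

Motivation: Traditional video-based soccer action spotting is complex and computationally expensive. The paper uses the rich information in expert commentary to build a lighter, training-free, text-centric alternative.

Result: Experiments show that the language-centric method effectively detects key match events, providing a lightweight, training-free alternative.

Insight: The text of expert commentary carries enough context and detail to stand in for dense video data, simplifying the action spotting pipeline.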

Abstract: Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. In this work, we propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich, fine-grained descriptions and contextual cues such as excitement and tactical insights, contains enough information to reliably spot key actions in a match. To demonstrate this, we use the SoccerNet Echoes dataset, which provides timestamped commentary, and employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics. Each LLM evaluates sliding windows of commentary to identify actions like goals, cards, and substitutions, generating accurate timestamps for these events. Our experiments show that this language-centric approach performs effectively in detecting critical match events, providing a lightweight and training-free alternative to traditional video-based methods for action spotting.
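Evaluating sliding windows of timestamped commentary might look roughly like the sketch below. It is only an illustration: a single keyword matcher (`keyword_judge`) stands in for the three LLM judges, the window and stride lengths are invented, and the commentary lines are made up rather than taken from SoccerNet Echoes.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Comment = Tuple[float, str]  # (timestamp in seconds, commentary text)


def sliding_windows(comments: List[Comment], window_s: float,
                    stride_s: float) -> Iterator[Tuple[float, List[str]]]:
    """Yield (window start, texts inside the window) for non-empty windows."""
    if not comments:
        return
    start, end = comments[0][0], comments[-1][0]
    while start <= end:
        texts = [text for t, text in comments if start <= t < start + window_s]
        if texts:
            yield start, texts
        start += stride_s


def keyword_judge(texts: List[str]) -> Optional[str]:
    """Hypothetical stand-in for an LLM judge: returns an action label or None."""
    joined = " ".join(texts).lower()
    for label in ("goal", "card", "substitution"):
        if label in joined:
            return label
    return None


def spot_actions(comments: List[Comment],
                 judge: Callable[[List[str]], Optional[str]],
                 window_s: float = 30.0, stride_s: float = 15.0) -> List[Tuple[float, str]]:
    """Run the judge on each window; merge consecutive windows with the same label."""
    spotted: List[Tuple[float, str]] = []
    for start, texts in sliding_windows(comments, window_s, stride_s):
        label = judge(texts)
        if label is not None and (not spotted or spotted[-1][1] != label):
            spotted.append((start, label))
    return spotted


commentary = [
    (0.0, "Kick-off."),
    (40.0, "What a goal by the striker!"),
    (100.0, "Yellow card for the defender."),
]
events = spot_actions(commentary, keyword_judge)
```

Overlapping windows mean the same remark is usually seen twice, so the merge step in `spot_actions` keeps only the first window in which a label appears.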

[80] Co-Seg++: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation cs.CVPDF

Qing Xu, Yuxiang Luo, Wenting Duan, Zhen Chen

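TL;DR: Co-Seg++ proposes a new collaborative segmentation framework for medical images that couples semantic and instance segmentation, using a spatio-temporal prompt encoder and a multi-task collaborative decoder to improve segmentation performance.

Details

Motivation: Existing medical image segmentation methods usually handle different tasks in isolation and overlook their interdependencies, which leads to suboptimal performance. The authors propose collaborative learning to improve segmentation.

Result: On CT and histopathology datasets, Co-Seg++ outperforms existing methods in semantic, instance, and panoptic segmentation.

Insight: Semantic and instance segmentation can reinforce each other and improve medical image understanding; spatio-temporal relationships and cross-task guidance are effective mechanisms for collaborative learning.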

Abstract: Medical image analysis is critical yet challenged by the need of jointly segmenting organs or tissues, and numerous instances for anatomical structures and tumor microenvironment analysis. Existing studies typically formulated different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-temporal prompt encoder (STP-Encoder) to capture long-range spatial and temporal relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-arts in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.

[81] YASMOT: Yet another stereo image multi-object tracker cs.CVPDF

Ketil Malde

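TL;DR: YASMOT is a lightweight, flexible multi-object tracker that processes detector output from monoscopic or stereoscopic camera configurations and can generate consensus detections from ensembles of object detectors.

Details

Motivation: In video or still-image sequences, tracking objects and preserving their identity helps improve object detection performance and supports downstream tasks such as behavior classification and abundance estimation.

Result: YASMOT provides an efficient, flexible tracking solution for multiple camera configurations and detector ensembles.

Insight: Combining the results of multiple detectors can improve tracking robustness and accuracy, and the lightweight design suits practical deployment.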

Abstract: There now exist many popular object detectors based on deep learning that can analyze images and extract locations and class labels for occurrences of objects. For image time series (i.e., video or sequences of stills), tracking objects over time and preserving object identity can help to improve object detection performance, and is necessary for many downstream tasks, including classifying and predicting behaviors, and estimating total abundances. Here we present yasmot, a lightweight and flexible object tracker that can process the output from popular object detectors and track objects over time from either monoscopic or stereoscopic camera configurations. In addition, it includes functionality to generate consensus detections from ensembles of object detectors.
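One common way to build consensus detections from an ensemble is to group boxes by overlap and keep groups with enough votes. The sketch below shows that generic idea only; the abstract does not describe how yasmot does it, so the IoU threshold, the vote count, the box averaging, and the function names are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def consensus(detections_per_detector: List[List[Box]],
              iou_thr: float = 0.5, min_votes: int = 2) -> List[Box]:
    """Greedily cluster boxes across detectors and average well-supported clusters.

    Simplification: every box counts as one vote, even if two boxes in a
    cluster come from the same detector.
    """
    clusters: List[List[Box]] = []
    for boxes in detections_per_detector:
        for box in boxes:
            for cluster in clusters:
                if iou(cluster[0], box) >= iou_thr:  # compare with the cluster's first box
                    cluster.append(box)
                    break
            else:
                clusters.append([box])
    merged: List[Box] = []
    for cluster in clusters:
        if len(cluster) >= min_votes:
            n = len(cluster)
            merged.append(tuple(sum(b[i] for b in cluster) / n for i in range(4)))
    return merged


# Three hypothetical detectors; only the first object is seen by more than one.
ensemble = [
    [(0.0, 0.0, 10.0, 10.0), (50.0, 50.0, 60.0, 60.0)],
    [(1.0, 1.0, 11.0, 11.0)],
    [(0.0, 0.0, 10.0, 12.0)],
]
agreed = consensus(ensemble)
```

Raising `min_votes` trades recall for precision: an isolated detection from a single detector, like the second box above, is dropped.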

[82] Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition cs.CVPDF

Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou

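TL;DR: Hunyuan-GameCraft is a new framework for high-dynamic interactive video generation in games. It maps keyboard and mouse inputs into a shared camera representation space and uses a hybrid history-conditioned training strategy to achieve fine-grained action control and long-term consistency. Model distillation improves inference efficiency, making it suitable for real-time deployment.

Details

Motivation: Current diffusion-based and controllable video generation methods are limited in dynamics, generality, long-term consistency, and efficiency, which restricts their ability to generate diverse gameplay videos. Hunyuan-GameCraft aims to address these limitations.

Result: In experiments, Hunyuan-GameCraft significantly outperforms existing models, improving the realism and playability of game videos.

Insight: A unified input representation and hybrid history-conditioned training are key to high-dynamic interactive video generation, and model distillation improves efficiency while preserving performance.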

Abstract: Recent advances in diffusion-based and controllable video generation have enabled high-quality and temporally coherent video synthesis, laying the groundwork for immersive interactive gaming experiences. However, current methods face limitations in dynamics, generality, long-term consistency, and efficiency, which limit the ability to create various gameplay videos. To address these gaps, we introduce Hunyuan-GameCraft, a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space, facilitating smooth interpolation between various camera and movement operations. Then we propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information. Additionally, to enhance inference efficiency and playability, we achieve model distillation to reduce computational overhead while maintaining consistency across long temporal sequences, making it suitable for real-time deployment in complex interactive environments. The model is trained on a large-scale dataset comprising over one million gameplay recordings across over 100 AAA games, ensuring broad coverage and diversity, then fine-tuned on a carefully annotated synthetic dataset to enhance precision and control. The curated game scene data significantly improves the visual fidelity, realism and action controllability. Extensive experiments demonstrate that Hunyuan-GameCraft significantly outperforms existing models, advancing the realism and playability of interactive game video generation.

[83] UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation cs.CVPDF

Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu

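TL;DR: The paper proposes UniFork, a Y-shaped architecture that resolves the conflict between modality alignment patterns in multimodal understanding and generation by sharing the shallow layers and using task-specific deep branches, which improves performance.

Details

Motivation: Existing unified multimodal models are limited because understanding and generation call for different alignment patterns: understanding benefits from alignment that increases progressively with depth, while generation needs early alignment followed by decoupling in later layers.

Result: UniFork outperforms conventional fully shared Transformer architectures and matches or exceeds task-specific models.

Insight: Multimodal model design should account for the differing alignment patterns of each task; combining sharing with specialization is key to better performance.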

Abstract: Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from a progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.

[84] Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens cs.CV | cs.AIPDF

Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan

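TL;DR: The paper proposes Mirage, a framework that introduces latent visual tokens into the decoding of vision-language models (VLMs), enabling multimodal reasoning without explicit image generation and improving task performance.

Details

Motivation: VLMs excel at multimodal understanding, but because they decode text only, they hit a performance bottleneck on tasks that require visual imagination. Existing methods try to improve reasoning by generating explicit images, but image-generation pre-training often weakens reasoning ability. Inspired by how humans reason with mental imagery, the authors explore whether multimodal reasoning can be achieved with latent visual tokens, without generating explicit images.

Result: Experiments show that Mirage outperforms conventional methods on multimodal reasoning tasks without generating explicit images.

Insight: Mental-imagery-style reasoning with latent visual tokens bypasses the complexity of explicit image generation and improves multimodal reasoning more efficiently.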

Abstract: Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to “think visually”, it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. Beginning by supervising the latent tokens through distillation from ground-truth image embeddings, we then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.

[85] Emergent Temporal Correspondences from Video Diffusion Transformers cs.CVPDF

Jisu Nam, Soowon Son, Dahyun Chung, Jiyoung Kim, Siyoon Jin

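TL;DR: This paper proposes DiffTrack, a framework for quantitatively analyzing temporal correspondences in video diffusion models. It reveals the role of specific layers in temporal matching and demonstrates practical applications in zero-shot point tracking and video generation.

Details

Motivation: Although video diffusion models based on Diffusion Transformers (DiTs) have been remarkably successful at generating temporally coherent videos, how they internally establish and represent temporal correspondences across frames remains a key open question.

Result: DiffTrack achieves state-of-the-art performance in zero-shot point tracking and improves temporal consistency in video generation through a new guidance method.

Insight: 1. Temporal matching is carried out mainly by query-key similarities in specific layers rather than in all layers; 2. Temporal matching emerges progressively during denoising; 3. The internal mechanisms of video DiTs provide a foundation for understanding and improving their temporal understanding.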

Abstract: Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated video with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly prominent during the denoising process. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, Hengshuang Zhao

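TL;DR: VLN-R1 is an end-to-end framework that uses large vision-language models (LVLMs) to navigate directly from an egocentric view. With two-stage training (supervised fine-tuning followed by reinforcement fine-tuning) and a time-decayed reward mechanism, it performs strongly on the VLN-CE benchmark.

Details

Motivation: Existing language-model-based navigation systems depend on discrete topological graphs, which limits the flexibility of path planning. VLN-R1 aims to produce continuous navigation actions with an LVLM and improve task adaptability.

Result: VLN-R1 performs strongly on the VLN-CE benchmark, showing that LVLMs can improve their navigation ability through data-efficient, reward-driven training.

Insight: Large vision-language models can go beyond language tasks and, with reinforcement fine-tuning, perform embodied navigation efficiently, showing their potential for cross-modal applications.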

Abstract: Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model’s action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show that VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
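A reward that weights multi-step future actions with time decay could take a form like the one below. The abstract does not give the TDR formula, so this is purely an assumed shape: exponential weights `decay ** k`, exact-match scoring against expert actions, and normalization to the range 0 to 1 are all choices made for illustration.

```python
from typing import List


def time_decayed_reward(predicted: List[str], expert: List[str], decay: float = 0.5) -> float:
    """Weighted fraction of predicted actions that match the expert.

    Step k (0-based) gets weight decay**k, so near-future actions count
    more than distant ones. Returns a value in [0, 1].
    """
    horizon = min(len(predicted), len(expert))
    if horizon == 0:
        return 0.0
    weights = [decay ** k for k in range(horizon)]
    matched = sum(w for w, p, e in zip(weights, predicted, expert) if p == e)
    return matched / sum(weights)


expert_actions = ["forward", "left", "right"]

# A mistake at the last step costs little...
late_error = time_decayed_reward(["forward", "left", "forward"], expert_actions)
# ...while a mistake at the first step costs much more.
early_error = time_decayed_reward(["right", "left", "right"], expert_actions)
```

With `decay = 0.5` the weights are 1, 0.5, and 0.25, so the first action alone carries more than half of the total reward.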

cs.CL [Back]

[87] Veracity: An Open-Source AI Fact-Checking System cs.CL | cs.AI | cs.HCPDF

Taylor Lynn Curtis, Maximilian Puelma Touzel, William Garneau, Manon Gruaz, Mike Pinder

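TL;DR: The paper introduces Veracity, an open-source AI fact-checking system that aims to help users combat misinformation in a transparent and accessible way. It combines large language models with web retrieval agents and offers multilingual support, numerical veracity scores, and an interactive interface.

Details

Motivation: The spread of misinformation poses a serious threat to society, especially as generative AI amplifies it. Veracity aims to improve public media literacy and promote a better-informed society through open-source tooling.

Result: The system can detect misinformation and uses intuitive explanations to help users understand its reasoning, improving media literacy.

Insight: Open-source, transparent design can strengthen public trust in AI tools, while multilingual and interactive features help fact-checking technology reach a broad audience.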

Abstract: The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity’s ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.

[88] MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents cs.CL | cs.AI | cs.IRPDF

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash

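TL;DR: MEM1 proposes a reinforcement learning framework that handles long-horizon multi-turn tasks efficiently by dynamically updating a compact shared internal state, substantially reducing memory use while improving performance.

Details

Motivation: Existing language agents rely on full-context prompting in multi-turn interaction, which leads to unbounded memory growth, higher computational cost, and degraded reasoning on out-of-distribution input lengths.

Result: On a 16-objective multi-hop QA task, MEM1-7B improves performance by 3.5x and reduces memory usage by 3.7x compared with Qwen2.5-14B-Instruct, and it generalizes beyond the training horizon.

Insight: Reasoning-driven memory consolidation gives long-horizon interactive agents a solution that is both efficient and high-performing.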

Abstract: Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.
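The contrast between full-context prompting and a compact state that is rewritten every turn can be sketched with a toy loop. Everything here is a stand-in: MEM1 learns its consolidation with reinforcement learning, whereas this sketch uses a fixed rule (keep the highest-scoring facts), and the relevance scores, the `budget` parameter, and the function names are hypothetical.

```python
from typing import List, Tuple

Fact = Tuple[float, str]  # (hypothetical relevance score, text)


def consolidate(state: List[Fact], observation: List[Fact], budget: int = 3) -> List[Fact]:
    """Merge new observations into the state, then keep only `budget` facts.

    Duplicate texts are collapsed (redundant information) and low-scoring
    facts fall off the end (irrelevant information).
    """
    merged = {text: score for score, text in state}
    for score, text in observation:
        merged[text] = max(score, merged.get(text, score))
    ranked = sorted(((s, t) for t, s in merged.items()), reverse=True)
    return ranked[:budget]


def run_agent(turns: List[List[Fact]], budget: int = 3):
    """Track the size of a full-context history next to the compact state."""
    state: List[Fact] = []
    full_context: List[Fact] = []
    sizes: List[Tuple[int, int]] = []
    for observation in turns:
        full_context.extend(observation)            # grows without bound
        state = consolidate(state, observation, budget)  # stays bounded
        sizes.append((len(full_context), len(state)))
    return state, sizes


turns = [
    [(0.9, "A"), (0.1, "noise1")],
    [(0.8, "B"), (0.2, "noise2")],
    [(0.7, "C"), (0.9, "A")],
    [(0.95, "D")],
]
final_state, sizes = run_agent(turns)
```

The full-context length grows with every turn, while the state never exceeds `budget` entries, which is the constant-memory property the abstract describes.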

[89] Finance Language Model Evaluation (FLaME) cs.CL | cs.AI | cs.CEPDF

Glenn Matlin, Mika Okamoto, Huzaifa Pardawala, Yang Yang, Sudheer Chava

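TL;DR: FLaME is the first holistic benchmarking suite for language models (LMs) in finance. It addresses shortcomings of existing evaluation methods on specialized financial NLP tasks and demonstrates the potential of LMs in the financial domain.

Details

Motivation: Existing evaluation frameworks have methodological gaps on specialized financial NLP tasks, which has led to LM performance being underestimated. The paper aims to close this gap.

Result: The study demonstrates the performance potential of LMs on financial NLP tasks and corrects the underestimation in existing evaluations.

Insight: LMs for finance need targeted improvement in reasoning ability, and the potential of existing foundation models has been underestimated.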

Abstract: Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have caused an erroneous belief in a far lower bound of LMs’ performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). We are the first research paper to comprehensively study LMs against ‘reasoning-reinforced’ LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.

[90] Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning cs.CL | cs.AIPDF

Sam Silver, Jimin Sun, Ivan Zhang, Sara Hooker, Eddie Kim

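TL;DR: This paper shows that large language models (LLMs) can self-correct perturbed reasoning within a single utterance, both implicitly and explicitly, which suggests that their intrinsic self-correction ability may be stronger than commonly reported in the literature.

Details

Motivation: To study how brittle LLM reasoning is under minor variations in problem description or prompting strategy, and to explore the mechanisms behind their self-correction ability.

Result: Models self-correct robustly, with behavior ranging from subtle implicit corrections to explicit acknowledgment and correction of errors.

Insight: The intrinsic self-correction ability of LLMs may be underestimated; gains in reasoning performance may come from amplifying existing abilities rather than from learning entirely new ones.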

Abstract: Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, yet their performance remains brittle to minor variations in problem description and prompting strategy. Furthermore, reasoning is vulnerable to sampling-induced errors which autoregressive models must primarily address using self-correction via additionally-generated tokens. To better understand self-correction capabilities of recent models, we conduct experiments measuring models’ ability to self-correct synthetic perturbations introduced into their Chain of Thought (CoT) reasoning. We observe robust single-utterance intrinsic self-correction behavior across a range of open-weight models and datasets, ranging from subtle, implicit corrections to explicit acknowledgments and corrections of errors. Our findings suggest that LLMs, including those not finetuned for long CoT, may possess stronger intrinsic self-correction capabilities than commonly shown in the literature. The presence of this ability suggests that recent “reasoning” model work involves amplification of traits already meaningfully present in models.

[91] From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents cs.CLPDF

Mohammad Amaan Sayeed, Mohammed Talha Alam, Raza Imam, Shahab Saquib Sohail, Amir Hussain

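TL;DR: The paper proposes Tibbe-AG, an evaluation pipeline for validating responses grounded in Islamic medical texts. By combining retrieval, generation, and self-critique, it markedly improves the accuracy and cultural sensitivity of answers.

Details

Motivation: Islamic medical texts such as Avicenna’s Canon of Medicine and the Prophetic Tibb-e-Nabawi contain a wealth of information on preventive care and holistic therapies, yet they are underutilized in modern AI systems. Existing language-model benchmarks focus only on factual recall or user preference and lack large-scale validation of culturally grounded medical guidance.

Result: Retrieval-augmented generation improves factual accuracy by 13%, and the agentic prompt adds a further 10% improvement through deeper mechanistic insight and safety considerations.

Insight: Combining classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question answering and suggests a direction for cross-cultural medical AI applications.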

Abstract: Centuries-old Islamic medical texts like Avicenna’s Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.

[92] Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion cs.CL | cs.AI | cs.SD | eess.ASPDF

Markus Frohmann, Gabriel Meseguer-Brocal, Markus Schedl, Elena V. Epure

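TL;DR: The paper proposes DE-detect, a novel multimodal detection method that combines audio and lyrics information to make detection of AI-generated music more robust and practical.

Details

Motivation: With AI music generation tools advancing rapidly, existing unimodal detectors, which rely on either audio or lyrics alone, generalize poorly or demand high-quality data. A more reliable detection method is needed.

Result: Experiments show that DE-detect outperforms existing lyrics-based detectors at detecting AI-generated music and is more robust to audio perturbations.

Insight: Fusing multimodal information and extracting lyrics-related features directly from audio detects AI-generated content more effectively and is applicable in real-world settings.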

Abstract: The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.

[93] From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation cs.CL | cs.AIPDF

Zhihan Guo, Jiele Wu, Wenqian Cui, Yifei Zhang, Minda Hu

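TL;DR: This paper proposes ProxyReward, a reinforcement-learning-based framework that addresses the lack of data and the inaccurate reward signals in open-ended long text generation (Open-LTG), and reports experiments in which it outperforms GPT-4-Turbo.

Details

Motivation: Existing long-text research focuses mainly on long-context understanding, while Open-LTG remains underexplored because gold-standard data is missing and general-purpose reward signals are inadequate.

Result: Experiments show that ProxyReward outperforms GPT-4-Turbo on the Open-LTG task and improves performance by 20% when training widely used open-source models.

Insight: Designing targeted reward signals and generating datasets automatically are key to improving performance on open-ended long text generation.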

Abstract: Current research on long-form context in Large Language Models (LLMs) primarily focuses on the understanding of long contexts, while Open-ended Long Text Generation (Open-LTG) remains insufficiently explored. Training a long-context generation model requires curation of gold-standard reference data, which is typically nonexistent for informative Open-LTG tasks. However, previous methods only utilize general assessments as reward signals, which limits accuracy. To bridge this gap, we introduce ProxyReward, an innovative reinforcement learning (RL) based framework, which includes a dataset and a reward signal computation method. First, the ProxyReward Dataset is generated through simple prompts that enable the model to create it automatically, obviating extensive labeled data or significant manual effort. Second, the ProxyReward Signal offers a targeted evaluation of information comprehensiveness and accuracy for specific questions. The experimental results indicate that our method ProxyReward surpasses even GPT-4-Turbo. It can significantly enhance performance by 20% on the Open-LTG task when training widely used open-source models, while also surpassing the LLM-as-a-Judge approach. Our work presents effective methods to enhance the ability of LLMs to address complex open-ended questions posed by humans.

[94] EvoLM: In Search of Lost Language Model Training Dynamics cs.CL | cs.AI | cs.LGPDF

Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju

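TL;DR: EvoLM is a language model suite for systematically analyzing how different training stages (pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning) affect model performance. By training more than 100 models with 1B and 4B parameters, it reveals diminishing returns from excessive pre-training and post-training, as well as the importance of continued pre-training in bridging training stages.

Details

Motivation: Modern language model training is typically split into multiple stages, which makes it hard for developers to assess the impact of design choices at each stage. EvoLM aims to provide a transparent, systematic tool for studying training dynamics.

Result: The study finds diminishing returns from excessive pre-training and post-training, shows the importance of continued pre-training in bridging stages, and identifies intricate trade-offs in supervised fine-tuning and reinforcement learning.

Insight: Continued pre-training is crucial for mitigating forgetting during domain transfer; over-investing in pre-training or post-training can yield diminishing returns; and the configuration of each training stage directly affects generalization.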

Abstract: Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs’ training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

[95] Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3 cs.CL | cs.LGPDF

Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong

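TL;DR: The paper proposes a retrieval-augmented generation (RAG) framework built on LLaMA 3 for complex multi-hop question answering. It combines dense retrieval with multi-hop reasoning mechanisms to improve the accuracy and coherence of generated answers.

Details

Motivation: To address multi-hop reasoning and contextual understanding over long documents, the authors propose a new RAG framework.

Result: Experiments show that the framework outperforms existing methods on complex question answering and generates answers that are more accurate and contextually grounded.

Insight: A joint optimization strategy that couples retrieval and generation is key to better multi-hop QA performance, and LLaMA 3’s strong generation ability further improves the framework’s adaptability.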

Abstract: This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model’s robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.
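One plausible form of such a joint objective is a weighted sum of the generation cross-entropy and a retrieval negative log-likelihood. The abstract does not give the formula, so everything below is an assumption: $q$ is the question, $D$ the retrieved passages, $d^{+}$ a relevant passage, $y_1, \dots, y_T$ the answer tokens, $\theta$ and $\phi$ the generator and retriever parameters, and $\lambda$ a hypothetical weighting term.

```latex
\mathcal{L}(\theta, \phi)
  = \underbrace{-\sum_{t=1}^{T} \log p_\theta\bigl(y_t \mid y_{<t}, q, D\bigr)}_{\text{generation cross-entropy}}
  + \lambda \, \underbrace{\bigl(-\log p_\phi(d^{+} \mid q)\bigr)}_{\text{retrieval negative log-likelihood}}
```

Under this reading, $\lambda = 0$ recovers plain generator fine-tuning, and larger values push the retriever to rank relevant passages higher.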

[96] DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling cs.CL | cs.AI | cs.LGPDF

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık

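TL;DR: DynScaling achieves efficient inference scaling through dynamic and integrated sampling, without relying on external verifiers, and improves large language model performance.

Details

Motivation: Existing inference scaling methods often depend on external verifiers or are not optimized for realistic computational constraints, which limits their practical use.

Result: Experiments show that DynScaling outperforms existing verifier-free inference scaling baselines in both task performance and computational cost.

Insight: By dynamically adjusting the allocation of compute and optimizing the sampling strategy, DynScaling substantially improves efficiency and suits resource-constrained settings.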

Abstract: Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The integrated sampling strategy unifies parallel and sequential sampling by constructing synthetic sequential reasoning chains from initially independent parallel responses, promoting diverse and coherent reasoning trajectories. The dynamic budget allocation framework formulates the allocation of computational resources as a multi-armed bandit problem, adaptively distributing the inference budget across queries based on the uncertainty of previously sampled responses, thereby maximizing computational efficiency. By combining these components, DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost.
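The idea of treating each query as a bandit arm and spending more samples where responses disagree can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the entropy-based uncertainty, the UCB-style exploration bonus, and the names `answer_entropy`, `allocate_budget`, and `sample` are all assumptions.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def allocate_budget(
    queries: List[str],
    sample: Callable[[str, int], str],  # hypothetical call: (query, draw index) -> answer
    total_budget: int,
    initial_samples: int = 2,           # assumed to be at least 1
    explore: float = 0.5,
) -> Dict[str, List[str]]:
    """Spend a fixed sampling budget, favouring queries with uncertain answers."""
    responses = {q: [sample(q, i) for i in range(initial_samples)] for q in queries}
    spent = initial_samples * len(queries)

    while spent < total_budget:
        def score(q: str) -> float:
            n = len(responses[q])
            bonus = explore * math.sqrt(math.log(spent) / n)
            return answer_entropy(responses[q]) + bonus

        # Give the next sample to the query with the highest score.
        target = max(queries, key=score)
        responses[target].append(sample(target, len(responses[target])))
        spent += 1
    return responses
```

A query whose samples all agree has zero entropy and only receives further samples through the exploration bonus, so most of the budget flows to queries with conflicting answers.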

[97] A Hybrid DeBERTa and Gated Broad Learning System for Cyberbullying Detection in English Text cs.CL | cs.AIPDF

Devesh Kumar

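TL;DR: This paper proposes a hybrid architecture that combines a modified DeBERTa model with a Gated Broad Learning System (GBLS) for cyberbullying detection in English text. It performs well on several datasets and provides explainability mechanisms.

Details

Motivation: Cyberbullying is a serious social problem that affects many teenagers, and existing detection methods fall short in contextual understanding and pattern recognition.

Result: The model reaches 79.3% accuracy on HateXplain, 95.41% on SOSNet, 91.37% on Mendeley-I, and 94.67% on Mendeley-II.

Insight: (1) Detecting implicit bias and sarcastic content remains a challenge; (2) each architectural component contributes to performance; (3) explainability is essential for content moderation.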

Abstract: The proliferation of online communication platforms has created unprecedented opportunities for global connectivity while simultaneously enabling harmful behaviors such as cyberbullying, which affects approximately 54.4% of teenagers according to recent research. This paper presents a hybrid architecture that combines the contextual understanding capabilities of transformer-based models with the pattern recognition strengths of broad learning systems for effective cyberbullying detection. This approach integrates a modified DeBERTa model augmented with Squeeze-and-Excitation blocks and sentiment analysis capabilities with a Gated Broad Learning System (GBLS) classifier, creating a synergistic framework that outperforms existing approaches across multiple benchmark datasets. The proposed ModifiedDeBERTa + GBLS model achieved good performance on four English datasets: 79.3% accuracy on HateXplain, 95.41% accuracy on SOSNet, 91.37% accuracy on Mendeley-I, and 94.67% accuracy on Mendeley-II. Beyond performance gains, the framework incorporates comprehensive explainability mechanisms including token-level attribution analysis, LIME-based local interpretations, and confidence calibration, addressing critical transparency requirements in automated content moderation. Ablation studies confirm the meaningful contribution of each architectural component, while failure case analysis reveals specific challenges in detecting implicit bias and sarcastic content, providing valuable insights for future improvements in cyberbullying detection systems.

[98] Knee-Deep in C-RASP: A Transformer Depth Hierarchy cs.CL | cs.FLPDF

Andy Yang, Michaël Cadilhac, David Chiang

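TL;DR: Through a theoretical proof and an empirical study, the paper shows that deeper transformers are more expressive, and it establishes this result formally via C-RASP programs.

Details

Motivation: To determine how greater transformer depth translates into greater expressive power, and to establish this formally.

Result: Deeper transformers are more expressive both in theory and in experiments, and the theory predicts the depth needed to length-generalize on sequential dependency tasks.

Insight: Formal methods reveal how added depth affects expressive power and provide theoretical grounding for designing more efficient transformers.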

Abstract: It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained with greater depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). These results are established by studying a form of temporal logic with counting operators, which was shown equivalent to C-RASP in previous work. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
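To give a feel for programs built from counting operations, the sketch below recognizes balanced parentheses using only prefix counts and comparisons between counts. It is written in Python rather than C-RASP syntax, and treating the nesting of `prefix_count` calls as the program's depth is an informal reading of the abstract, not the paper's definition.

```python
from itertools import accumulate
from typing import List


def prefix_count(flags: List[bool]) -> List[int]:
    """Counting operation: for each position i, how many j <= i have flags[j] set."""
    return list(accumulate(int(f) for f in flags))


def is_dyck1(s: str) -> bool:
    """Accept strings of '(' and ')' that are balanced and never close too early."""
    if not s:
        return True
    # First counting layer: running totals of each bracket type.
    opens = prefix_count([c == "(" for c in s])
    closes = prefix_count([c == ")" for c in s])
    # Comparison of counts: a prefix is bad if it has closed more than it opened.
    bad_prefix = [c > o for o, c in zip(opens, closes)]
    # Second counting layer: count the bad prefixes seen so far.
    bad_total = prefix_count(bad_prefix)
    return bad_total[-1] == 0 and opens[-1] == closes[-1]
```

The second count is applied to a value that itself depends on the first counts, which is the kind of nesting a shallower program of this style would have to do without.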

[99] Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning cs.CLPDF

Duc Hieu Ho, Chenglin Fan

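TL;DR: The paper proposes self-critique-guided curiosity refinement, a prompting method that uses in-context learning to improve the honesty and helpfulness of large language models (LLMs). Experiments show consistent improvements across multiple models.

Details

Motivation: Although LLMs perform well on many natural language tasks, producing consistently honest and helpful outputs remains a challenge. This study tackles the problem with a lightweight method that requires no additional training.

Result: Compared with curiosity-driven prompting, the method reduces poor-quality responses and increases high-quality responses on all tested models, with relative gains in H² score of 1.4% to 4.3%.

Insight: Structured self-refinement is a scalable, training-free strategy that improves the trustworthiness of LLM outputs.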

Abstract: Large language models (LLMs) have demonstrated robust capabilities across various natural language tasks. However, producing outputs that are consistently honest and helpful remains an open challenge. To overcome this challenge, this paper tackles the problem through two complementary directions. It conducts a comprehensive benchmark evaluation of ten widely used large language models, including both proprietary and open-weight models from OpenAI, Meta, and Google. In parallel, it proposes a novel prompting strategy, self-critique-guided curiosity refinement prompting. The key idea behind this strategy is enabling models to self-critique and refine their responses without additional training. The proposed method extends the curiosity-driven prompting strategy by incorporating two lightweight in-context steps: a self-critique step and a refinement step. The experimental results on the HONESET dataset, evaluated using the $\mathrm{H}^2$ (honesty and helpfulness) framework with GPT-4o as the judge, show consistent improvements across all models. The approach reduces the number of poor-quality responses, increases high-quality responses, and achieves relative gains in $\mathrm{H}^2$ scores ranging from 1.4% to 4.3% compared to curiosity-driven prompting across evaluated models. These results highlight the effectiveness of structured self-refinement as a scalable and training-free strategy to improve the trustworthiness of LLM outputs.

[100] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning cs.CLPDF

Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn, Pittawat Taveekitworachai, Potsawee Manakul

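TL;DR: FinCoT introduces a structured chain-of-thought prompting method grounded in expert financial reasoning. It markedly improves performance on financial question answering while reducing the amount of generated text.

Details

Motivation: In financial NLP (FinNLP), existing prompting methods are mostly unstructured or heuristic and lack grounding in expert domain knowledge. FinCoT aims to improve model reasoning through structured chains of thought.

Result: On CFA-style questions, FinCoT markedly improves performance (up to 80.5%) while cutting generated tokens eight-fold and lowering inference cost.

Insight: Domain-aligned structured prompts not only improve performance but also yield reasoning traces that are more interpretable and better aligned with expert reasoning.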

Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting approach that incorporates insights from domain-specific expert financial reasoning to guide the reasoning traces of large language models. We identify three main prompting styles in FinNLP: (1) standard prompting–zero-shot prompting; (2) unstructured CoT–CoT prompting without an explicit reasoning structure, such as the use of tags; and (3) structured CoT prompting–CoT prompting with explicit instructions or examples that define structured reasoning steps. Previously, FinNLP has primarily focused on prompt engineering with either standard or unstructured CoT prompting. However, structured CoT prompting has received limited attention in prior work. Furthermore, the design of reasoning structures in structured CoT prompting is often based on heuristics from non-domain experts. In this study, we investigate each prompting approach in FinNLP. We evaluate the three main prompting styles and FinCoT on CFA-style questions spanning ten financial domains. We observe that FinCoT improves performance from 63.2% to 80.5% and Qwen-2.5-7B-Instruct from 69.7% to 74.2%, while reducing generated tokens eight-fold compared to structured CoT prompting. Our findings show that domain-aligned structured prompts not only improve performance and reduce inference costs but also yield more interpretable and expert-aligned reasoning traces.

[101] Under the Shadow of Babel: How Language Shapes Reasoning in LLMs cs.CL | cs.AIPDF

Chenxi Wang, Yixuan Zhang, Lang Gao, Zixiang Xu, Zirui Song

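TL;DR: The paper examines how language structure affects reasoning in large language models (LLMs). Using BICAUSE, a bilingual dataset built for this purpose, it reveals language-dependent attention distributions and causal reasoning preferences.

Details

Motivation: Starting from the linguistic relativity hypothesis, the study asks whether language structure shapes the cognitive and reasoning patterns of LLMs.

Result: LLMs show different attention patterns and causal reasoning preferences in Chinese and English, and they tend to impose language-specific preferences on atypical inputs.

Insight: Language affects not only the surface form of LLM outputs but also their internal reasoning patterns, which supports the relevance of linguistic relativity to AI.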

Abstract: Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal word order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.

[102] SGIC: A Self-Guided Iterative Calibration Framework for RAG cs.CLPDF

Guanhua Chen, Yutong Yao, Lidia S. Chao, Xuebo Liu, Derek F. Wong

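TL;DR: This paper proposes SGIC, a self-guided iterative calibration framework that uses uncertainty scores to strengthen the calibration ability of large language models (LLMs) in retrieval-augmented generation (RAG), significantly improving performance.

Details

Motivation: Existing RAG research focuses on document retrieval and overlooks the calibration potential of LLMs. This paper aims to improve calibration by exploiting the in-context reasoning ability of LLMs.

Result: SGIC significantly improves the performance of both closed-source and open-weight LLMs.

Insight: In RAG systems, iterative calibration that combines in-context reasoning with uncertainty scores can substantially improve how well a model uses critical information and how accurate its responses are.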

Abstract: Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, numerous methodologies frequently neglect the calibration capabilities of large language models (LLMs), which capitalize on their robust in-context reasoning prowess. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibrations. We present a new SGIC: Self-Guided Iterative Calibration Framework that employs uncertainty scores as a tool. Initially, this framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, amalgamating them with prior responses to refine calibration. Furthermore, we introduce an innovative approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-weight LLMs.

[103] End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data cs.CL | eess.ASPDF

Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju

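TL;DR: The paper explores how to build end-to-end speech translation systems for low-resource languages from weakly labeled data, and validates the approach experimentally.

Details

Motivation: Low-resource languages lack the high-quality annotated data needed to train speech translation models effectively, so the authors turn to weakly labeled data.

Result: Experiments show that models trained on weakly labeled data perform comparably to massive multimodal multilingual baselines such as SONAR and SeamlessM4T.

Insight: Weakly labeled data is a viable way to address data scarcity for low-resource languages, although data quality still has a notable effect on model performance.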

Abstract: The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.

[104] Generalizability of Media Frames: Corpus creation and analysis across countries cs.CLPDF

Agnese Daffara, Sourabh Dattawad, Sebastian Padó, Tanise Ceron

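TL;DR: The paper examines how well the U.S.-focused Media Frame Corpus (MFC) frames generalize to Brazilian Portuguese news. It creates the FrameNews-PT dataset, tests applicability through annotation experiments, and finds that the MFC frames are broadly applicable but need adjustment.

Details

Motivation: To test whether the MFC framework applies to news coverage in other cultural contexts, in particular Brazilian political and economic news.

Result: The MFC frames remain broadly applicable with adjustments; some frames are rarely used, and novel issues fall back on general frames.

Insight: Cross-cultural use of frames requires care, since general frames may not fully capture cultural differences and emerging issues.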

Abstract: Frames capture aspects of an issue that are emphasized in a debate by interlocutors and can help us understand how political language conveys different perspectives and ultimately shapes people’s opinions. The Media Frame Corpus (MFC) is the most commonly used framework with categories and detailed guidelines for operationalizing frames. It is, however, focused on a few salient U.S. news issues, making it unclear how well these frames can capture news issues in other cultural contexts. To explore this, we introduce FrameNews-PT, a dataset of Brazilian Portuguese news articles covering political and economic news and annotate it within the MFC framework. Through several annotation rounds, we evaluate the extent to which MFC frames generalize to the Brazilian debate issues. We further evaluate how fine-tuned and zero-shot models perform on out-of-domain data. Results show that the 15 MFC frames remain broadly applicable with minor revisions of the guidelines. However, some MFC frames are rarely used, and novel news issues are analyzed using general ‘fall-back’ frames. We conclude that cross-cultural frame use requires careful consideration.

[105] Large Language Models in Argument Mining: A Survey cs.CLPDF

Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic

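TL;DR: This survey systematically reviews the use and impact of large language models (LLMs) in argument mining (AM), covering foundational theories, datasets, task taxonomy, techniques, evaluation practices, and open challenges.

Details

Motivation: Argument mining is an important subfield of natural language processing, and LLMs have transformed it. The survey aims to trace systematically how LLMs are reshaping AM and to guide future research.

Result: The survey summarizes recent progress of LLMs in AM, identifies the limitations and potential of current techniques, and proposes directions for future research.

Insight: LLMs have advanced AM through better in-context learning, cross-domain adaptability, and reasoning, but long-context reasoning, interpretability, and annotation efficiency remain key challenges.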

Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

[106] HausaNLP at SemEval-2025 Task 11: Advancing Hausa Text-based Emotion Detection cs.CLPDF

Sani Abdullahi Sani, Salim Abubakar, Falalu Ibrahim Lawan, Abdulhamid Abubakar, Maryam Bala

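TL;DR: This paper describes an approach to multi-label emotion detection in Hausa, a low-resource African language. Fine-tuning AfriBERTa, a model pre-trained on African languages, yields 74.00% validation accuracy and a 73.50% F1-score.

Details

Motivation: Hausa is a low-resource language with little dedicated research on emotion detection. The work uses a model pre-trained on African languages to help fill this gap.

Result: Validation accuracy is 74.00% and the F1-score is 73.50%.

Insight: Pre-trained models such as AfriBERTa can effectively support emotion detection in low-resource languages, which demonstrates the potential of cross-lingual models.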

Abstract: This paper presents our approach to multi-label emotion detection in Hausa, a low-resource African language, as part of SemEval Track A. We fine-tuned AfriBERTa, a transformer-based model pre-trained on African languages, to classify Hausa text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our methodology involved data preprocessing, tokenization, and model fine-tuning using the Hugging Face Trainer API. The system achieved a validation accuracy of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of transformer-based models for emotion detection in low-resource languages.

Chenyi Zhou, Zhengyan Shi, Yuan Yao, Lei Liang, Huajun Chen

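TL;DR: RiOT is an automatic prompt optimization framework based on a residual optimization tree. It addresses the lack of diversity and the semantic drift of existing methods through diverse candidate generation and text residual connections, and it performs well across multiple tasks.

Details

Motivation: The performance of large language models (LLMs) depends heavily on prompt design, and existing automatic prompt optimization methods suffer from limited diversity and semantic drift, which restricts the exploration of new directions and hurts consistency across tasks.

Result: On five benchmarks covering commonsense, mathematical, logical, temporal, and semantic reasoning, RiOT outperforms previous prompt optimization methods and manually designed prompts.

Insight: Through residual connections and a tree structure, RiOT balances diversity against task consistency and demonstrates the potential of automatic prompt optimization.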

Abstract: Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizations for one task can degrade performance in others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates the text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. Extensive experiments across five benchmarks, covering commonsense, mathematical, logical, temporal, and semantic reasoning, demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting.
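Perplexity-based selection among candidate prompts can be sketched in a few lines. The abstract only says that perplexity is used for selection, so two things here are assumptions: that the candidate with the lowest perplexity is preferred, and that per-token log-probabilities for each candidate are already available from some scoring model. The function names are illustrative.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from natural-log token probabilities: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_prompt(candidates: Dict[str, List[float]]) -> str:
    """Pick the candidate prompt with the lowest perplexity (assumed selection rule)."""
    return min(candidates, key=lambda prompt: perplexity(candidates[prompt]))
```

Here `candidates` maps each prompt string to the log-probabilities a scoring model assigned to the relevant tokens; which tokens are scored is a design choice the abstract leaves open.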

[108] From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling cs.CL | cs.AIPDF

Yao Lu, Zhaiyuan Ji, Jiawei Du, Yu Shanqing, Qi Xuan

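TL;DR: The paper proposes a new paradigm of multi-model cooperative annotation, implemented in the AutoAnnotator framework, to address the high cost of LLM annotation and its weak fine-grained semantic understanding.

Details

Motivation: LLM annotation is expensive at scale and performs poorly on fine-grained tasks such as sentiment classification, where specialized small language models (SLMs) do better.

Result: AutoAnnotator outperforms existing LLMs in zero-shot, one-shot, CoT, and majority-voting settings. Compared with annotating directly with GPT-3.5-turbo, it cuts annotation cost by 74.15% and improves accuracy by 6.21%.

Insight: Cooperative annotation with LLMs and SLMs combines the strengths of both, lowering cost while improving annotation quality, especially on fine-grained tasks.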

Abstract: Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still has two core bottlenecks: first, the cost of calling commercial APIs in large-scale annotation is very high; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design a fully automatic annotation framework, AutoAnnotator, based on it. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: https://github.com/Zhaiyuan-Ji/AutoAnnotator.
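The two-layer flow, with specialist models voting and a reviewer handling only the samples they disagree on, can be sketched as below. This is an illustration under assumptions rather than the AutoAnnotator code: the unanimity threshold, the `reviewer` callable standing in for the LLM meta-controller, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

Labeler = Callable[[str], str]


def vote(
    text: str, specialists: Sequence[Labeler], min_agreement: float = 1.0
) -> Tuple[Optional[str], float]:
    """Majority vote; return no label when agreement is below the threshold."""
    votes = Counter(model(text) for model in specialists)
    label, count = votes.most_common(1)[0]
    agreement = count / len(specialists)
    return (label if agreement >= min_agreement else None), agreement


def annotate(
    texts: Sequence[str],
    specialists: Sequence[Labeler],
    reviewer: Labeler,  # hypothetical stand-in for the LLM review step
    min_agreement: float = 1.0,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Label every text; collect reviewer-labeled hard samples for later fine-tuning."""
    labels: List[str] = []
    hard_samples: List[Tuple[str, str]] = []
    for text in texts:
        label, _ = vote(text, specialists, min_agreement)
        if label is None:
            # Specialists disagree: escalate this sample to the reviewer.
            label = reviewer(text)
            hard_samples.append((text, label))
        labels.append(label)
    return labels, hard_samples
```

Only disputed samples reach the reviewer, which is where the cost saving would come from, and the returned `hard_samples` list plays the role of the set used to fine-tune the specialists in stages.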

[109] OJBench: A Competition Level Code Benchmark For Large Language Models cs.CLPDF

Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao

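TL;DR: The paper introduces OJBench, a competition-level code benchmark for large language models. It comprises 232 programming competition problems from NOI and ICPC and is intended to assess models’ code reasoning ability more rigorously.

Details

Motivation: Existing code benchmarks cannot fully evaluate the competition-level code reasoning of large language models, so a more rigorous test is needed.

Result: Even state-of-the-art reasoning-oriented models such as o4-mini and Gemini-2.5-pro-exp struggle with highly challenging competition problems, which exposes the difficulty of competition-level code reasoning.

Insight: Competition-level code reasoning places higher demands on models, and current models need further improvement.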

Abstract: Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models’ reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.

[110] A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications cs.CL

Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He

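TL;DR: This scoping review surveys research on and applications of synthetic data generation in biomedicine between 2020 and 2025, summarizing data modalities, generation methods, and evaluation approaches, and identifying current limitations and future challenges.

Details

Motivation: Biomedical fields face data scarcity, privacy concerns, and data quality challenges; synthetic data generation, especially with large language models, offers a potential remedy.

Result: Among the reviewed studies, unstructured text accounts for 78.0%, tabular data for 13.6%, and multimodal data for 8.4%. Generation relies mainly on prompting (72.9%) and fine-tuning (22.0%) LLMs. Evaluations include intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%).

Insight: Synthetic data generation has great potential in biomedicine, but progress is still needed in adaptation across clinical domains, resource accessibility, and evaluation standardization.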

Abstract: Synthetic data generation, which mitigates data scarcity, privacy concerns, and data quality challenges in biomedical fields, has been facilitated by rapid advances in large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting LLMs (72.9%), fine-tuning LLMs (22.0%), and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardization.

[111] Modeling Public Perceptions of Science in Media cs.CL | cs.AI | cs.CY | cs.HC

Jiaxin Pei, Dustin Wright, Isabelle Augenstin, David Jurgens

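TL;DR: The paper proposes a computational framework for modeling multidimensional public perception of science news, builds a large-scale dataset, and develops NLP models that predict public perception. It finds that frequency of science news consumption is the main factor shaping perception and that perception scores predict public engagement.

Details

Motivation: Science communication is vital for building public trust and understanding, but the volume of information is large and audience reactions are hard to anticipate.

Result: (1) Frequency of science news consumption is the main driver of perception; (2) science posts with higher perception scores receive more engagement (comments and upvotes) on Reddit; (3) demographic factors have little influence on perception.

Insight: Nuanced modeling of public perception can effectively predict public engagement, and how scientific information is framed affects how it spreads.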

Abstract: Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals’ frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and a carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, a pattern that holds both across different scientific information and for the same science framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.

[112] GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View cs.CL | cs.AI | cs.MM

Fenghua Cheng, Jinxiang Wang, Sen Wang, Zi Huang, Xue Li

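TL;DR: GeoGuess is a new multimodal reasoning task that requires a system to identify the location of a street view image and provide a detailed explanation, combining fine-grained visual clues with global context. The authors introduce the GeoExplain dataset and the SightSense reasoning method and report strong performance.

Details

Motivation: Existing multimodal reasoning tasks do not test reasoning over visual clues at different levels of granularity, although this matters in real scenarios. GeoGuess fills that gap.

Result: Experiments and analysis show that SightSense performs well on the GeoGuess task.

Insight: GeoGuess tests not only an AI system's visual understanding but also highlights the importance of multimodal, hierarchical reasoning, opening a new direction for future research.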

Abstract: Multimodal reasoning is a process of understanding, integrating and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although there are various tasks for evaluating multimodal reasoning ability, they still have limitations. The lack of reasoning over hierarchical visual clues at different levels of granularity, e.g., local details and global context, is rarely discussed, despite how often such reasoning arises in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate them with vast geographic knowledge. Therefore, GeoGuess requires the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset, GeoExplain, which consists of panorama-geocoordinate-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense, which can make predictions and generate comprehensive explanations based on a hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.

[113] Long-Context Generalization with Sparse Attention cs.CL | cs.AI

Pavlo Vasylenko, Marcos Treviso, André F. T. Martins

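TL;DR: The paper proposes a sparse attention method, ASEntmax, which uses a learnable temperature parameter to adjust the attention distribution dynamically. This addresses the attention dispersion of standard softmax on long sequences and, combined with position-encoding design, improves model performance.

Details

Motivation: Softmax attention in standard Transformers disperses attention on long sequences and struggles to focus on fixed-size patterns. The paper explores sparse attention as a remedy.

Result: Experiments show that ASEntmax combined with well-designed position encodings significantly outperforms softmax and its variants on long-context generalization.

Insight: Sparse attention mechanisms such as ASEntmax allocate attention weights precisely, which counters attention dispersion on long sequences and improves the model's focus on fixed patterns.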

Abstract: Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Finally, we show that the ability to locate and generalize fixed-size patterns can be further improved through a careful design of position encodings, which impacts both dense and sparse attention methods. By integrating ASEntmax into standard transformer layers alongside proper positional encodings, we show that our models greatly outperform softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines on long-context generalization.
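The dispersion effect described in the abstract can be reproduced with a small pure-Python sketch. It uses sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax, with a fixed scalar temperature; the learnable temperature of ASEntmax and its exact parameterization are not shown here, and the toy scores are assumptions chosen for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Dense attention weights: every token gets nonzero mass."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores, temperature=1.0):
    """Sparsemax (alpha-entmax with alpha = 2): projection onto the simplex."""
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        if 1 + k * zk > cumsum:       # token k is still in the support
            tau = (cumsum - 1) / k    # threshold for support size k
    return [max(zi - tau, 0.0) for zi in z]

signal = [1.0, 0.8]                   # two informative tokens
short_seq = signal + [0.1] * 10       # few non-informative tokens
long_seq = signal + [0.1] * 1000      # many non-informative tokens

dense_short = softmax(short_seq)
dense_long = softmax(long_seq)
sparse_long = sparsemax(long_seq)
dense_like = sparsemax(short_seq, temperature=10.0)
```

With softmax, the weight on the first informative token shrinks as distractors are added; with sparsemax the distractors receive exactly zero weight regardless of sequence length. Raising the temperature moves sparsemax toward a dense, softmax-like regime, which is the kind of interpolation the abstract attributes to the temperature parameter.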

[114] Arch-Router: Aligning LLM Routing with Human Preferences cs.CL

Co Tran, Salman Paracha, Adil Hafeez, Shuguang Chen

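TL;DR: Arch-Router is a preference-aligned routing framework that improves LLM routing decisions by mapping queries to user-defined domains or action types, and it supports adding new models without retraining.

Details

Motivation: Existing LLM routing approaches typically evaluate performance on benchmarks that fail to fully capture human preferences, and they select from a limited pool of models.

Result: The approach achieves SOTA results on conversational datasets, outperforming top proprietary models while capturing subjective evaluation criteria.

Insight: A preference-aligned routing framework meets user needs more transparently and flexibly, and the compact model design improves practicality.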

Abstract: With the rapid proliferation of large language models (LLMs) – each optimized for different strengths, style, or latency/cost profile – routing has become an essential technique to operationalize the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. In this work, we propose a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) – offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Our approach also supports seamlessly adding new models for routing without requiring retraining or architectural modifications. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. Our approach captures subjective evaluation criteria and makes routing decisions more transparent and flexible. Our model is available at: https://huggingface.co/katanemo/Arch-Router-1.5B.
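One way to picture why new models can be added without retraining is to separate the query-to-policy step from the policy-to-model table. The sketch below is hypothetical throughout: the policy names, model identifiers, and the keyword-based `classify_policy` stand-in are illustrative assumptions and do not reflect the actual Arch-Router model or its interface.

```python
# Hypothetical policy -> model table; editing it needs no retraining.
ROUTES = {
    "travel": "model-a",
    "image_editing": "model-b",
}
DEFAULT_MODEL = "model-default"

def classify_policy(query):
    """Stand-in for the router model (keyword matching, for illustration only)."""
    text = query.lower()
    if "flight" in text or "hotel" in text:
        return "travel"
    if "crop" in text or "photo" in text:
        return "image_editing"
    return None

def route(query, routes=ROUTES):
    """Map a query to a policy, then look up the model assigned to that policy."""
    policy = classify_policy(query)
    return routes.get(policy, DEFAULT_MODEL)

first_choice = route("Book a flight to Lisbon")
ROUTES["travel"] = "model-c"          # swap in a new model for the same policy
second_choice = route("Book a flight to Lisbon")
```

Because the router only predicts a policy label, changing which model serves that policy is a table update rather than a change to the router itself.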

[115] Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations cs.CL

Ananth Agarwal, Jasper Jian, Christopher D. Manning, Shikhar Murty

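TL;DR: The study finds that although large language models (LLMs) show mastery of syntax when processing text, syntactic features extracted via probing fail to predict performance on targeted syntactic evaluations, revealing a disconnect between probing and actual syntactic behavior.

Details

Motivation: To investigate how LLMs represent syntax internally and to test whether probing reliably predicts model performance on syntactic tasks.

Result: Syntactic features extracted by probing do not predict model performance on syntactic tasks, indicating a disconnect between internal syntactic representations and downstream behavior.

Insight: The study challenges the validity of probing as a tool for explaining models' syntactic ability and underscores the need for more direct methods of understanding internal mechanisms.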

Abstract: Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations; however, no comprehensive study has yet established whether a model’s probing accuracy reliably predicts its downstream syntactic performance. Adopting a “mechanisms vs. outcomes” framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.

[116] ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models cs.CL | cs.AI

Bin Chen, Xinzge Gao, Chuanrui Hu, Penghang Yu, Hua Zhang

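TL;DR: ReasonGRM is a three-stage generative reward modeling framework (Zero-RL-guided reasoning path generation, a new evaluation metric $R^\star$, and reinforcement learning fine-tuning) that markedly improves reasoning and preference modeling, surpassing existing GRMs and GPT-4o on several benchmarks.

Details

Motivation: Generative reward models (GRMs) are more flexible in capturing human preferences, but weak reasoning leads to incomplete or overly speculative reasoning paths, producing hallucinations or omitting key information.

Result: On three benchmarks, ReasonGRM outperforms the previous best GRMs by 1.8% on average and surpasses GPT-4o by up to 5.6%.

Insight: High-quality rationale selection and reasoning-aware training are essential for reliable preference modeling.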

Abstract: Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, $R^\star$, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8% on average and surpassing proprietary models such as GPT-4o by up to 5.6%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.

[117] The Role of Model Confidence on Bias Effects in Measured Uncertainties cs.CL | cs.AI

Xinyi Liu, Weiguang Wang, Hangfeng He

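TL;DR: The paper studies how model confidence in large language models (LLMs) shapes the effect of bias on measured epistemic and aleatoric uncertainty. Visual question answering (VQA) experiments show that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. In addition, lower bias-free confidence leads to greater underestimation of epistemic uncertainty (overconfidence) but has no significant directional effect on aleatoric uncertainty.

Details

Motivation: As LLMs are widely used for open-ended tasks, accurately quantifying epistemic uncertainty is essential for reliability. However, open-ended tasks admit multiple valid answers (aleatoric uncertainty), which complicates the quantification of epistemic uncertainty. Bias may affect the estimates of both uncertainties, but its specific role has been unclear.

Result: (1) When bias-free confidence is lower, bias induces larger changes in both uncertainties; (2) lower confidence leads to underestimated epistemic uncertainty but has no significant effect on the direction of change in aleatoric uncertainty.

Insight: The paper reveals an interaction between bias and confidence: at low confidence, models are more susceptible to bias, which leads to overconfidence. This finding can inform the development of more robust uncertainty quantification techniques.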

Abstract: With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model’s lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases induce greater changes in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence leads to greater underestimation of epistemic uncertainty (i.e. overconfidence) due to bias, whereas it has no significant effect on the direction of changes in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.

[118] LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization cs.CL | cs.AI | cs.SD | eess.AS

Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

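TL;DR: LM-SPT proposes a speech tokenization method that uses semantic distillation through reconstruction to better align speech tokens with language models, supports multiple frame rates, and improves speech reconstruction quality.

Details

Motivation: Speech tokenization should extract semantic information and reduce redundancy, but existing methods produce overly long token sequences, and lowering the frame rate distorts semantic structure.

Result: Experiments show that LM-SPT achieves better reconstruction quality than baselines, and that SLMs trained on its tokens are competitive on speech-to-text and consistently outperform baselines on text-to-speech.

Insight: Indirect supervision through reconstruction learns discrete units that are better aligned semantically, and the multi-frame-rate design adds flexibility.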

Abstract: With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
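To make the frame-rate trade-off concrete, the sketch below counts tokens at the three frame rates named in the abstract and shows the rigid average pooling that the abstract describes as the standard technique (not the method LM-SPT itself proposes). The 8-second duration and the toy frame vectors are assumptions for illustration.

```python
def tokens_for(duration_s, frame_rate_hz):
    """Number of speech tokens produced for an utterance at a given frame rate."""
    return int(duration_s * frame_rate_hz)

def average_pool(frames, factor):
    """Average consecutive groups of `factor` frame vectors.

    A trailing partial group is dropped. This is the baseline pooling
    operation, shown only to illustrate how distinct frames get blended.
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / factor for d in range(dim)])
    return pooled

# Token counts for a hypothetical 8-second utterance
counts = {rate: tokens_for(8, rate) for rate in (25, 12.5, 6.25)}

# Four toy 2-dimensional frames at 25Hz, pooled by 2 to reach 12.5Hz
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
pooled = average_pool(frames, factor=2)
```

Halving the frame rate halves the sequence length, but the first two frames, which point in different directions, collapse into a single blended vector. That blending is the kind of dilution the abstract attributes to rigid pooling.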

[119] Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly cs.CL | cs.AI

Lance Ying, Ryan Truong, Katherine M. Collins, Cedegao E. Zhang, Megan Wei

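TL;DR: The paper proposes LIRAS, a multimodal social reasoning framework that combines language and visual inputs. Using symbolic representations and a Bayesian inverse planning engine, it substantially improves performance on social reasoning tasks.

Details

Motivation: Fusing linguistic and visual information is essential in social reasoning; in novel situations especially, language can provide both abstract and concrete information about the environment. Existing methods do not integrate these modalities effectively.

Result: Across multiple social reasoning tasks, LIRAS outperforms existing models and ablations and comes close to human judgments.

Insight: Combining symbolic representations with Bayesian inverse planning is an effective approach to multimodal social reasoning; even a lightweight vision-language model can achieve strong performance.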

Abstract: Drawing real world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations - leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.
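For readers unfamiliar with Bayesian inverse planning, the following toy sketch infers which goal an agent is pursuing from its observed moves. It is a generic textbook-style illustration, not the LIRAS engine: the 1-D world, the Boltzmann-rational action model, and the `beta` rationality parameter are all assumptions introduced here.

```python
import math

def action_likelihood(pos, action, goal, beta=2.0):
    """P(action | position, goal) for a Boltzmann-rational agent.

    Moves are -1, 0, or +1 on a line; moves that end closer to the goal
    have higher utility and are therefore more probable.
    """
    utilities = {a: -abs((pos + a) - goal) for a in (-1, 0, 1)}
    z = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[action]) / z

def goal_posterior(start, actions, goals, beta=2.0):
    """Posterior over candidate goals given a sequence of observed moves."""
    post = {g: 1.0 / len(goals) for g in goals}   # uniform prior
    pos = start
    for a in actions:
        for g in goals:
            post[g] *= action_likelihood(pos, a, g, beta)
        pos += a
    total = sum(post.values())
    return {g: p / total for g, p in post.items()}

# Agent starts at 5 and steps right three times; candidate goals at 0 and 9
posterior = goal_posterior(start=5, actions=[1, 1, 1], goals=[0, 9])
```

Each rightward step is far more likely under the goal at 9 than under the goal at 0, so the posterior concentrates on 9. In the framework the abstract describes, the agent and environment representations fed to such an engine are parsed from language and visual inputs rather than written by hand.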

[120] SocialSim: Towards Socialized Simulation of Emotional Support Conversation cs.CL

Zhuang Chen, Yaru Cao, Guanqun Bi, Jincenzi Wu, Jinfeng Zhou

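TL;DR: The paper proposes SocialSim, a new framework for simulating emotional support conversation (ESC). By integrating key aspects of social interaction (social disclosure and social awareness), it builds SSConv, a high-quality synthetic ESC corpus whose quality can even exceed that of crowdsourced data.

Details

Motivation: Crowdsourcing a large ESC corpus is costly, and existing approaches largely overlook the social dynamics of ESC, which leads to less effective simulations. The paper aims to improve the efficiency and realism of ESC simulation by modeling those dynamics.

Result: The SSConv corpus is of higher quality than crowdsourced data, and a chatbot trained on it achieves the best results in evaluation.

Insight: Integrating social dynamics such as social disclosure and social awareness is key to better ESC simulation, and synthetic corpora can replace or complement crowdsourced data.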

Abstract: Emotional support conversation (ESC) helps reduce people’s psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass that of crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.

Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li

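TL;DR: The paper proposes CAMO, a novel black-box jailbreak attack framework that decomposes malicious prompts across modalities into semantically benign visual and textual fragments, efficiently bypassing existing safety mechanisms.

Details

Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but are vulnerable to jailbreak attacks that bypass safety mechanisms to elicit restricted content. Existing attack methods are inefficient and easily detected.

Result: Evaluations on leading LVLMs validate CAMO's effectiveness, showing strong performance and cross-model transferability.

Insight: The work reveals significant weaknesses in current safety mechanisms and an urgent need for more advanced, alignment-aware security solutions for vision-language systems.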

Abstract: Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs’ cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO’s effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.

[122] DistillNote: LLM-based clinical note summaries improve heart failure diagnosis cs.CL

Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto

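TL;DR: DistillNote is an LLM-based framework for clinical note summarization that generates summaries with three techniques, markedly improving heart failure prediction while reducing text volume and hallucinations.

Details

Motivation: The burden of clinical documentation is heavy, and healthcare providers need more efficient summarization tools to reduce workload and improve diagnostic efficiency.

Result: Distilled summaries achieve 79% text compression and up to an 18.2% improvement in AUPRC. One-step summaries are preferred by clinicians for clinical actionability, while distilled summaries are the most efficient.

Insight: One-step (direct) summaries suit clinical decision-making, while distilled summaries balance performance and efficiency and significantly reduce hallucinations.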

Abstract: Large language models (LLMs) offer unprecedented opportunities to generate concise summaries of patient information and alleviate the burden of clinical documentation that overwhelms healthcare providers. We present DistillNote, a framework for LLM-based clinical note summarization, and generate over 64,000 admission note summaries through three techniques: (1) One-step, direct summarization, and a divide-and-conquer approach involving (2) Structured summarization focused on independent clinical insights, and (3) Distilled summarization that further condenses the Structured summaries. We test how useful the summaries are by using them to predict heart failure, compared to a model trained on the original notes. Distilled summaries achieve 79% text compression and up to 18.2% improvement in AUPRC compared to an LLM trained on the full notes. We also evaluate the quality of the generated summaries in an LLM-as-judge evaluation as well as through blinded pairwise comparisons with clinicians. Evaluations indicate that one-step summaries are favoured by clinicians according to relevance and clinical actionability, while distilled summaries offer optimal efficiency (avg. 6.9x compression-to-performance ratio) and significantly reduce hallucinations. We release our summaries on PhysioNet to encourage future research.

Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai

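TL;DR: MUCAR is a multilingual, cross-modal ambiguity resolution benchmark for evaluating how well multimodal large language models (MLLMs) resolve linguistic and visual ambiguity. It comprises a multilingual dataset and a dual-ambiguity dataset; tests on 19 state-of-the-art models reveal a clear gap to human performance.

Details

Motivation: Existing multimodal benchmarks typically overlook linguistic and visual ambiguity and rely on unimodal context for disambiguation, failing to exploit the potential for modalities to clarify each other. MUCAR fills this gap.

Result: Experiments show that current state-of-the-art models perform far below human level on multimodal ambiguity resolution.

Insight: MUCAR shows that current multimodal models still need to improve at cross-modal ambiguity comprehension; future research should pursue more sophisticated methods to push the boundaries of multimodal reasoning.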

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models–encompassing both open-source and proprietary architectures–reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.

[124] Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 cs.CL

Dominik Macháček, Peter Polák

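TL;DR: This paper presents Charles University's submission to the IWSLT 2025 simultaneous speech translation task. The system uses the Whisper model with the AlignAtt policy and improves performance through prompting and context, yielding substantial gains.

Details

Motivation: For the simultaneous speech translation task, the aim is to improve translation quality and latency across multiple language pairs.

Result: Compared with the baseline, performance improves markedly (for example, by 2 BLEU on Czech to English and by 13-22 BLEU on English to German, Chinese, and Japanese).

Insight: Prompting and context are important for simultaneous translation, and the new latency measure supports more accurate performance evaluation.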

Abstract: This paper describes the Charles University submission to the Simultaneous Speech Translation Task of IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt. We further improve the performance by prompting to inject in-domain terminology, and we accommodate context. Our cascaded systems further use EuroLLM for unbounded simultaneous translation. Compared to the Organizers’ baseline, our systems improve by 2 BLEU points on Czech to English and 13-22 BLEU points on English to German, Chinese and Japanese on the development sets. We also propose a new enhanced measure of speech recognition latency.

[125] Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs cs.CL | cs.AI

Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha

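TL;DR: Tower+ is a suite of multilingual LLMs that uses a novel training recipe to balance translation specialization with general-purpose capabilities, achieving strong performance on both.

Details

Motivation: Fine-tuning pretrained LLMs improves performance on specific tasks but sacrifices general-purpose capabilities. Tower+ aims to optimize translation and general-purpose text capabilities together.

Result: The smaller models often outperform larger general-purpose LLMs; the largest model delivers the best translation performance for high-resource languages and top results on the IF-MT benchmark.

Insight: With a suitable training recipe, it is possible to optimize for a specific domain such as translation while retaining general capabilities that rival frontier models.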

Abstract: Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.

[126] Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation cs.CL

Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu

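TL;DR: Through an empirical study, the paper finds that although chain-of-thought (CoT) prompting reduces hallucination in large language models (LLMs), it also obscures key signals used to detect hallucinations, reducing the accuracy of detection methods.

Details

Motivation: LLMs often generate factually incorrect or semantically irrelevant content (hallucinations). CoT prompting can reduce hallucinations through step-by-step reasoning, but its effect on hallucination detection has not been studied in depth.

Result: CoT prompting lowers hallucination frequency but weakens the accuracy and confidence of detection methods.

Insight: The study highlights a trade-off in using CoT prompting during reasoning: reducing hallucinations may come at the cost of detectability.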

Abstract: Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://anonymous.4open.science/r/cot-hallu-detect.

[127] CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models cs.CLPDF

Naiming Liu, Richard Baraniuk, Shashank Sonkar

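TL;DR: CLEAR-3K is a dataset of 3,000 assertion-reasoning questions that tests whether language models can judge if one statement causally explains another. The study shows that models often confuse semantic relatedness with causal explanation, and that as parameter count grows they shift from being overly skeptical to overly accepting of causal relationships, while performance remains limited.

Details

Motivation: To build a benchmark for the causal reasoning ability of language models and expose the limits of current models in separating semantic relatedness from genuine causal explanation.

Result: Language models often confuse semantic similarity with causality; larger models tend to shift from over-skepticism to over-acceptance of causal relationships, yet even the best models reach only MCC = 0.55.

Insight: Current language models leave considerable room for improvement in causal reasoning, and CLEAR-3K offers a benchmark for future work.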

Abstract: We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews Correlation Coefficient plateaus at just 0.55, even for the best-performing models. Hence, CLEAR-3K provides a crucial benchmark for developing and evaluating genuine causal reasoning in language models, which is an essential capability for applications that require accurate assessment of causal relationships.
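
The abstract reports results with the Matthews Correlation Coefficient (MCC). As a reminder of why MCC suits the skeptical-versus-permissive finding, here is a minimal sketch that computes it from binary confusion-matrix counts. The function name `mcc` and all counts are illustrative assumptions, not values from the paper.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal is empty (zero denominator).
    return numerator / denominator if denominator else 0.0

# Hypothetical counts: a permissive model that accepts almost every causal claim.
permissive = mcc(tp=95, tn=10, fp=90, fn=5)
# Hypothetical counts: a skeptical model that rejects almost every causal claim.
skeptical = mcc(tp=10, tn=95, fp=5, fn=90)
```

In this toy setup both models are right on 105 of 200 items, yet both score an MCC close to zero, because the metric penalizes a one-sided bias in either direction.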

[128] Towards AI Search Paradigm cs.CL | cs.AI | cs.IRPDF

Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen

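TL;DR: The paper proposes the AI Search Paradigm, a blueprint for next-generation search systems that emulate human information processing and decision-making, built on a modular architecture of four LLM-powered agents that adapt dynamically to varied information needs.

Details

Motivation: Current search systems have limited ability to handle complex queries and multi-stage reasoning, so a more flexible and scalable AI search paradigm is needed.

Result: The paper systematically presents the key techniques for realizing the paradigm, covering both algorithmic and infrastructure-level optimizations, as guidance for building trustworthy, adaptive, and scalable AI search systems.

Insight: Modular agents that collaborate dynamically let an AI search system handle complex information needs more flexibly, which lays groundwork for future search technology.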

Abstract: In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.

[129] LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles cs.CL | cs.AI | cs.CVPDF

Ho Yin ‘Sam’ Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka

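TL;DR: LaMP-Cap is a dataset for personalized figure caption generation with multimodal profiles; combining image and text context brings generated captions closer to the author's original style.

Details

Motivation: Existing figure captioning models lack personalization, so authors usually have to revise generated captions by hand to match their writing style and domain conventions.

Result: Experiments show that multimodal profile information consistently brings generated captions closer to the author-written originals, with profile images contributing more than figure-mentioning paragraphs.

Insight: Multimodal profiles are more effective than text-only profiles for personalized caption generation.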

Abstract: Figure captions are crucial for helping readers understand and remember a figure’s key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain’s style, highlighting the need for personalization. Despite language models’ personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document–each with its image, caption, and figure-mentioning paragraphs–as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.

astro-ph.EP [Back]

[130] Exoplanet Classification through Vision Transformers with Temporal Image Analysis astro-ph.EP | astro-ph.IM | cs.CVPDF

Anupma Choudhary, Sohith Bandari, B. S. Kushvah, C. Swastik

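TL;DR: The paper analyzes raw light curves from the Kepler mission with a Vision Transformer (ViT), converting the data into images via Gramian Angular Fields (GAFs) and Recurrence Plots (RPs) to capture temporal dependencies. The ViT performs better on RPs than on GAFs, with high recall and precision.

Details

Motivation: Traditional exoplanet classification requires substantial computational and observational resources and is inefficient, which motivates more efficient machine learning techniques.

Result: The ViT performs best on RPs, reaching 89.46% recall and 85.09% precision.

Insight: Under-sampling was used to mitigate class imbalance, but the resulting reduction in dataset size remains the main limitation. Further work on optimizing model architectures is needed to improve automation, performance, and generalization.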

Abstract: The classification of exoplanets has been a longstanding challenge in astronomy, requiring significant computational and observational resources. Traditional methods demand substantial effort, time, and cost, highlighting the need for advanced machine learning techniques to enhance classification efficiency. In this study, we propose a methodology that transforms raw light curve data from NASA’s Kepler mission into Gramian Angular Fields (GAFs) and Recurrence Plots (RPs) using the Gramian Angular Difference Field and recurrence plot techniques. These transformed images serve as inputs to the Vision Transformer (ViT) model, leveraging its ability to capture intricate temporal dependencies. We assess the performance of the model through recall, precision, and F1 score metrics, using a 5-fold cross-validation approach to obtain a robust estimate of the model’s performance and reduce evaluation bias. Our comparative analysis reveals that RPs outperform GAFs, with the ViT model achieving an 89.46% recall and an 85.09% precision rate, demonstrating its significant capability in accurately identifying exoplanetary transits. Despite using under-sampling techniques to address class imbalance, dataset size reduction remains a limitation. This study underscores the importance of further research into optimizing model architectures to enhance automation, performance, and generalization of the model.
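
To make the two image encodings named in the abstract concrete, the sketch below turns a short 1-D series into a binary recurrence plot and a Gramian Angular Difference Field. It is a simplified illustration under stated assumptions: the recurrence plot compares raw scalar samples without time-delay embedding, the threshold `eps` and the flux values are made up, and none of this is taken from the authors' pipeline.

```python
import math

def recurrence_plot(series, eps):
    """Binary recurrence plot: entry (i, j) is 1 when samples i and j differ by at most eps."""
    return [[1 if abs(a - b) <= eps else 0 for b in series] for a in series]

def gadf(series):
    """Gramian Angular Difference Field: sin(phi_i - phi_j) after rescaling to [-1, 1]."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant series
    scaled = [2.0 * (x - lo) / span - 1.0 for x in series]
    # Clamp before acos to guard against rounding just outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    return [[math.sin(a - b) for b in phi] for a in phi]

# Toy light curve: flat flux with a transit-like dip (made-up values).
flux = [1.00, 1.00, 0.99, 0.95, 0.95, 0.99, 1.00, 1.00]
rp = recurrence_plot(flux, eps=0.02)
field = gadf(flux)
```

Both outputs are square matrices with one row per time step, so they can be rendered as single-channel images and fed to an image classifier.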

cs.SE [Back]

[131] Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems cs.SE | cs.AI | cs.CLPDF

Matias Martinez, Xavier Franch

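TL;DR: The paper presents the first comprehensive analysis of all submissions to the SWE-Bench Lite and Verified leaderboards, revealing the dominance of proprietary LLMs (such as Claude 3.5/3.7) and a contributor base ranging from individual developers to large tech companies.

Details

Motivation: Because the SWE-Bench submission process does not require detailed documentation, the architecture and origin of many solutions are unclear, which hinders understanding of their technical details.

Result: Proprietary LLMs (such as Claude 3.5/3.7) dominate, both agentic and non-agentic designs are present, and contributors include individuals and large companies.

Insight: LLMs perform well in APR, but open-source models have low participation, and both agentic and non-agentic architectures have their use cases.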

Abstract: The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.

cs.CY [Back]

[132] TrajSceneLLM: A Multimodal Perspective on Semantic GPS Trajectory Analysis cs.CY | cs.CVPDF

Chunhou Ji, Qiumeng Li

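TL;DR: TrajSceneLLM is a multimodal approach that combines map images with LLM-generated text descriptions to produce semantically rich trajectory embeddings, improving semantic understanding of GPS trajectories.

Details

Motivation: Traditional methods struggle to extract deep semantic information from GPS trajectory data and to incorporate contextual map information. TrajSceneLLM aims to address both.

Result: Experiments show that TrajSceneLLM performs strongly on travel mode identification, clearly outperforming traditional methods.

Insight: A multimodal (image + text) approach captures the spatio-temporal dependencies of GPS trajectories and reduces reliance on handcrafted features, suggesting a new direction for geospatial AI.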

Abstract: GPS trajectory data reveals valuable patterns of human mobility and urban dynamics, supporting a variety of spatial applications. However, traditional methods often struggle to extract deep semantic representations and incorporate contextual map information. We propose TrajSceneLLM, a multimodal perspective for enhancing semantic understanding of GPS trajectories. The framework integrates visualized map images (encoding spatial context) and textual descriptions generated through LLM reasoning (capturing temporal sequences and movement dynamics). Separate embeddings are generated for each modality and then concatenated to produce trajectory scene embeddings with rich semantic content which are further paired with a simple MLP classifier. We validate the proposed framework on Travel Mode Identification (TMI), a critical task for analyzing travel choices and understanding mobility behavior. Our experiments show that these embeddings achieve significant performance improvement, highlighting the advantage of our LLM-driven method in capturing deep spatio-temporal dependencies and reducing reliance on handcrafted features. This semantic enhancement promises significant potential for diverse downstream applications and future research in geospatial artificial intelligence. The source code and dataset are publicly available at: https://github.com/februarysea/TrajSceneLLM.

cs.LG [Back]

[133] MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference cs.LG | cs.AI | cs.CLPDF

Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv

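TL;DR: MadaKV is a modality-adaptive KV cache eviction strategy. By dynamically sensing modality information within attention heads and applying hierarchical compression compensation, it reduces KV cache memory footprint and decoding latency while maintaining high accuracy on multimodal long-context tasks.

Details

Motivation: In multimodal settings, attention heads differ markedly in their modality preferences, and KV cache eviction strategies designed for unimodal settings fail to capture modality-specific information, which hurts performance.

Result: On the MileBench benchmark, MadaKV outperforms existing KV cache eviction methods, reducing memory footprint and latency while maintaining high accuracy.

Insight: Differences in modality preference across attention heads are a key factor in optimizing cache eviction for multimodal models; adapting to them improves efficiency.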

Abstract: This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3 to 1.5 times improvement) while maintaining high accuracy across various multimodal long-context tasks. Extensive experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV compared to existing KV cache eviction methods.

[134] Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute cs.LG | cs.AI | cs.CL | eess.SPPDF

Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen

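TL;DR: The paper proposes Fractional Reasoning, a training-free, model-agnostic framework that adjusts reasoning intensity to match the complexity of each input, improving performance on reasoning tasks.

Details

Motivation: Existing methods (such as Best-of-N and majority voting) apply a uniform reasoning depth to all inputs, ignoring that different problems may need different reasoning intensity. This limits gains in efficiency and correctness.

Result: Experiments on GSM8K, MATH500, and GPQA show that Fractional Reasoning consistently improves reasoning performance across diverse tasks and models.

Insight: Dynamically adjusting reasoning intensity allocates compute more efficiently and improves the correctness of reasoning, offering a new angle on test-time compute.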

Abstract: Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
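
The core operation described in the abstract, reapplying a latent steering vector with a tunable scaling factor, can be sketched in a few lines. This is an illustration only: estimating the vector as a mean difference between paired hidden states is an assumption (the abstract does not say how the vector is extracted), and the names `mean_difference`, `steer`, and `alpha` are hypothetical.

```python
def mean_difference(with_prompt, without_prompt):
    """Steering vector as the mean difference between paired hidden states (assumed recipe)."""
    count = len(with_prompt)
    dim = len(with_prompt[0])
    return [
        sum(w[d] - wo[d] for w, wo in zip(with_prompt, without_prompt)) / count
        for d in range(dim)
    ]

def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering vector by a tunable factor alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 2-D hidden states for the same inputs with and without a reasoning prompt.
with_prompt = [[1.0, 2.0], [3.0, 2.0]]
without_prompt = [[0.0, 2.0], [1.0, 2.0]]
vector = mean_difference(with_prompt, without_prompt)

# alpha = 0 leaves the state unchanged; fractional values apply partial steering.
half_steered = steer([0.0, 1.0], vector, alpha=0.5)
```

Because `alpha` is continuous, the same vector can be applied weakly to easy inputs and strongly to hard ones, which is the kind of per-input control the abstract describes.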

[135] Early Attentive Sparsification Accelerates Neural Speech Transcription cs.LG | cs.CL | cs.SD | eess.ASPDF

Zifei Xu, Sayeh Sharify, Hesham Mostafa, Tristan Webb, Wanzin Yazar

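TL;DR: The paper studies early sparsification guided by self-attention to accelerate neural speech transcription, finding that at under 1% accuracy degradation, 40-60% sparsity yields up to 1.6x speedup.

Details

Motivation: Speech signals are highly compressible; the self-attention mechanism of transformers can be used to sparsify early and thereby accelerate neural speech transcription.

Result: On English speech transcription, sparsifying to 40-60% achieves up to 1.6x speedup with under 1% accuracy degradation.

Insight: Early sparsification is an effective way to accelerate neural speech transcription, and the sparsification stage and compression ratio need to be optimized jointly.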

Abstract: Transformer-based neural speech processing has achieved state-of-the-art performance. Since speech audio signals are known to be highly compressible, here we seek to accelerate neural speech transcription by time-domain signal sparsification early in the neural encoding stage, taking advantage of the interpretability of the self-attention mechanism in transformer audio encoders. With the Whisper family of models, we perform a systematic architecture search over the joint space of sparsification stage (a certain encoder layer) and compression ratio (sparsity). We found that the best resulting solutions under 1% accuracy degradation choose to sparsify the hidden state to 40-60% sparsity at an early encoding stage, and thereby achieve up to 1.6x runtime acceleration in English speech transcription tasks on Nvidia GPUs without any fine-tuning.

[136] Probing the Robustness of Large Language Models Safety to Latent Perturbations cs.LG | cs.AI | cs.CL | cs.CRPDF

Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li

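TL;DR: The paper studies how robust LLM safety alignment is to latent perturbations, and proposes a new probing method and a training strategy, LAPT, to strengthen that robustness.

Details

Motivation: Despite progress in safety alignment, models can still produce unsafe responses under latent perturbations, which reveals how shallow current alignment methods are.

Result: Experiments show that LAPT improves the robustness of safety alignment without compromising general capabilities.

Insight: Current alignment methods focus on surface-level behavior rather than internal representations; representation-level training strategies are needed.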

Abstract: Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Code and results are available at https://github.com/Carol-gutianle/LatentSafety.
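
The probe in the abstract measures the Negative Log-Likelihood (NLL) of the model's original response. A minimal sketch of that quantity follows, assuming per-token logits are already available before and after a latent perturbation. The logits are made-up numbers, the helper names are hypothetical, and how the perturbation is applied inside the model is outside the scope of the sketch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def response_nll(step_logits, token_ids):
    """Mean negative log-likelihood of a fixed response, one logit list per token."""
    total = 0.0
    for logits, token in zip(step_logits, token_ids):
        total -= log_softmax(logits)[token]
    return total / len(token_ids)

# Made-up logits over a 3-token vocabulary for a 2-token original response.
response = [0, 2]
clean_logits = [[4.0, 1.0, 0.0], [0.0, 1.0, 5.0]]
# Hypothetical logits after a small shift in hidden activations.
perturbed_logits = [[2.0, 2.5, 0.0], [0.0, 3.0, 3.5]]

# A larger increase means the original response became less likely under the shift.
sensitivity = response_nll(perturbed_logits, response) - response_nll(clean_logits, response)
```

Comparing this difference across perturbation directions gives a simple ranking of which directions move the model furthest from its original response.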

[137] Latent Concept Disentanglement in Transformer-based Language Models cs.LG | cs.AI | cs.CLPDF

Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan

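TL;DR: The paper studies how transformer-based LLMs disentangle and use latent concepts. It finds that models identify latent concepts and compose them step by step in multi-step reasoning tasks, and exhibit low-dimensional subspace structure in tasks with continuous latent concepts.

Details

Motivation: To determine whether transformers performing in-context learning (ICL) actually represent latent structure or solve the problem through shortcuts, addressing a gap in prior work on the relationship between latent concepts and learned representations.

Result: The model identifies latent concepts and composes them step by step in 2-hop tasks; in continuous-concept tasks, the geometry of the representation space matches the underlying parameterization.

Insight: Transformers show highly localized structures in ICL tasks that disentangle latent concepts, along with low-dimensional subspace geometry, giving a new view of their internal representations.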

Abstract: When large language models (LLMs) use in-context learning (ICL) to solve a new task, they seem to grasp not only the goal of the task but also core, latent concepts in the demonstration examples. This begs the question of whether transformers represent latent structures as part of their computation or whether they take shortcuts to solve the problem. Prior mechanistic work on ICL does not address this question because it does not sufficiently examine the relationship between the learned representation and the latent concept, and the considered problem settings often involve only single-step reasoning. In this work, we examine how transformers disentangle and use latent concepts. We show that in 2-hop reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. In tasks parameterized by a continuous latent concept, we find low-dimensional subspaces in the representation space where the geometry mimics the underlying parameterization. Together, these results refine our understanding of ICL and the representation of transformers, and they provide evidence for highly localized structures in the model that disentangle latent concepts in ICL tasks.

[138] From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers cs.LG | cs.AI | cs.CLPDF

Jingtong Su, Julia Kempe, Karen Ullrich

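TL;DR: The paper proposes SAMD (Scalable Attention Module Discovery) and SAMI (Scalar Attention Module Intervention), which locate the attention heads associated with arbitrary complex concepts in transformer models and control a concept's effect by adjusting a single scalar parameter.

Details

Motivation: Existing attribution methods focus on MLP neurons and simple concepts, overlooking the attention mechanism and lacking a unified way to analyze complex concepts.

Result: (1) Module locations remain stable before and after LLM post-training; (2) diminishing the "safety" concept raises jailbreak success on HarmBench by 72.7%; (3) amplifying the "reasoning" concept improves GSM8K performance by 1.6%; (4) the approach carries over to vision, shown on ImageNet.

Insight: Specific attention modules are clearly associated with complex concepts, and lightweight intervention on them can substantially change model behavior, providing a new tool for interpretability and control.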

Abstract: Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing performance and improving behavioral control. Attribution methods help advance interpretability by assigning model outputs associated with a target concept to specific model components. Current attribution research primarily studies multi-layer perceptron neurons and addresses relatively simple concepts such as factual associations (e.g., Paris is located in France). This focus tends to overlook the impact of the attention mechanism and lacks a unified approach for analyzing more complex concepts. To fill these gaps, we introduce Scalable Attention Module Discovery (SAMD), a concept-agnostic method for mapping arbitrary, complex concepts to specific attention heads of general transformer models. We accomplish this by representing each concept as a vector, calculating its cosine similarity with each attention head, and selecting the TopK-scoring heads to construct the concept-associated attention module. We then propose Scalar Attention Module Intervention (SAMI), a simple strategy to diminish or amplify the effects of a concept by adjusting the attention module using only a single scalar parameter. Empirically, we demonstrate SAMD on concepts of varying complexity, and visualize the locations of their corresponding modules. Our results demonstrate that module locations remain stable before and after LLM post-training, and confirm prior work on the mechanics of LLM multilingualism. Through SAMI, we facilitate jailbreaking on HarmBench (+72.7%) by diminishing “safety” and improve performance on the GSM8K benchmark (+1.6%) by amplifying “reasoning”. Lastly, we highlight the domain-agnostic nature of our approach by suppressing the image classification accuracy of vision transformers on ImageNet.
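
The abstract describes SAMD as scoring each attention head by cosine similarity with a concept vector and keeping the TopK heads, and SAMI as scaling that module with a single scalar. The sketch below follows that description on toy data. Representing each head by one output vector keyed by `(layer, head)` is an assumption made for illustration, and the function names are hypothetical rather than the authors' API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def discover_module(concept, head_outputs, k):
    """Return the k head ids whose output vector aligns best with the concept vector."""
    scores = {head: cosine(concept, out) for head, out in head_outputs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def intervene(head_outputs, module, scale):
    """Scale the outputs of heads in the module by one scalar; copy the rest unchanged."""
    return {
        head: [scale * x for x in out] if head in module else list(out)
        for head, out in head_outputs.items()
    }

# Toy setup: four heads keyed by (layer, head) with 2-D output vectors.
concept = [1.0, 0.0]
heads = {(0, 0): [2.0, 0.1], (0, 1): [0.0, 1.0], (1, 0): [1.0, 1.0], (1, 1): [-1.0, 0.0]}

module = discover_module(concept, heads, k=2)
suppressed = intervene(heads, module, scale=0.0)  # scale < 1 diminishes, > 1 amplifies
```

A scale below 1 diminishes the concept and a scale above 1 amplifies it, which mirrors the two uses reported in the abstract.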

[139] Shadow defense against gradient inversion attack in federated learning cs.LG | cs.AI | cs.CR | cs.CVPDF

Le Jiang, Liyan Ma, Guang Yang

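TL;DR: The paper proposes a defense framework based on a shadow model with interpretability to counter gradient inversion attacks in federated learning, injecting targeted noise to protect sensitive information while minimizing the impact on model performance.

Details

Motivation: Federated learning (FL) matters for privacy-preserving distributed training, but gradient inversion attacks (GIAs) can leak sensitive data. Existing defenses lack a detailed understanding of which gradients or image information are most vulnerable, leading to either insufficient or excessive protection.

Result: On the ChestXRay and EyePACS datasets, the defense substantially reduces the effectiveness of gradient inversion attacks (PSNR/SSIM discrepancies of 3.73/0.2 and 2.78/0.166, respectively) with less than 1% loss in model performance.

Insight: Shadow models and interpretability analysis help pinpoint sensitive information, balancing privacy protection against model performance. The paper presents the method as a general defense against gradient inversion attacks.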

Abstract: Federated learning (FL) has emerged as a transformative framework for privacy-preserving distributed training, allowing clients to collaboratively train a global model without sharing their local data. This is especially crucial in sensitive fields like healthcare, where protecting patient data is paramount. However, privacy leakage remains a critical challenge, as the communication of model updates can be exploited by potential adversaries. Gradient inversion attacks (GIAs), for instance, allow adversaries to approximate the gradients used for training and reconstruct training images, thus stealing patient privacy. Existing defense mechanisms obscure gradients, yet lack a nuanced understanding of which gradients or types of image information are most vulnerable to such attacks. These indiscriminate calibrated perturbations result in either excessive privacy protection that degrades model accuracy or insufficient protection that fails to safeguard sensitive information. Therefore, we introduce a framework that addresses these challenges by leveraging a shadow model with interpretability for identifying sensitive areas. This enables a more targeted and sample-specific noise injection. Specifically, our defensive strategy achieves discrepancies of 3.73 in PSNR and 0.2 in SSIM compared to the circumstance without defense on the ChestXRay dataset, and 2.78 in PSNR and 0.166 in SSIM on the EyePACS dataset. Moreover, it minimizes adverse effects on model performance, with less than 1% F1 reduction compared to SOTA methods. Our extensive experiments, conducted across diverse types of medical images, validate the generalization of the proposed framework. The stable defense improvements for FedAvg are consistently over 1.5% times in LPIPS and SSIM. It also offers a universal defense against various GIA types, especially for these sensitive areas in images.

[140] Subspace-Boosted Model Merging cs.LG | cs.AI | cs.CVPDF

Ronald Skorobogat, Karsten Roth, Mariana-Iuliana Georgescu, Zeynep Akata

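TL;DR: The paper proposes Subspace Boosting to address the performance degradation caused by rank collapse of the task vector space when merging many expert models, substantially improving merging results.

Details

Motivation: Merging expert models is efficient, but as the number of merged experts grows, gains diminish or even reverse. The paper analyzes this from a task arithmetic perspective and identifies rank collapse of the task vector space as the main cause.

Result: On vision benchmarks, Subspace Boosting improves merging efficacy by more than 10% and supports merging up to 20 expert models.

Insight: Rank collapse of the task vector space is a key cause of degraded performance in multi-expert merging, and Subspace Boosting is an effective remedy.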

Abstract: Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, merging an increasing number of specialized experts generally leads to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective, revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on vision benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to further quantify task similarity, offering a new interpretable perspective on model merging.

[141] From Lab to Factory: Pitfalls and Guidelines for Self-/Unsupervised Defect Detection on Low-Quality Industrial Images cs.LG | cs.CV | stat.AP | 62-06 | G.3; I.4; I.5PDF

Sebastian Hönel, Jonas Nordqvist

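TL;DR: The paper examines common pitfalls in self-supervised/unsupervised defect detection on low-quality industrial images and offers practical guidelines and an improved framework.

Details

Motivation: Defect detection in industrial production has traditionally relied on manual inspection, which is costly and error-prone. Machine learning could replace it, but existing methods perform poorly on low-quality data.

Result: The improved framework increases the robustness and accuracy of defect detection, making it better suited to real industrial settings.

Insight: Commonly reported metrics (such as AUROC) can be misleading in practice, so evaluation methods better suited to industrial settings are needed; data quality and model robustness are the key issues.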

Abstract: The detection and localization of quality-related problems in industrially mass-produced products has historically relied on manual inspection, which is costly and error-prone. Machine learning has the potential to replace manual handling. As such, the desire is to facilitate an unsupervised (or self-supervised) approach, as it is often impossible to specify all conceivable defects ahead of time. A plethora of prior works have demonstrated the aptitude of common reconstruction-, embedding-, and synthesis-based methods in laboratory settings. However, in practice, we observe that most methods do not handle low data quality well or exhibit low robustness in unfavorable, but typical real-world settings. For practitioners it may be very difficult to identify the actual underlying problem when such methods underperform. Worse, often-reported metrics (e.g., AUROC) are rarely suitable in practice and may give misleading results. In our setting, we attempt to identify subtle anomalies on the surface of blasted forged metal parts, using rather low-quality RGB imagery only, which is a common industrial setting. We specifically evaluate two types of state-of-the-art models that allow us to identify and improve quality issues in production data, without having to obtain new data. Our contribution is to provide guardrails for practitioners that allow them to identify problems related to, e.g., (lack of) robustness or invariance, in either the chosen model or the data reliably in similar scenarios. Furthermore, we exemplify common pitfalls in and shortcomings of likelihood-based approaches and outline a framework for proper empirical risk estimation that is more suitable for real-world scenarios.

cs.RO [Back]

[142] Semantic and Feature Guided Uncertainty Quantification of Visual Localization for Autonomous Vehicles cs.RO | cs.CVPDF

Qiyuan Wu, Mark Campbell

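TL;DR: The paper proposes an uncertainty quantification method for visual localization in autonomous driving: a lightweight sensor error model combines image features and semantic information to predict a 2D error distribution, validated on the Ithaca365 dataset.

Details

Motivation: Safety-critical applications such as autonomous driving need accurate quantification of sensor measurement uncertainty, especially in visual localization, where existing methods do not fully account for environment and context.

Result: Validation on Ithaca365 shows that under poor weather and lighting, the error distribution is better described by a Gaussian mixture model (GMM) than by a single Gaussian.

Insight: Measurement error distributions in visual localization depend strongly on environment and context, particularly under non-ideal conditions (such as snow or night), where a Gaussian mixture model captures the complexity more accurately.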

Abstract: The uncertainty quantification of sensor measurements coupled with deep learning networks is crucial for many robotics systems, especially for safety-critical applications such as self-driving cars. This paper develops an uncertainty quantification approach in the context of visual localization for autonomous driving, where locations are selected based on images. Key to our approach is learning the measurement uncertainty using a lightweight sensor error model, which maps both image features and semantic information to a 2-dimensional error distribution. Our approach enables uncertainty estimation conditioned on the specific context of the matched image pair, implicitly capturing other critical, unannotated factors (e.g., city vs highway, dynamic vs static scenes, winter vs summer) in a latent manner. We demonstrate the accuracy of our uncertainty prediction framework using the Ithaca365 dataset, which includes variations in lighting and weather (sunny, night, snowy). The uncertainty quantification of the sensor+network is evaluated, along with Bayesian localization filters that use a unique sensor gating method. Results show that the measurement error does not follow a Gaussian distribution under poor weather and lighting conditions, and is better predicted by our Gaussian Mixture model.

[143] Noise Fusion-based Distillation Learning for Anomaly Detection in Complex Industrial Environments cs.RO | cs.CVPDF

Jiawen Yu, Jieji Ren, Yang Chang, Qiaojun Yu, Xuan Tong

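TL;DR: The paper proposes HetNet, a noise-fusion-based distillation learning method for anomaly detection in complex industrial environments. With a heterogeneous teacher network, an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module, it addresses defect detection under environmental fluctuations.

Details

Motivation: Existing methods detect surface defects in predefined or controlled environments but perform poorly in complex industrial settings (such as varying views and illumination). The paper aims to improve anomaly detection in such settings.

Result: HetNet performs strongly on mainstream benchmarks, with roughly 10% improvement across evaluation metrics on MSC-AD and state-of-the-art results on other datasets. Real-world tests confirm its real-time operation and robustness.

Insight: Noise generation and feature fusion make the model more adaptable to complex environments, offering a reliable approach to industrial anomaly detection.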

Abstract: Anomaly detection and localization in automated industrial manufacturing can significantly enhance production efficiency and product quality. Existing methods are capable of detecting surface defects in pre-defined or controlled imaging environments. However, accurately detecting workpiece defects in complex and unstructured industrial environments with varying views, poses and illumination remains challenging. We propose a novel anomaly detection and localization method specifically designed to handle inputs with perturbative patterns. Our approach introduces a new framework based on a collaborative distillation heterogeneous teacher network (HetNet), an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module. HetNet can learn to model the complex feature distribution of normal patterns using limited information about local disruptive changes. We conducted extensive experiments on mainstream benchmarks. HetNet demonstrates superior performance with approximately 10% improvement across all evaluation metrics on MSC-AD under industrial conditions, while achieving state-of-the-art results on other datasets, validating its resilience to environmental fluctuations and its capability to enhance the reliability of industrial anomaly detection systems across diverse scenarios. Tests in real-world environments further confirm that HetNet can be effectively integrated into production lines to achieve robust and real-time anomaly detection. Codes, images and videos are published on the project website at: https://zihuatanejoyu.github.io/HetNet/

[144] FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation cs.RO | cs.CVPDF

Sen Wang, Le Wang, Sanping Zhou, Jingyi Tian, Jiayi Li

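TL;DR: FlowRAM is a framework that combines generative modeling with region-aware perception, improving both performance and efficiency on high-precision robotic manipulation tasks.

Details

Motivation: Existing diffusion-based policy learning methods are computationally inefficient at inference and do not fully exploit the potential of generative models for information exploration in 3D environments.

Result: On the RLBench benchmark, FlowRAM improves average success rate by 12.0% over previous methods and generates physically plausible actions in fewer than 4 time steps, considerably speeding up inference.

Insight: Combining generative models with region-aware perception improves precision while preserving computational efficiency, suggesting a new direction for high-precision robotic manipulation.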

Abstract: Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we incorporate state space models to fuse multimodal information while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of FlowRAM on RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 time steps, significantly increasing inference speed.
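
The abstract mentions learning action poses by regressing deterministic vector fields with conditional flow matching, and sampling in a handful of steps. The sketch below shows the generic ingredients under explicit assumptions: a straight-line interpolation path with target velocity `x1 - x0` (a common choice, not confirmed as the paper's), a hand-written `ideal_field` standing in for a trained network, and a made-up 3-D goal pose.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the vector field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(field, x0, steps):
    """Integrate dx/dt = field(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        velocity = field(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

# Hypothetical goal pose (position only) and the field a perfect model would output for it.
goal = [0.5, -0.2, 0.3]

def ideal_field(x, t):
    """Stand-in for a trained network: points straight at the goal along the path."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(goal, x)]

pose = euler_sample(ideal_field, [0.0, 0.0, 0.0], steps=4)
```

With a field this straight, four Euler steps already land on the goal, which illustrates why flow-based policies can use far fewer steps than iterative denoising.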

[145] Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control cs.RO | cs.AI | cs.CVPDF

Yuxin Chen, Jianglan Wei, Chenfeng Xu, Boyi Li, Masayoshi Tomizuka

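TL;DR: Proposes ReOI, which intervenes on observations at test time to remove visual distractors and make world-model predictions more reliable in open-world settings.

Details

Motivation: World models are brittle when they encounter visual distractors rarely seen during training (such as objects or background elements); the distractors corrupt predictions and cause robot planning to fail.

Result: On robotic manipulation tasks, ReOI improves task success rates by up to 3x in the presence of novel distractors and is robust to both in-distribution and out-of-distribution distractors.

Insight: Test-time observation intervention is a simple yet effective strategy that markedly improves world-model robustness to visual distractors in open-world environments.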

Abstract: World models enable robots to “imagine” future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI “reimagines” future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3x in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.

[146] CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity cs.RO | cs.CV | cs.LG | cs.SEPDF

Guang Yin, Yitong Li, Yixuan Wang, Dale McConachie, Paarth Shah

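TL;DR: The paper proposes CodeDiffuser, a framework that combines a vision-language model (VLM) with an attention-enhanced diffusion policy to resolve ambiguity in natural-language instructions for robotic manipulation, using generated code as an interpretable intermediate representation.

Details

Motivation: Existing language-conditioned manipulation policies typically rely on end-to-end models that lack modularity and interpretability, which leads to suboptimal performance. The paper aims to resolve instruction ambiguity by introducing an interpretable intermediate code representation and an attention mechanism.

Result: Experiments show that CodeDiffuser performs well on tasks involving language ambiguity, contact-rich manipulation, and multi-object interaction, outperforming existing imitation learning methods.

Insight: Current imitation learning methods adapt poorly to language and environmental variations. Introducing an interpretable intermediate representation and an attention mechanism markedly improves robustness and controllability.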

Abstract: Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction “Hang a mug on the mug tree” may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.

[147] Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping cs.RO | cs.CVPDF

Teng Guo, Baichuan Huang, Jingjin Yu

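TL;DR: Proposes MOMA, a one-shot metric-depth alignment framework that recovers metric depth from a single RGB image by adapting existing monocular depth estimation models once, addressing transparent objects and poor generalization.

Details

Motivation: 6D pose estimation currently relies on depth sensors that are expensive, noisy, and poor at handling transparent objects, while existing monocular depth estimation models perform poorly on metric depth. MOMA aims to solve these problems with a one-shot adaptation.

Result: In tabletop two-finger grasping and suction-based bin picking, MOMA achieves high success rates, confirming its broad applicability and effectiveness.

Insight: One-shot adaptation can markedly improve the metric-depth accuracy of monocular depth models, and is especially useful for handling transparent objects in robotic manipulation.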

Abstract: Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments during camera calibration, guided by sparse ground-truth depth points, enabling accurate depth estimation without additional data collection or model retraining on the testing setup. MOMA supports fine-tuning the MDEM on transparent objects, demonstrating strong generalization capabilities. Real-world experiments on tabletop 2-finger grasping and suction-based bin-picking applications show MOMA achieves high success rates in diverse tasks, confirming its effectiveness.
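
To make the alignment idea concrete, here is a minimal sketch of fitting a scale and shift that map affine-invariant depth predictions onto sparse metric ground-truth points by least squares. It is a simplification under stated assumptions: it omits the rotation component the abstract mentions, and the function name and sample values are hypothetical rather than MOMA's actual procedure.

```python
def fit_scale_shift(pred, gt):
    """Closed-form least squares for scale s and shift t in gt ~ s * pred + t.

    pred: affine-invariant depths at the sparse calibration points
    gt:   metric ground-truth depths at the same points
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predicted depths must not all be equal")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    scale = cov / var_p
    shift = mean_g - scale * mean_p
    return scale, shift


# Hypothetical calibration data: metric depth is 2 * prediction + 0.5 metres.
pred_depths = [0.1, 0.5, 0.9, 0.3]
metric_depths = [2.0 * p + 0.5 for p in pred_depths]
scale, shift = fit_scale_shift(pred_depths, metric_depths)
aligned = [scale * p + shift for p in pred_depths]
```

Once the two parameters are estimated during calibration, they can be applied to every later prediction from the same camera setup without retraining the depth model.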

[148] Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation cs.RO | cs.CVPDF

Jianglong Ye, Keyi Wang, Chengjing Yuan, Ruihan Yang, Yiquan Li

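TL;DR: Dex1B is a large-scale, diverse dataset of dexterous hand manipulation demonstrations produced with a generative model. It contains one billion demonstrations for grasping and articulation tasks. The generative model incorporates geometric constraints to improve feasibility and additional conditions to increase diversity.

Details

Motivation: Generating large-scale demonstrations for dexterous hand manipulation remains challenging, and generative models offer a promising way to address it.

Result: The model significantly outperforms existing methods on simulation benchmarks, and real-world robot experiments demonstrate its robustness.

Insight: Combining geometric constraints with diversity conditioning markedly improves the quality and usefulness of generated demonstrations, providing new data for dexterous manipulation tasks.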

Abstract: Generating large-scale demonstrations for dexterous hand manipulation remains challenging, and several approaches have been proposed in recent years to address this. Among them, generative models have emerged as a promising paradigm, enabling the efficient creation of diverse and physically plausible demonstrations. In this paper, we introduce Dex1B, a large-scale, diverse, and high-quality demonstration dataset produced with generative models. The dataset contains one billion demonstrations for two fundamental tasks: grasping and articulation. To construct it, we propose a generative model that integrates geometric constraints to improve feasibility and applies additional conditions to enhance diversity. We validate the model on both established and newly introduced simulation benchmarks, where it significantly outperforms prior state-of-the-art methods. Furthermore, we demonstrate its effectiveness and robustness through real-world robot experiments. Our project page is at https://jianglongye.com/dex1b

cs.AI [Back]

[149] OAgents: An Empirical Study of Building Effective Agents cs.AI | cs.CLPDF

He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan

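TL;DR: A systematic empirical study of the lack of standardization and scientific rigor in current Agentic AI research; it proposes a more robust evaluation protocol and open-sources the high-performing OAgents framework.

Details

Motivation: Agentic AI research lacks standardized evaluation, which makes fair comparison between methods difficult; a systematic study of how design choices affect agent performance is needed.

Result: OAgents achieves state-of-the-art performance among open-source projects, and its modular design supports future research.

Insight: Some seemingly sensible design choices turn out to be redundant, while certain key components are crucial to agent performance.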

Abstract: Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on the GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.

[150] SLR: An Automated Synthesis Framework for Scalable Logical Reasoning cs.AI | cs.CL | cs.LGPDF

Lukas Helff, Ahmad Omar, Felix Friedrich, Wolfgang Stammer, Antonia Wüst

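TL;DR: SLR is an automated framework for evaluating and training large language models (LLMs). It systematically synthesizes and verifies scalable logical reasoning tasks, and training on them markedly improves models' logical reasoning.

Details

Motivation: Current LLMs are unreliable on logical reasoning tasks and struggle with complex inference. SLR aims to provide a tool that generates logical reasoning tasks automatically, without human annotation, for systematic evaluation and training.

Result: Experiments show that contemporary LLMs produce syntactically valid rules but often fail at correct logical inference. After logic-tuning, Llama-3-8B doubles its accuracy on SLR-Bench, matching Gemini-Flash-Thinking at a fraction of the computational cost.

Insight: Logical reasoning is a weak point of LLMs. SLR offers a scalable, automated way to improve it and gives future research an efficient tool and benchmark.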

Abstract: We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR synthesizes (i) a latent ground-truth rule, (ii) an executable validation program used by a symbolic judge to deterministically verify model outputs, and (iii) an instruction prompt for the reasoning task. Using SLR, we create SLR-Bench, a benchmark comprising over 19k prompts spanning 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs do somewhat better, but incur substantial increases in test-time compute, sometimes exceeding 15k completion tokens. Finally, logic-tuning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and offers a scalable environment for probing and advancing LLMs’ reasoning capabilities.
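
The role of the symbolic judge can be illustrated with a toy sketch: a candidate rule is executed against labeled examples and scored deterministically. This is a Python stand-in under stated assumptions, not SLR's actual validation program; the train-classification task, the example data, and all names are hypothetical.

```python
def judge(candidate_rule, labeled_examples):
    """Deterministically score a candidate rule against labeled examples.

    Returns (accuracy, solved), where solved is True only if every
    example is classified correctly.
    """
    correct = sum(
        1 for features, label in labeled_examples
        if bool(candidate_rule(features)) == label
    )
    accuracy = correct / len(labeled_examples)
    return accuracy, correct == len(labeled_examples)


# Hypothetical latent rule: a train is positive iff it has a short car
# carrying a red load.
examples = [
    ({"cars": [{"length": "short", "load": "red"}]}, True),
    ({"cars": [{"length": "long", "load": "red"}]}, False),
    ({"cars": [{"length": "short", "load": "blue"},
               {"length": "short", "load": "red"}]}, True),
    ({"cars": [{"length": "short", "load": "blue"}]}, False),
]


def good_rule(train):
    """Candidate that matches the latent rule exactly."""
    return any(c["length"] == "short" and c["load"] == "red" for c in train["cars"])


def weak_rule(train):
    """Syntactically valid candidate that ignores car length."""
    return any(c["load"] == "red" for c in train["cars"])


good_score = judge(good_rule, examples)
weak_score = judge(weak_rule, examples)
```

The weak rule is well formed yet wrong on one example, which mirrors the abstract's observation that models often produce syntactically valid rules that fail at correct inference.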

[151] Bayesian Epistemology with Weighted Authority: A Formal Architecture for Truth-Promoting Autonomous Scientific Reasoning cs.AI | cs.CL | cs.DB | cs.LO | math.LO | 68T27, 03B70, 68P20 | I.2.3; F.4.1; H.2.8PDF

Craig S. Wright

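TL;DR: The paper proposes BEWA, a Bayesian epistemology architecture that models scientific claims dynamically through a weighted-authority mechanism combining replication scores, citation weighting, and temporal decay, with the aim of promoting truth utility and rational belief convergence in machine reasoning systems.

Details

Motivation: The explosive growth of scientific literature has outpaced the epistemic processing capacity of human experts and existing AI systems, calling for a formal architecture that can dynamically model, verify, and update scientific claims.

Result: The paper lays a foundation for machine reasoning systems that support truth utility and belief convergence, with audit-resilient integrity across dynamic scientific domains.

Insight: By formalizing scientific reasoning as a computationally verifiable epistemic network, BEWA offers a new approach to coping with the growth and continual updating of scientific literature.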

Abstract: The exponential expansion of scientific literature has surpassed the epistemic processing capabilities of both human experts and current artificial intelligence systems. This paper introduces Bayesian Epistemology with Weighted Authority (BEWA), a formally structured architecture that operationalises belief as a dynamic, probabilistically coherent function over structured scientific claims. Each claim is contextualised, author-attributed, and evaluated through a system of replication scores, citation weighting, and temporal decay. Belief updates are performed via evidence-conditioned Bayesian inference, contradiction processing, and epistemic decay mechanisms. The architecture supports graph-based claim propagation, authorial credibility modelling, cryptographic anchoring, and zero-knowledge audit verification. By formalising scientific reasoning into a computationally verifiable epistemic network, BEWA advances the foundation for machine reasoning systems that promote truth utility, rational belief convergence, and audit-resilient integrity across dynamic scientific domains.

[152] IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks cs.AI | cs.CL | cs.CV | cs.LG | cs.ROPDF

Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang

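TL;DR: IS-Bench is the first benchmark for evaluating the interactive safety of embodied agents driven by vision-language models (VLMs) in household tasks. Its multimodal scenarios and process-oriented evaluation expose the shortcomings of current models.

Details

Motivation: Existing static evaluation methods cannot simulate the dynamic safety risks that arise as an embodied agent acts in an interactive environment, leaving safety hazards undetected before real-world deployment.

Result: Experiments show that current VLM agents lack interactive safety awareness. Safety-aware Chain-of-Thought can improve safety performance but often compromises task completion.

Insight: Interactive safety evaluation is key to the reliable deployment of embodied agents; future work needs to balance safety against task efficiency.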

Abstract: Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
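
A process-oriented check of the kind described above can be sketched as an ordering test over an agent's action trace. This is a minimal illustration under stated assumptions, not the benchmark's evaluator; the function name, the action strings, and the choice to treat a trace without the risky step as trivially safe are all hypothetical.

```python
def mitigation_ordered(trace, risky_step, mitigation, when="before"):
    """Check that `mitigation` happens before (or after) `risky_step`.

    trace: list of action names in execution order.
    Uses the first occurrence of the risky step. If the risky step never
    occurs, there is nothing to mitigate and the check passes.
    """
    if when not in ("before", "after"):
        raise ValueError("when must be 'before' or 'after'")
    if risky_step not in trace:
        return True
    risk_idx = trace.index(risky_step)
    if when == "before":
        return mitigation in trace[:risk_idx]
    return mitigation in trace[risk_idx + 1:]


# Hypothetical household trace.
trace = ["pick_up_pan", "turn_on_stove", "cook_food", "turn_off_stove"]
stove_turned_off = mitigation_ordered(trace, "cook_food", "turn_off_stove", "after")
stove_cleared = mitigation_ordered(
    trace, "turn_on_stove", "clear_flammables_from_stove", "before"
)
```

Checking the order of intermediate steps, rather than only the final state, is what lets such an evaluation flag a run that finishes the task but skipped a required precaution.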

[153] The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models cs.AI | cs.CL | cs.CR | cs.CV | cs.LGPDF

Peiyuan Tang, Haojie Xin, Xiaodong Zhang, Jun Sun, Qin Xia

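TL;DR: The paper proposes “The Safety Reminder”, a soft prompt tuning method that periodically injects optimized, learnable prompt tokens during text generation to strengthen the safety of vision-language models (VLMs) and prevent harmful content generation.

Details

Motivation: As VLMs become more capable in real-world applications such as code generation and chatbot assistance, ensuring their safety is critical. Their multimodal nature exposes unique vulnerabilities: adversaries can modify visual or textual inputs to bypass safety guardrails and trigger harmful content. A systematic analysis of VLM behavior under attack reveals “delayed safety awareness”: a safety-aligned VLM may initially be induced to produce harmful content but eventually recognizes the risk and attempts to self-correct.

Result: Across three safety benchmarks and one adversarial attack, “The Safety Reminder” significantly reduces attack success rates while preserving model utility.

Insight: VLMs retain latent safety awareness, but its activation is delayed. This suggests a new way to design safety mechanisms: proactively reactivating that awareness with prompts.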

Abstract: As Vision-Language Models (VLMs) demonstrate increasing capabilities across real-world applications such as code generation and chatbot assistance, ensuring their safety has become paramount. Unlike traditional Large Language Models (LLMs), VLMs face unique vulnerabilities due to their multimodal nature, allowing adversaries to modify visual or textual inputs to bypass safety guardrails and trigger the generation of harmful content. Through systematic analysis of VLM behavior under attack, we identify a novel phenomenon termed “delayed safety awareness”. Specifically, we observe that safety-aligned VLMs may initially be compromised to produce harmful content, but eventually recognize the associated risks and attempt to self-correct. This pattern suggests that VLMs retain their underlying safety awareness but experience a temporal delay in its activation. Building on this insight, we hypothesize that VLMs’ safety awareness can be proactively reactivated through carefully designed prompts. To this end, we introduce “The Safety Reminder”, a soft prompt tuning approach that optimizes learnable prompt tokens, which are periodically injected during the text generation process to enhance safety awareness, effectively preventing harmful content generation. Additionally, our safety reminder only activates when harmful content is detected, leaving normal conversations unaffected and preserving the model’s performance on benign tasks. Through comprehensive evaluation across three established safety benchmarks and one adversarial attack, we demonstrate that our approach significantly reduces attack success rates while maintaining model utility, offering a practical solution for deploying safer VLMs in real-world applications.

cs.GR [Back]

[154] VEIGAR: View-consistent Explicit Inpainting and Geometry Alignment for 3D object Removal cs.GR | cs.AI | cs.CV | eess.IVPDF

Pham Khai Nguyen Do, Bao Nguyen Tran, Nam Nguyen, Duc Dung Nguyen

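TL;DR: VEIGAR is an efficient framework for 3D object removal that uses explicit geometry alignment and view-consistent inpainting, requires no initial reconstruction phase, and markedly improves reconstruction quality and cross-view consistency.

Details

Motivation: Current 3D object removal methods usually require an initial reconstruction phase, which adds considerable computational overhead while still yielding suboptimal reconstruction quality. VEIGAR aims to provide an efficient framework that needs no initial reconstruction.

Result: VEIGAR achieves the best reconstruction quality and cross-view consistency, with a threefold reduction in training time compared to the fastest existing method.

Insight: Combining explicit geometry alignment with a scale-invariant loss can markedly improve both efficiency and quality, suggesting a new direction for 3D editing tasks.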

Abstract: Recent advances in Novel View Synthesis (NVS) and 3D generation have significantly improved editing tasks, with a primary emphasis on maintaining cross-view consistency throughout the generative process. Contemporary methods typically address this challenge using a dual-strategy framework: performing consistent 2D inpainting across all views guided by embedded priors either explicitly in pixel space or implicitly in latent space; and conducting 3D reconstruction with additional consistency guidance. Previous strategies, in particular, often require an initial 3D reconstruction phase to establish geometric structure, introducing considerable computational overhead. Even with the added cost, the resulting reconstruction quality often remains suboptimal. In this paper, we present VEIGAR, a computationally efficient framework that outperforms existing methods without relying on an initial reconstruction phase. VEIGAR leverages a lightweight foundation model to reliably align priors explicitly in the pixel space. In addition, we introduce a novel supervision strategy based on scale-invariant depth loss, which removes the need for traditional scale-and-shift operations in monocular depth regularization. Through extensive experimentation, VEIGAR establishes a new state-of-the-art benchmark in reconstruction quality and cross-view consistency, while achieving a threefold reduction in training time compared to the fastest existing method, highlighting its superior balance of efficiency and effectiveness.
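
The abstract does not give the loss formula, so the following is only a sketch of one widely used scale-invariant depth loss, the log-space formulation commonly attributed to Eigen et al. (2014); VEIGAR's actual loss may differ. With `lam=1.0` the value is unchanged when the prediction is multiplied by any positive constant, which is why no separate scale alignment is needed.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Scale-invariant loss in log-depth space.

    With d_i = log(pred_i) - log(target_i), the loss is
        mean(d_i^2) - lam * (mean(d_i))^2.
    All depths must be strictly positive.
    """
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    return sum(d * d for d in diffs) / n - lam * (sum(diffs) ** 2) / (n * n)


# Hypothetical depths: rescaling the prediction leaves the loss unchanged.
pred = [1.0, 2.0, 4.0]
target = [1.5, 2.5, 3.0]
loss_original = scale_invariant_log_loss(pred, target)
loss_rescaled = scale_invariant_log_loss([3.0 * p for p in pred], target)

# A prediction that is exactly a scaled copy of the target has zero loss.
loss_scaled_copy = scale_invariant_log_loss([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Setting `lam` below 1 reintroduces a partial penalty on global scale error, which some depth-estimation setups prefer; that choice is an assumption left to the reader.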

cs.IR [Back]

[155] MoR: Better Handling Diverse Queries with a Mixture of Sparse, Dense, and Human Retrievers cs.IR | cs.AI | cs.CLPDF

Jushaan Singh Kalra, Xinran Zhao, To Eun Kim, Fengyu Cai, Fernando Diaz

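TL;DR: The paper proposes Mixture of Retrievers (MoR), which dynamically selects and combines multiple heterogeneous retrievers to improve retrieval-augmented generation (RAG) without manual selection. Experiments show that MoR beats single retrievers in both effectiveness and efficiency and can incorporate human information sources.

Details

Motivation: Existing RAG methods usually fix a single retriever, which cannot adapt to diverse queries. Different retrievers (such as BM25 and dense retrievers) provide complementary signals, but combining them dynamically remains a challenge.

Result: With only 0.8B parameters in total, MoR outperforms every individual retriever by 10.8% and larger 7B models by 3.9% on average. When human information sources are incorporated as retrievers, it achieves a 58.9% relative improvement over simulated humans alone.

Insight: Dynamically combining heterogeneous retrievers markedly improves retrieval performance, and the lightweight design is efficient and practical. MoR also shows the potential of human expertise in retrieval systems.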

Abstract: Retrieval-augmented Generation (RAG) is powerful, but its effectiveness hinges on which retrievers we use and how. Different retrievers offer distinct, often complementary signals: BM25 captures lexical matches; dense retrievers, semantic similarity. Yet in practice, we typically fix a single retriever based on heuristics, which fails to generalize across diverse information needs. Can we dynamically select and integrate multiple retrievers for each individual query, without the need for manual selection? In our work, we validate this intuition with quantitative analysis and introduce mixture of retrievers: a zero-shot, weighted combination of heterogeneous retrievers. Extensive experiments show that such mixtures are effective and efficient: Despite totaling just 0.8B parameters, this mixture outperforms every individual retriever and even larger 7B models by +10.8% and +3.9% on average, respectively. Further analysis also shows that this mixture framework can help incorporate specialized non-oracle human information sources as retrievers to achieve good collaboration, with a 58.9% relative performance improvement over simulated humans alone.
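
A weighted combination of heterogeneous retrievers can be sketched as follows. This is a minimal illustration under stated assumptions: scores are min-max normalized per retriever so that BM25 and dense scores are comparable, and the per-query weights are simply passed in. How MoR derives its weights zero-shot is not described in the abstract, so the weights, names, and scores here are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one retriever's scores to [0, 1]; constant scores map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def mix_retrievers(per_retriever_scores, weights):
    """Rank documents by the weighted sum of normalized retriever scores.

    per_retriever_scores: {retriever_name: {doc_id: raw_score}}
    weights:              {retriever_name: weight for this query}
    """
    mixed = {}
    for name, scores in per_retriever_scores.items():
        if not scores:
            continue
        for doc, s in min_max_normalize(scores).items():
            mixed[doc] = mixed.get(doc, 0.0) + weights[name] * s
    return sorted(mixed, key=mixed.get, reverse=True)


# Hypothetical raw scores from a lexical and a dense retriever.
scores = {
    "bm25": {"d1": 12.0, "d2": 3.0, "d3": 0.0},
    "dense": {"d1": 0.2, "d2": 0.9, "d3": 0.1},
}
semantic_query_ranking = mix_retrievers(scores, {"bm25": 0.3, "dense": 0.7})
lexical_query_ranking = mix_retrievers(scores, {"bm25": 0.9, "dense": 0.1})
```

The two rankings differ only because the weights differ, which is the point of choosing weights per query instead of fixing one retriever in advance.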

q-bio.QM [Back]

[156] Smartphone-integrated RPA-CRISPR-Cas12a Detection System with Microneedle Sampling for Point-of-Care Diagnosis of Potato Late Blight in Early Stage q-bio.QM | cs.CV | q-bio.BMPDF

Jiangnan Zhao, Hanbo Xu, Cifu Xu, Wenlong Yin, Laixin Luo

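TL;DR: Develops a portable diagnostic system based on RPA-CRISPR-Cas12a, combined with a smartphone and microneedle sampling, for early on-site detection of potato late blight.

Details

Motivation: Potato late blight is highly destructive to crops, and conventional detection methods depend on complex equipment and laboratory conditions, so they cannot meet the need for rapid diagnosis in the field.

Result: The detection limit is 2 pg/µL, comparable to benchtop laboratory equipment; detection rates reach approximately 80% and 100% on the third and fourth day post-inoculation, before symptoms are visible.

Insight: The system offers an efficient, portable solution for early on-site detection of plant diseases and could be extended to rapid diagnosis of other plant diseases.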

Abstract: Potato late blight, caused by the oomycete pathogen Phytophthora infestans, is one of the most devastating potato diseases in history. Although conventional plant disease detection methods such as PCR and LAMP are highly sensitive and specific, they rely on bulky, expensive laboratory equipment and involve complex operations, making them impractical for point-of-care diagnosis in the field. In this study, we report a portable RPA-CRISPR-based diagnosis system for plant disease that integrates a smartphone for acquisition and analysis of fluorescent images. A polyvinyl alcohol (PVA) microneedle patch was used to extract samples from plant leaves within one minute. The DNA extraction efficiency reached 56 µg/mg, approximately 3 times that of the traditional CTAB method (18 µg/mg). An RPA-CRISPR-Cas12a isothermal assay was established to specifically target P. infestans, with no cross-reactivity observed against closely related species (P. sojae, P. capsici). The system demonstrated a detection limit of 2 pg/µL for P. infestans genomic DNA, offering sensitivity comparable to that of benchtop laboratory equipment. It also demonstrated early-stage diagnostic capability, achieving detection rates of approximately 80% and 100% on the third and fourth day post-inoculation, respectively, before visible symptoms appeared on the leaves. The smartphone-based “sample-to-result” system removes the reliance of traditional methods on specialized equipment, offering a promising route to early-stage plant disease detection and control in the field.

physics.ins-det [Back]

[157] Bias Variation Compensation in Perimeter-Gated SPAD TRNGs physics.ins-det | cs.AR | cs.CR | cs.CVPDF

Md Sakibur Sajal, Hunter Guthrie, Marc Dandin

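TL;DR: Presents a bias variation compensation technique for true random number generators (TRNGs) based on pgSPAD arrays. Tuning the gate voltages substantially reduces bias variation, and bits debiased with the classical Von Neumann algorithm pass the NIST statistical tests.

Details

Motivation: TRNGs built on arrays of entropy sources typically suffer from bias variation (BV), and existing debiasing methods cannot accommodate a wide BV range, so a compensation technique is needed.

Result: At room temperature and a raw-bit generation rate of 2 kHz per pixel, bias variation is below 1%, and the debiased bits pass all 16 tests of the NIST Statistical Test Suite.

Insight: Combining hardware compensation with a classical debiasing algorithm can effectively improve TRNG performance, offering a viable solution for practical applications.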

Abstract: Random number generators that utilize arrays of entropy source elements suffer from bias variation (BV). Despite the availability of efficient debiasing algorithms, optimized implementations of hardware-friendly options depend on the bit bias in the raw bit streams and cannot accommodate a wide BV. In this work, we present a 64 × 64 array of perimeter-gated single-photon avalanche diodes (pgSPADs), fabricated in a 0.35 µm standard CMOS technology, as a source of entropy to generate random binary strings with a BV compensation technique. By applying proper gate voltages based on the devices’ native dark count rates, we demonstrate less than 1% BV for a raw-bit generation rate of 2 kHz/pixel at room temperature. The raw bits were debiased using the classical iterative Von Neumann algorithm, and the debiased bits were found to pass all 16 tests from NIST’s Statistical Test Suite.
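
For readers unfamiliar with the debiasing step, the basic Von Neumann extractor can be sketched in a few lines. This shows a single pass only; the iterative variant named in the abstract additionally recycles information from discarded pairs and is not reproduced here. The function name and the sample bit string are illustrative.

```python
def von_neumann_debias(bits):
    """Single-pass Von Neumann extractor.

    Reads non-overlapping pairs of raw bits: (0, 1) emits 0, (1, 0) emits 1,
    and equal pairs (0, 0) or (1, 1) are discarded. A trailing unpaired bit
    is dropped. For independent bits with a fixed bias, the output is unbiased.
    """
    out = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)
    return out


raw = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
debiased = von_neumann_debias(raw)
```

The extractor discards a larger share of pairs as the raw bias grows, which is one reason keeping per-pixel bias variation small at the hardware level matters for throughput.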

cs.CG [Back]

[158] Wavelet-based Global Orientation and Surface Reconstruction for Point Clouds cs.CG | cs.CVPDF

Yueji Ma, Yanzun Meng, Dong Xiao, Zuoqiang Shi, Bin Wang

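TL;DR: Proposes a wavelet-based method that represents the mollified indicator function and performs both global orientation and surface reconstruction for point clouds. A modified kernel function and the properties of wavelet basis functions accelerate computation, markedly improving quality and efficiency on sparse point clouds.

Details

Motivation: Unoriented surface reconstruction is important in computer graphics, but existing methods such as classic wavelet surface reconstruction only handle oriented point clouds. Despite attempts at improvement (such as iWSR), results on sparse point clouds remain poor. The paper addresses this shortcoming.

Result: Experiments show that the method achieves state-of-the-art performance in orientation and reconstruction of sparse models and runs efficiently on CPU.

Insight: Combining the properties of wavelet basis functions with an optimized kernel function markedly improves orientation and reconstruction of sparse point clouds, offering a new approach to processing unoriented point clouds.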

Abstract: Unoriented surface reconstruction is an important task in computer graphics and has extensive applications. Exploiting the compact support and orthogonality of wavelets, classic wavelet surface reconstruction achieves good and fast reconstruction. However, this method can only handle oriented points. Despite some attempts to extend it to unoriented points, such as iWSR, these methods perform poorly on sparse point clouds. To address these shortcomings, we propose a wavelet-based method to represent the mollified indicator function and complete both the orientation and surface reconstruction tasks. We use the modifying kernel function to smooth out discontinuities on the surface, aligning with the continuity of the wavelet basis function. During the calculation of coefficients, we fully utilize the properties of the convolutional kernel function to shift the modifying computation onto the wavelet basis, which accelerates it. In addition, we propose a novel method for constructing divergence-free function fields and use them to construct additional homogeneous constraints that improve effectiveness and stability. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both orientation and reconstruction for sparse models. We align the matrix construction with the compact support property of wavelet basis functions to further accelerate our method, resulting in efficient performance on CPU. Our source code will be released on GitHub.

eess.IV [Back]

[159] InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding eess.IV | cs.LGPDF

Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang

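TL;DR: InfiniPot-V is a training-free, query-agnostic KV cache compression framework for streaming video understanding. It enforces a hard memory cap, greatly reducing GPU memory use while sustaining real-time generation.

Details

Motivation: When modern multimodal large language models (MLLMs) process long videos, the KV cache grows linearly and exceeds the fixed memory of mobile and edge devices; existing compression methods do not fully solve this problem.

Result: Across four MLLMs and six benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy.

Insight: By balancing redundancy against semantic significance during compression, InfiniPot-V offers a practical route to on-device streaming video understanding.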

Abstract: Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
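
The value-norm ranking idea can be illustrated with a small sketch that keeps the cache entries whose value vectors have the largest L2 norm once a budget is exceeded. It is a simplification under stated assumptions: the Temporal-axis Redundancy step is omitted, per-layer and per-head handling is ignored, and the function name, budget, and toy vectors are hypothetical rather than InfiniPot-V's implementation.

```python
import math


def compress_by_value_norm(keys, values, budget):
    """Keep at most `budget` cache entries, ranked by value-vector L2 norm.

    keys, values: parallel lists, one entry per cached token.
    Surviving entries are returned in their original temporal order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if len(values) <= budget:
        return list(keys), list(values)
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    top = sorted(range(len(values)), key=lambda i: norms[i], reverse=True)[:budget]
    keep = sorted(top)  # restore temporal order
    return [keys[i] for i in keep], [values[i] for i in keep]


# Toy cache with four tokens and a budget of two.
keys = ["k0", "k1", "k2", "k3"]
values = [[0.1, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
kept_keys, kept_values = compress_by_value_norm(keys, values, budget=2)
```

Running such a pass whenever the cache reaches the threshold bounds memory by the budget regardless of how long the stream runs, which is the length-independent cap the abstract describes.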

[160] Diffusion-based Counterfactual Augmentation: Towards Robust and Interpretable Knee Osteoarthritis Grading eess.IV | cs.CVPDF

Zhe Wang, Yuhua Ru, Aladine Chetouani, Tina Shiang, Fang Chen

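TL;DR: Proposes Diffusion-based Counterfactual Augmentation (DCA), which generates targeted counterfactual examples to improve the robustness and interpretability of deep learning models for knee osteoarthritis grading.

Details

Motivation: Knee osteoarthritis (KOA) grading suffers from inter-observer variability and from limited model robustness near critical decision boundaries, so a method that improves both performance and interpretability is needed.

Result: Experiments on the OAI and MOST datasets show that the method significantly improves classification accuracy and that the learned latent space structure aligns with clinical knowledge.

Insight: Model uncertainty can be converted into a robust training signal, and counterfactual augmentation helps reveal the pathological features behind model decisions, increasing the trustworthiness of automated diagnostic systems.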

Abstract: Automated grading of Knee Osteoarthritis (KOA) from radiographs is challenged by significant inter-observer variability and the limited robustness of deep learning models, particularly near critical decision boundaries. To address these limitations, this paper proposes a novel framework, Diffusion-based Counterfactual Augmentation (DCA), which enhances model robustness and interpretability by generating targeted counterfactual examples. The method navigates the latent space of a diffusion model using a Stochastic Differential Equation (SDE), governed by balancing a classifier-informed boundary drive with a manifold constraint. The resulting counterfactuals are then used within a self-corrective learning strategy to improve the classifier by focusing on its specific areas of uncertainty. Extensive experiments on the public Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) datasets demonstrate that this approach significantly improves classification accuracy across multiple model architectures. Furthermore, the method provides interpretability by visualizing minimal pathological changes and revealing that the learned latent space topology aligns with clinical knowledge of KOA progression. The DCA framework effectively converts model uncertainty into a robust training signal, offering a promising pathway to developing more accurate and trustworthy automated diagnostic systems. Our code is available at https://github.com/ZWang78/DCA.

[161] MoNetV2: Enhanced Motion Network for Freehand 3D Ultrasound Reconstruction eess.IV | cs.AI | cs.CVPDF

Mingyuan Luo, Xin Yang, Zhongnuo Yan, Yan Cao, Yuanji Zhang

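TL;DR: MoNetV2 is an enhanced motion network that fuses image and motion information, applies multi-level consistency constraints, and uses a multi-modal self-supervised strategy to markedly improve the accuracy and generalizability of freehand 3D ultrasound reconstruction.

Details

Motivation: Despite progress in deep-learning-based freehand 3D ultrasound reconstruction, image-only reconstruction struggles to reduce cumulative drift, especially with complex motion trajectories. A more robust and general solution is needed.

Result: Experiments on three large datasets show that MoNetV2 surpasses existing methods in both reconstruction quality and generalizability.

Insight: Fusing multiple information sources (such as images and motion) and applying hierarchical consistency constraints effectively improves the accuracy and robustness of 3D ultrasound reconstruction.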

Abstract: Three-dimensional (3D) ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures, playing a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3D US has made significant advancements. It reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction poses difficulties in reducing cumulative drift and further improving reconstruction accuracy, particularly in scenarios involving complex motion trajectories. In this context, we propose an enhanced motion network (MoNetV2) to improve the accuracy and generalizability of reconstruction under diverse scanning velocities and tactics. First, we propose a sensor-based temporal and multi-branch structure that fuses image and motion information from a velocity perspective to improve image-only reconstruction accuracy. Second, we devise an online multi-level consistency constraint that exploits the inherent consistency of scans to handle various scanning velocities and tactics. This constraint exploits scan-level velocity consistency, path-level appearance consistency, and patch-level motion consistency to supervise inter-frame transformation estimation. Third, we distill an online multi-modal self-supervised strategy that leverages the correlation between network estimation and motion information to further reduce cumulative errors. Extensive experiments clearly demonstrate that MoNetV2 surpasses existing methods in both reconstruction quality and generalizability performance across three large datasets.

[162] Enhanced Dermatology Image Quality Assessment via Cross-Domain Training eess.IV | cs.CV

Ignacio Hernández Montilla, Alfonso Medela, Paola Pasquali, Andy Aguilar, Taig Mac Carthy

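TL;DR: The paper proposes cross-domain training to improve dermatology image quality assessment (IQA), jointly training on dermatology and non-dermatology datasets to address the small scale of dermatology IQA data.

Details

Motivation: Poor image quality is a major problem in teledermatology, yet existing dermatology IQA research is limited by small datasets and does not take full advantage of recent advances in non-dermatology IQA.

Result: Cross-domain training improved model performance, overcame the small-data limitation of dermatology IQA, and enabled better management of image quality in teledermatology.

Insight: Cross-domain training can compensate for scarce domain-specific data while using more diverse data to improve model generalization.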

Abstract: Teledermatology has become a widely accepted communication method in daily clinical practice, enabling remote care while showing strong agreement with in-person visits. Poor image quality remains an unsolved problem in teledermatology and is a major concern to practitioners, as bad-quality images reduce the usefulness of the remote consultation process. However, research on Image Quality Assessment (IQA) in dermatology is sparse, and does not leverage the latest advances in non-dermatology IQA, such as using larger image databases with ratings from large groups of human observers. In this work, we propose cross-domain training of IQA models, combining dermatology and non-dermatology IQA datasets. For this purpose, we created a novel dermatology IQA database, Legit.Health-DIQA-Artificial, using dermatology images from several sources and having them annotated by a group of human observers. We demonstrate that cross-domain training yields optimal performance across domains and overcomes one of the biggest limitations in dermatology IQA, which is the small scale of data, and leads to models trained on a larger pool of image distortions, resulting in a better management of image quality in the teledermatology process.

[163] CF-Seg: Counterfactuals meet Segmentation eess.IV | cs.AI | cs.CV

Raghav Mehta, Fabio De Sousa Ribeiro, Tian Xia, Melanie Roschewitz, Ainkaran Santhirasekaram

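TL;DR: The paper proposes using counterfactual (CF) images to improve medical image segmentation: simulating how the anatomy would appear in a healthy state improves the accuracy of the segmentation model.

Details

Motivation: Disease in medical images can change the appearance of surrounding healthy tissue, which makes segmentation harder and can lead to misdiagnosis. A method is needed that improves segmentation without changing the model.

Result: Experiments on two real-world chest X-ray datasets show that using counterfactual images improves anatomical segmentation.

Insight: Counterfactual images can effectively address disease interference in segmentation, providing more accurate support for downstream clinical decision-making.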

Abstract: Segmenting anatomical structures in medical images plays an important role in the quantitative assessment of various diseases. However, accurate segmentation becomes significantly more challenging in the presence of disease. Disease patterns can alter the appearance of surrounding healthy tissues, introduce ambiguous boundaries, or even obscure critical anatomical structures. As such, segmentation models trained on real-world datasets may struggle to provide good anatomical segmentation, leading to potential misdiagnosis. In this paper, we generate counterfactual (CF) images to simulate how the same anatomy would appear in the absence of disease without altering the underlying structure. We then use these CF images to segment structures of interest, without requiring any changes to the underlying segmentation model. Our experiments on two real-world clinical chest X-ray datasets show that the use of counterfactual images improves anatomical segmentation, thereby aiding downstream clinical decision-making.

[164] Hybrid Attention Network for Accurate Breast Tumor Segmentation in Ultrasound Images eess.IV | cs.AI | cs.CV

Muhammad Azeem Aslam, Asim Naveed, Nisar Ahmed

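TL;DR: The paper proposes a hybrid attention-based network for accurate breast tumor segmentation in ultrasound images, combining global spatial attention, position encoding, and a feature enhancement module; it outperforms existing methods.

Details

Motivation: Breast ultrasound images suffer from noise, varying lesion scale, and fuzzy boundaries, which makes accurate segmentation difficult for conventional methods.

Result: The method outperforms existing approaches on public datasets, showing its potential to assist early breast cancer diagnosis.

Insight: Global spatial attention and the feature enhancement module help capture tumor regions, and the hybrid loss function mitigates class imbalance.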

Abstract: Breast ultrasound imaging is a valuable tool for early breast cancer detection, but automated tumor segmentation is challenging due to inherent noise, variations in scale of lesions, and fuzzy boundaries. To address these challenges, we propose a novel hybrid attention-based network for lesion segmentation. Our proposed architecture integrates a pre-trained DenseNet121 in the encoder part for robust feature extraction with a multi-branch attention-enhanced decoder tailored for breast ultrasound images. The bottleneck incorporates Global Spatial Attention (GSA), Position Encoding (PE), and Scaled Dot-Product Attention (SDPA) to learn global context, spatial relationships, and relative positional features. The Spatial Feature Enhancement Block (SFEB) is embedded at skip connections to refine and enhance spatial features, enabling the network to focus more effectively on tumor regions. A hybrid loss function combining Binary Cross-Entropy (BCE) and Jaccard Index loss optimizes both pixel-level accuracy and region-level overlap metrics, enhancing robustness to class imbalance and irregular tumor shapes. Experiments on public datasets demonstrate that our method outperforms existing approaches, highlighting its potential to assist radiologists in early and accurate breast cancer diagnosis.
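The hybrid loss mentioned in the abstract combines a pixel-level term (BCE) with a region-overlap term (Jaccard). The following is a minimal pure-Python sketch of that general idea on flattened masks. The equal weighting `alpha=0.5`, the smoothing constant, and the function names are assumptions for illustration; the abstract does not state how the authors weight or implement the two terms.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def soft_jaccard_loss(pred, target, smooth=1.0):
    """1 - soft IoU, computed on probabilities so it stays differentiable in spirit."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return 1.0 - (inter + smooth) / (union + smooth)

def hybrid_loss(pred, target, alpha=0.5):
    """Weighted sum of the two terms; alpha is an assumed hyperparameter."""
    return alpha * bce_loss(pred, target) + (1 - alpha) * soft_jaccard_loss(pred, target)

# Hypothetical 8-pixel mask where the tumor covers only 2 pixels (class imbalance).
target = [0, 0, 0, 0, 0, 0, 1, 1]
good = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.8]  # finds the tumor
bad = [0.1] * 8                                   # predicts background everywhere

print(f"good prediction: {hybrid_loss(good, target):.3f}")
print(f"all-background prediction: {hybrid_loss(bad, target):.3f}")
```

The overlap term penalizes the all-background prediction heavily even though most of its pixels are correct, which is the usual motivation for pairing it with BCE when lesions occupy a small part of the image.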

[165] Overfitting in Histopathology Model Training: The Need for Customized Architectures eess.IV | cs.CV

Saghir Alfasly, Ghazal Alabtah, H. R. Tizhoosh

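TL;DR: This paper examines overfitting in deep learning for histopathology image analysis, showing that directly adopting large models designed for natural images performs poorly and leads to overfitting. It finds that simpler, domain-specific architectures achieve better results.

Details

Motivation: Directly using large natural-image models for histopathology image analysis often causes overfitting and degraded performance, so customized architectures are needed.

Result: On the Oesophageal Adenocarcinomas dataset, customized architectures achieve comparable or better performance with less overfitting.

Insight: Domain-specific design is more effective than general-purpose architectures, especially on small datasets, and overfitting can be mitigated by simplifying the model.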

Abstract: This study investigates the critical problem of overfitting in deep learning models applied to histopathology image analysis. We show that simply adopting and fine-tuning large-scale models designed for natural image analysis often leads to suboptimal performance and significant overfitting when applied to histopathology tasks. Through extensive experiments with various model architectures, including ResNet variants and Vision Transformers (ViT), we show that increasing model capacity does not necessarily improve performance on histopathology datasets. Our findings emphasize the need for customized architectures specifically designed for histopathology image analysis, particularly when working with limited datasets. Using the public Oesophageal Adenocarcinomas dataset, we demonstrate that simpler, domain-specific architectures can achieve comparable or better performance while minimizing overfitting.

[166] Temperature calibration of surface emissivities with an improved thermal image enhancement network eess.IV | cs.CV

Ning Chu, Siya Zheng, Shanqing Zhang, Li Li, Caifang Cai

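TL;DR: The paper proposes a physically guided neural framework that jointly optimizes temperature correction and image enhancement; using a symmetric skip-CNN and an emissivity-aware attention module, it significantly improves temperature accuracy in infrared thermography.

Details

Motivation: In infrared thermography, variations in material emissivity degrade temperature accuracy, and existing methods often neglect the joint optimization of radiometric calibration and image degradation.

Result: In validation under different industrial conditions, the network achieved accurate temperature calibration, significantly reduced emissivity artifacts, and recovered structural details.

Insight: Jointly addressing radiometric calibration and image degradation can significantly improve both the temperature accuracy and the visual quality of infrared thermography.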

Abstract: Infrared thermography faces persistent challenges in temperature accuracy due to material emissivity variations, where existing methods often neglect the joint optimization of radiometric calibration and image degradation. This study introduces a physically guided neural framework that unifies temperature correction and image enhancement through a symmetric skip-CNN architecture and an emissivity-aware attention module. The pre-processing stage segments the ROIs of the image and applies an initial correction to the firing rate. A novel dual-constrained loss function strengthens the statistical consistency between the target and reference regions through mean-variance alignment and histogram matching based on Kullback-Leibler divergence. The method works by dynamically fusing thermal radiation features and spatial context, and the model suppresses emissivity artifacts while recovering structural details. In validation on an industrial blower system under different conditions, the improved network realizes the dynamic fusion of thermal radiation characteristics and spatial background, with accurate calibration results in various industrial conditions.
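The two statistical terms named in the abstract, mean-variance alignment and histogram matching via Kullback-Leibler divergence, can be sketched on plain lists of temperature readings. This is a rough illustration under stated assumptions, not the paper's loss: the bin count, the temperature range, the weights `w_stat` and `w_hist`, and all function names are invented for the example, and the hard histogram used here is not differentiable, so a trainable version would need a soft binning scheme.

```python
import math

def mean_var(values):
    """Population mean and variance of a list of readings."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def histogram(values, bins, lo, hi, eps=1e-6):
    """Normalized histogram with a small floor so every bin stays positive."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in values:
        idx = int((x - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range readings
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dual_constraint_loss(target, reference, bins=8, lo=0.0, hi=100.0,
                         w_stat=1.0, w_hist=1.0):
    """Assumed combination: squared mean/variance gaps plus histogram KL."""
    mt, vt = mean_var(target)
    mr, vr = mean_var(reference)
    stat = (mt - mr) ** 2 + (vt - vr) ** 2
    hist = kl_divergence(histogram(target, bins, lo, hi),
                         histogram(reference, bins, lo, hi))
    return w_stat * stat + w_hist * hist

# Hypothetical readings (degrees Celsius) from a reference region and a
# target region whose apparent temperature is biased by an emissivity mismatch.
reference = [40.0, 42.0, 41.0, 43.0, 39.0, 41.5]
biased = [x + 12.0 for x in reference]

print(f"aligned regions: {dual_constraint_loss(reference, reference):.4f}")
print(f"biased region: {dual_constraint_loss(biased, reference):.4f}")
```

The moment term catches a global offset or spread mismatch, while the histogram term also reacts to differences in distribution shape that leave the mean and variance unchanged.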

[167] PET Tracer Separation Using Conditional Diffusion Transformer with Multi-latent Space Learning eess.IV | cs.CV

Bin Huang, Feihong Xu, Xinchong Shi, Shan Huang, Binxuan Li

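TL;DR: The paper proposes a multi-latent space guided texture conditional diffusion transformer model (MS-CDT) to address the difficulty of distinguishing signals from multiple tracers in PET imaging. It combines diffusion and transformer architectures and introduces texture masks as conditional inputs to better preserve image details.

Details

Motivation: Single-tracer PET imaging is common in clinical practice, but multi-tracer PET can provide more comprehensive physiological and pathological information. Because the gamma photons produced by different tracers have the same energy, distinguishing their signals is challenging, so new methods are needed for accurate tracer separation.

Result: Experiments on brain and chest 3D PET datasets show that MS-CDT achieves competitive performance compared with other advanced methods in image quality and preservation of clinical information.

Insight: Combining texture masks with multi-latent space learning effectively improves detail preservation; the diffusion transformer framework offers a new approach for medical image analysis and suits complex signal separation tasks.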

Abstract: In clinical practice, single-radiotracer positron emission tomography (PET) is commonly used for imaging. Although multi-tracer PET imaging can provide supplementary information of radiotracers that are sensitive to physiological function changes, enabling a more comprehensive characterization of physiological and pathological states, the gamma-photon pairs generated by positron annihilation reactions of different tracers in PET imaging have the same energy, making it difficult to distinguish the tracer signals. In this study, a multi-latent space guided texture conditional diffusion transformer model (MS-CDT) is proposed for PET tracer separation. To the best of our knowledge, this is the first attempt to use texture condition and multi-latent space for tracer separation in PET imaging. The proposed model integrates diffusion and transformer architectures into a unified optimization framework, with the novel addition of texture masks as conditional inputs to enhance image details. By leveraging multi-latent space prior derived from different tracers, the model captures multi-level feature representations, aiming to balance computational efficiency and detail preservation. The texture masks, serving as conditional guidance, help the model focus on salient structural patterns, thereby improving the extraction and utilization of fine-grained image textures. When combined with the diffusion transformer backbone, this conditioning mechanism contributes to more accurate and robust tracer separation. To evaluate its effectiveness, the proposed MS-CDT is compared with several advanced methods on two types of 3D PET datasets: brain and chest scans. Experimental results indicate that MS-CDT achieved competitive performance in terms of image quality and preservation of clinically relevant information. Code is available at: https://github.com/yqx7150/MS-CDT.

[168] MeDi: Metadata-Guided Diffusion Models for Mitigating Biases in Tumor Classification eess.IV | cs.AI | cs.CV

David Jacob Drexlin, Jonas Dippel, Julius Hense, Niklas Prenißl, Grégoire Montavon

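TL;DR: The paper proposes a metadata-guided diffusion model (MeDi) to mitigate data bias in tumor classification, generating synthetic data for underrepresented subpopulations to improve the robustness and performance of downstream models.

Details

Motivation: Deep learning models perform well on histological prediction tasks, but they lack robustness to variations in staining, scanner, hospital, and demographics, which leads to shortcut learning and biased predictions. Existing foundation models have not fully solved this problem.

Result: Experiments show that MeDi generates high-quality images for unseen subpopulations in the TCGA dataset, improves the fidelity of generated images, and improves downstream classifier performance under subpopulation shift.

Insight: The study shows that a generative framework that explicitly models metadata can effectively mitigate data bias, offering a new approach to clinical adoption challenges in medical imaging with generative models.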

Abstract: Deep learning models have made significant advances in histological prediction tasks in recent years. However, for adaptation in clinical practice, their lack of robustness to varying conditions such as staining, scanner, hospital, and demographics is still a limiting factor: if trained on overrepresented subpopulations, models regularly struggle with less frequent patterns, leading to shortcut learning and biased predictions. Large-scale foundation models have not fully eliminated this issue. Therefore, we propose a novel approach explicitly modeling such metadata into a Metadata-guided generative Diffusion model framework (MeDi). MeDi allows for a targeted augmentation of underrepresented subpopulations with synthetic data, which balances limited training data and mitigates biases in downstream models. We experimentally show that MeDi generates high-quality histopathology images for unseen subpopulations in TCGA, boosts the overall fidelity of the generated images, and enables improvements in performance for downstream classifiers on datasets with subpopulation shifts. Our work is a proof-of-concept towards better mitigating data biases with generative models.
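Targeted augmentation starts from a simple bookkeeping step: count training samples per metadata group and decide how many synthetic samples each group would need to reach balance. The sketch below shows only that planning step. The grouping by (hospital, label), the sample counts, and the `synthetic_quota` helper are hypothetical, and the generation itself (the diffusion model conditioned on metadata) is deliberately left out.

```python
from collections import Counter

def synthetic_quota(metadata, target=None):
    """Return how many synthetic samples each group needs to reach `target`.

    `metadata` holds one hashable group key per training sample. If `target`
    is None, every group is topped up to the size of the largest group.
    """
    counts = Counter(metadata)
    goal = target if target is not None else max(counts.values())
    return {group: max(goal - n, 0) for group, n in counts.items()}

# Hypothetical training set: hospital_B tumor slides are heavily underrepresented.
samples = (
    [("hospital_A", "tumor")] * 120
    + [("hospital_A", "normal")] * 110
    + [("hospital_B", "tumor")] * 15
    + [("hospital_B", "normal")] * 95
)

quota = synthetic_quota(samples)
for group, n in sorted(quota.items()):
    print(group, "needs", n, "synthetic samples")
```

A conditional generator would then be asked for `quota[group]` images per group, using the group key as its conditioning metadata.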

cs.SI

Paulina DeVito, Akhil Vallala, Sean Mcmahon, Yaroslav Hinda, Benjamin Thaw

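TL;DR: By analyzing social media data from Reddit, the study reveals how educators and students differ in their views of generative AI (GAI). It proposes a modular LLM-based framework for sentiment analysis, topic modeling, and author classification, and shows that GPT-4o performs best on these tasks. Educators and students differ in the potential uses they see for GAI and in their concerns about it.

Details

Motivation: The rapid spread of generative AI (GAI) in education has raised concern about its impact. Understanding how students and educators perceive GAI differently is key to effective policy and technology integration.

Result: GPT-4o reached 90.6% accuracy in sentiment analysis. Twelve latent topics were found, showing that educators' concerns about GAI center on job security and academic integrity, while students are distressed by false accusations of cheating from AI detection tools.

Insight: The differing attitudes of educators and students toward GAI highlight the tension between innovation and oversight. The study calls for more transparent policies and support mechanisms, and demonstrates the potential of LLM frameworks for analyzing community discourse.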

Abstract: Generative AI (GAI) technologies are quickly reshaping the educational landscape. As adoption accelerates, understanding how students and educators perceive these tools is essential. This study presents one of the most comprehensive analyses to date of stakeholder discourse dynamics on GAI in education using social media data. Our dataset includes 1,199 Reddit posts and 13,959 corresponding top-level comments. We apply sentiment analysis, topic modeling, and author classification. To support this, we propose and validate a modular framework that leverages prompt-based large language models (LLMs) for analysis of online social discourse, and we evaluate this framework against classical natural language processing (NLP) models. Our GPT-4o pipeline consistently outperforms prior approaches across all tasks. For example, it achieved 90.6% accuracy in sentiment analysis against gold-standard human annotations. Topic extraction uncovered 12 latent topics in the public discourse with varying sentiment and author distributions. Teachers and students convey optimism about GAI’s potential for personalized learning and productivity in higher education. However, key differences emerged: students often voice distress over false accusations of cheating by AI detectors, while teachers generally express concern about job security, academic integrity, and institutional pressures to adopt GAI tools. These contrasting perspectives highlight the tension between innovation and oversight in GAI-enabled learning environments. Our findings suggest a need for clearer institutional policies, more transparent GAI integration practices, and support mechanisms for both educators and students. More broadly, this study demonstrates the potential of LLM-based frameworks for modeling stakeholder discourse within online communities.

Table of Contents

cs.CV

[1] A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion cs.CV | eess.IV

[2] VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service cs.CV | cs.CL

[3] Weakly-supervised VLM-guided Partial Contrastive Learning for Visual Language Navigation cs.CV

[4] ADAM-Dehaze: Adaptive Density-Aware Multi-Stage Dehazing for Improved Object Detection in Foggy Conditions cs.CV

[5] EchoShot: Multi-Shot Portrait Video Generation cs.CV

[6] Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation cs.CV | cs.LG

[7] Visual symbolic mechanisms: Emergent symbol processing in vision language models cs.CV

[8] MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior cs.CV | cs.AI | eess.IV

[9] Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization cs.CV | cs.AI | cs.MM

[10] Polyline Path Masked Attention for Vision Transformer cs.CV

[11] LBMamba: Locally Bi-directional Mamba cs.CV

[12] Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization cs.CV | cs.AI

[13] PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models cs.CV | cs.GR

[14] Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation cs.CV

[15] STAR-Pose: Efficient Low-Resolution Video Human Pose Estimation via Spatial-Temporal Adaptive Super-Resolution cs.CV

[16] PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning cs.CV

[17] AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models cs.CV

[18] FastInit: Fast Noise Initialization for Temporally Consistent Video Generation cs.CV

[19] Neurosymbolic Object-Centric Learning with Distant Supervision cs.CV

[20] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning cs.CV | cs.AI | cs.CL | cs.LG

[21] MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models cs.CV

[22] Co-Speech Gesture and Facial Expression Generation for Non-Photorealistic 3D Characters cs.CV | I.2.10

[23] Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization cs.CV

[24] Integrating Generative Adversarial Networks and Convolutional Neural Networks for Enhanced Traffic Accidents Detection and Analysis cs.CV

[25] VideoGAN-based Trajectory Proposal for Automated Vehicles cs.CV | cs.LG

[26] FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models cs.CV

[27] R3eVision: A Survey on Robust Rendering, Restoration, and Enhancement for 3D Low-Level Vision cs.CV

[28] Fine-grained Image Retrieval via Dual-Vision Adaptation cs.CV | cs.MM

[29] SycnMapV2: Robust and Adaptive Unsupervised Segmentation cs.CV | cs.AI | cs.LG

[30] Segment Anything for Satellite Imagery: A Strong Baseline and a Regional Dataset for Automatic Field Delineation cs.CV | cs.AI

[31] RealDriveSim: A Realistic Multi-Modal Multi-Task Synthetic Dataset for Autonomous Driving cs.CV

[32] Reliable Few-shot Learning under Dual Noises cs.CV | cs.AI

[33] Transparency Techniques for Neural Networks trained on Writer Identification and Writer Verification cs.CV

[34] MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval cs.CV

[35] Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation cs.CV

[36] AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios cs.CV

[37] CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset cs.CV | cs.AI | cs.LG

[38] HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis cs.CV

[39] Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks cs.CV | cs.AI

[40] Efficient Transformations in Deep Learning Convolutional Neural Networks cs.CV | cs.AI | eess.IV | eess.SP | 68T07, 68T10, 94A08, 42C10

[41] How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? cs.CV

[42] Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors cs.CV | cs.AI | cs.CR

[43] SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage cs.CV

[44] Spatially-Aware Evaluation of Segmentation Uncertainty cs.CV | cs.AI | cs.PF | stat.ML

[45] Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge cs.CV

[46] LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation cs.CV

[47] Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition cs.CV

[48] Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement cs.CV

[49] TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion cs.CV

[50] 3DeepRep: 3D Deep Low-rank Tensor Representation for Hyperspectral Image Inpainting cs.CV | eess.IV

[51] Cross-modal Offset-guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection cs.CV

[52] Uncertainty-Aware Variational Information Pursuit for Interpretable Medical Image Analysis cs.CV

[53] Noise-Informed Diffusion-Generated Image Detection with Anomaly Attention cs.CV

[54] Infrared and Visible Image Fusion Based on Implicit Neural Representations cs.CV

[55] TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration cs.CV | cs.MM

[56] RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought cs.CV

[57] Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation cs.CV

[58] Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes cs.CV

[59] FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation cs.CV

[60] Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection cs.CV | cs.AI

[61] Self-supervised Feature Extraction for Enhanced Ball Detection on Soccer Robots cs.CV

[62] AnyTraverse: An off-road traversability framework with VLM and human operator in the loop cs.CV | cs.AI | cs.RO

[63] Camera Calibration via Circular Patterns: A Comprehensive Framework with Measurement Uncertainty and Unbiased Projection Model cs.CV | cs.RO

[64] Controllable and Expressive One-Shot Video Head Swapping cs.CV

[65] ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware Control cs.CV | cs.AI

[66] With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You cs.CV | cs.AI | cs.LG

[67] LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models cs.CV | cs.LG

[68] Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs cs.CV | cs.AI | cs.CL

[69] Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments cs.CV | cs.LG

[70] A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving cs.CV

[71] Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns cs.CV | eess.IV

[72] Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance cs.CV | cs.NE

[73] Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion cs.CV

[74] MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation cs.CV | cs.AI | cs.CL

[75] RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking cs.CV | cs.RO

[76] Dynamic Watermark Generation for Digital Images using Perimeter Gated SPAD Imager PUFs cs.CV

[77] Semi-Supervised Multi-Modal Medical Image Segmentation for Complex Situations cs.CV

[78] On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting cs.CV