cs.CV [Total: 74]
cs.CL [Total: 27]
cs.MM [Total: 1]
cs.AR [Total: 1]
cs.IR [Total: 1]
cs.GR [Total: 5]
econ.GN [Total: 1]
cs.RO [Total: 5]
cs.DB [Total: 1]
cs.AI [Total: 5]
cs.SE [Total: 1]
cs.LG [Total: 7]

cs.CV

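[1] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis cs.CV

Trong-Thuan Nguyen, Viet-Tham Huynh, Thao Thi Phuong Dao, Ha Nguyen Thi, Tien To Vu Thuy

TL;DR: This paper introduces ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis. It targets automated ENT endoscopy analysis with multimodal (image and text) methods and provides a benchmark dataset and evaluation protocol covering classification and retrieval tasks.

Motivation: Automated ENT endoscopy analysis is held back by variability in devices and operators and by subtle, localized findings. Existing public benchmarks offer little support, so multimodal approaches are needed.

Result: The paper reports results from the top-performing teams and discusses their performance, supporting the effectiveness of multimodal methods for ENT endoscopy analysis.

Insight: Multimodal methods that combine vision and text better meet clinical needs, especially in complex ENT endoscopy scenarios.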

Abstract: Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide an insightful discussion.

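[2] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework cs.CV | cs.AI

Sriram Mandalika, Lalitha V

TL;DR: CoMAD is a lightweight, parameter-free framework that distills knowledge from multiple self-supervised ViT teachers into a compact student network for efficient representation learning.

Motivation: Existing self-supervised learning methods are typically trained in isolation, overlooking complementary information, and yield large models that are hard to deploy in resource-constrained settings.

Result: On ImageNet-1K, ViT-Tiny reaches 75.4% Top-1. The method also sets new state-of-the-art results for compact SSL distillation on ADE20K and MS-COCO.

Insight: By exploiting the complementarity of multiple teachers together with asymmetric masking, CoMAD markedly improves compact models and shows the potential of knowledge distillation in self-supervised learning.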

Abstract: Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student’s space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD’s ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.
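
The asymmetric masking described in the abstract can be illustrated with a minimal sketch. Only the 25 percent student ratio comes from the abstract; the teacher keep ratios (0.5, 0.75, 1.0), the 196-patch grid, and the function names are assumptions chosen for illustration, not the authors' settings.

```python
import random


def sample_visible(num_patches, keep_ratio, rng):
    """Return sorted indices of the patches left visible under a random mask."""
    keep = max(1, round(num_patches * keep_ratio))
    return sorted(rng.sample(range(num_patches), keep))


def asymmetric_masks(num_patches=196, student_keep=0.25,
                     teacher_keeps=(0.5, 0.75, 1.0), seed=0):
    """Draw one heavy mask for the student and a lighter, independent mask per teacher.

    teacher_keeps is a hypothetical schedule: each teacher sees more patches
    than the student, so its features carry richer context.
    """
    rng = random.Random(seed)
    student = sample_visible(num_patches, student_keep, rng)
    teachers = [sample_visible(num_patches, k, rng) for k in teacher_keeps]
    return student, teachers


student_idx, teacher_idx = asymmetric_masks()
print(len(student_idx), [len(t) for t in teacher_idx])
```

With a 14 x 14 patch grid the student keeps 49 patches, while the three hypothetical teachers keep 98, 147, and 196.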

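[3] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence cs.CV

Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan

TL;DR: VER-Bench is a new evaluation framework that assesses how well multimodal large language models (MLLMs) extract fine-grained visual evidence and reason over it.

Motivation: Existing benchmarks either focus on basic perception and lack deep reasoning, or concentrate on prominent image elements and fail to test understanding of subtle visual clues. Genuine visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details.

Result: Experiments show that current models have significant limitations in extracting subtle visual evidence and constructing evidence-based reasoning.

Insight: Genuine visual understanding and human-like analysis require further progress in fine-grained visual evidence extraction, integration, and reasoning.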

Abstract: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., “what is in the image?”), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs’ ability to: 1) identify fine-grained visual clues, occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models’ limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models’ capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.

Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir

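TL;DR: The paper proposes DAMM, a dual-stream attention framework that uses multi-modal queries and dual-stream attention to address occlusion, fine-grained localization, and computational efficiency in object detection.

Motivation: Transformer-based object detectors struggle with occlusions, fine-grained localization, and the computational inefficiency caused by fixed queries and dense attention.

Result: On four challenging benchmarks, DAMM achieves state-of-the-art average precision (AP) and recall.

Insight: Combining multi-modal query adaptation with dual-stream attention effectively improves object detection in complex traffic scenes.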

Abstract: Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: https://github.com/DET-LIP/DAMM.

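[5] Revealing Temporal Label Noise in Multimodal Hateful Video Classification cs.CV | cs.AI

Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao

TL;DR: The paper studies temporal label noise in multimodal hateful video classification. Using refined annotations and experimental analysis, it exposes the limits of coarse video-level labels and argues for temporally aware models.

Motivation: The rapid growth of online multimedia content has intensified the spread of hate speech, yet existing methods rely on coarse video-level annotations that ignore the temporal detail of hateful content, which introduces substantial label noise.

Result: Experiments show that temporal label noise significantly alters model decision boundaries and lowers classification confidence, highlighting the context dependency and temporal continuity of hate speech expression.

Insight: The temporal dynamics of hateful videos call for temporally aware models and benchmarks to improve robustness and interpretability.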

Abstract: The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
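
To make the label-noise argument concrete, the sketch below computes how much of a video labeled "hateful" is actually covered by annotated hateful segments. The function name and the example timestamps are hypothetical, and this is not the authors' preprocessing code.

```python
def hateful_fraction(video_duration, hateful_segments):
    """Fraction of the video covered by annotated hateful segments.

    Segments are (start, end) pairs in seconds; overlapping segments are merged
    so that no second of video is counted twice.
    """
    merged = []
    for start, end in sorted(hateful_segments):
        start, end = max(0.0, start), min(video_duration, end)
        if end <= start:
            continue  # segment lies outside the video or is empty
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start for start, end in merged)
    return covered / video_duration


# Hypothetical 120-second video with three annotated segments (two overlap).
frac = hateful_fraction(120.0, [(10.0, 25.0), (20.0, 40.0), (100.0, 110.0)])
print(f"hateful: {frac:.2%}, non-hateful under a video-level label: {1 - frac:.2%}")
```

In this made-up case only a third of the video is hateful, so a video-level label would mislabel the remaining two thirds of its duration.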

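[6] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations cs.CV

Zahidul Islam, Sujoy Paul, Mrigank Rochan

TL;DR: The paper proposes Highlight-TTA, a test-time adaptation framework that dynamically adapts a video highlight detection model to the characteristics of each test video, improving generalization and detection performance.

Motivation: Existing video highlight detection methods cannot adapt to the diversity and unique characteristics of test videos, which degrades performance.

Result: Across three benchmark datasets and three state-of-the-art highlight detection models, adding Highlight-TTA improves performance.

Insight: Test-time adaptation combined with multi-task learning effectively improves the generalization and accuracy of video highlight detection.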

Abstract: Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.

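[7] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens cs.CV | cs.AI | cs.LG

Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park

TL;DR: The paper extends depth estimation models trained on perspective images to fisheye cameras by using calibration tokens to modulate latent embeddings, avoiding retraining or fine-tuning.

Motivation: Foundational monocular depth estimators (FMDEs) are trained on large-scale perspective images but cannot directly handle the distribution shift caused by the changed camera parameters of fisheye cameras, which leads to erroneous depth estimates.

Result: In both indoor and outdoor scenes, the method consistently outperforms existing techniques while using a single set of tokens for both.

Insight: Modulating latent embeddings, rather than recalibrating or projecting in image space, avoids the artifacts and information loss of conventional approaches and makes full use of the expressive capacity FMDEs already have.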

Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, in both indoor and outdoor settings, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
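
The step of recalibrating perspective images to fisheye images rests on a change of projection model. The abstract does not say which fisheye model is used; the sketch below assumes the common equidistant model (r = f·θ) against the pinhole model (r = f·tan θ), and the function name is hypothetical.

```python
import math


def perspective_to_equidistant_radius(r_persp, focal):
    """Map a radial distance in a perspective image to an equidistant fisheye image.

    Pinhole projection:      r_persp = focal * tan(theta)
    Equidistant projection:  r_fish  = focal * theta
    where theta is the angle between the incoming ray and the optical axis.
    """
    theta = math.atan2(r_persp, focal)
    return focal * theta


focal = 500.0  # hypothetical focal length in pixels
for r in (0.0, 100.0, 500.0, 1000.0):
    print(r, round(perspective_to_equidistant_radius(r, focal), 2))
```

The mapping is close to the identity near the image center and compresses points increasingly toward the periphery, which is the kind of distribution shift the calibration tokens are meant to absorb.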

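[8] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models cs.CV

Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

TL;DR: ProMIM combines masked image modeling (MIM) with existing vision-language models (VLMs) in a framework that enhances conditional prompt learning, improving generalization at almost no extra computational cost.

Motivation: Existing prompt learning methods such as CoOp and CoCoOp tend to overfit to known classes when adapting to new tasks, which limits generalization.

Result: Experiments show that ProMIM consistently improves the generalization of existing methods on both zero-shot and few-shot classification tasks.

Insight: MIM can be integrated effectively into conditional prompt learning, giving a lightweight and efficient way to improve generalization.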

Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.

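[9] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring cs.CV | cs.AI

Zhu Xu, Ting Lei, Zhimin Li, Guan Wang, Qingchao Chen

TL;DR: The paper proposes TRKT, which uses temporal-enhanced, relation-aware knowledge transfer to address the problems caused by external object detectors in weakly supervised dynamic scene graph generation (WS-DSGG), substantially improving performance.

Motivation: Existing WS-DSGG methods rely on detectors trained on static images, which perform poorly in dynamic, relation-aware scenarios and produce inaccurate localization and low-confidence proposals.

Result: TRKT achieves state-of-the-art performance on the Action Genome dataset.

Insight: TRKT improves WS-DSGG through relation awareness and temporal information, offering a new direction for similar tasks.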

Abstract: Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.

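[10] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics cs.CV | cs.AI

Stella Su, Marc Harary, Scott J. Rodig, William Lotter

TL;DR: AdvDINO is a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to learn domain-invariant features. It is suited to fields such as biomedical imaging where domain shift is present.

Motivation: The robustness of standard self-supervised learning methods to domain shift (systematic differences across data sources) remains uncertain, especially in biomedical imaging, where batch effects can obscure true biological signals.

Result: Across more than 5.46 million mIF image tiles, AdvDINO uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning.

Insight: Beyond biomedical imaging, AdvDINO is applicable to other domains (such as radiology, remote sensing, and autonomous driving) where domain shift and limited annotated data are obstacles.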

Abstract: Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift – systematic differences across data sources – remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across more than 5.46 million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains – including radiology, remote sensing, and autonomous driving – where domain shift and limited annotated data hinder model generalization and interpretability.
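
A gradient reversal layer, the component the abstract adds to DINOv2, is easy to state on its own: it is the identity in the forward pass and multiplies the gradient by a negative factor in the backward pass. The sketch below is a framework-free illustration with hand-written forward and backward methods; the class name and the `lam` parameter are illustrative and not taken from the AdvDINO code.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped (and scaled) gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the adversarial signal (assumed hyperparameter)

    def forward(self, features):
        # The domain classifier sees the features unchanged.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the negated gradient, so it is pushed
        # to make the domain (e.g. slide) harder to predict.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=2.0)
print(grl.forward([1.0, -2.0]))   # features pass through unchanged
print(grl.backward([0.5, -1.0]))  # gradient is scaled by -lam
```

In a domain-adversarial setup the layer sits between the feature extractor and a domain classifier, so minimizing the classifier's loss simultaneously trains the extractor to remove domain-specific cues.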

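[11] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception cs.CV

Md Iftekharul Islam Sakib, Yigong Hu, Tarek Abdelzaher

TL;DR: The paper presents CSRAP, an enhanced canvas attention scheduling method that uses variable-size canvas frames and selectable frame rates to improve the performance/resource trade-off of real-time mission-critical perception. Experiments on an NVIDIA Jetson Orin Nano show higher mAP and recall than existing methods.

Motivation: Real-time perception on edge platforms must balance high-resolution object detection against strict latency constraints. Existing canvas attention scheduling methods leave room to improve performance and resource efficiency.

Result: Experiments on an NVIDIA Jetson Orin Nano show that CSRAP outperforms existing methods in mAP and recall, achieving better quality/cost trade-offs.

Insight: Variable canvas frames and flexible frame rates give real-time perception systems additional degrees of freedom for optimization, improving performance in resource-constrained environments.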

Abstract: Real-time perception on edge platforms faces a core challenge: executing high-resolution object detection under stringent latency constraints on limited computing resources. Canvas-based attention scheduling was proposed in earlier work as a mechanism to reduce the resource demands of perception subsystems. It consolidates areas of interest in an input data frame onto a smaller area, called a canvas frame, that can be processed at the requisite frame rate. This paper extends prior canvas-based attention scheduling literature by (i) allowing for variable-size canvas frames and (ii) employing selectable canvas frame rates that may depart from the original data frame rate. We evaluate our solution by running YOLOv11, as the perception module, on an NVIDIA Jetson Orin Nano to inspect video frames from the Waymo Open Dataset. Our results show that the additional degrees of freedom improve the attainable quality/cost trade-offs, thereby allowing for a consistently higher mean average precision (mAP) and recall with respect to the state of the art.
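
The idea of consolidating areas of interest onto a smaller canvas frame can be pictured as a rectangle-packing problem. The sketch below uses a simple greedy shelf-packing heuristic, which is an assumption made for illustration; the abstract does not describe CSRAP's actual packing or scheduling algorithm, and the function name and sizes are hypothetical.

```python
def pack_regions(regions, canvas_w, canvas_h):
    """Greedy shelf packing of (width, height) regions onto a canvas.

    Regions are placed left to right on horizontal shelves, tallest first.
    Returns {region_index: (x, y)}; regions that do not fit are left out.
    """
    placements = {}
    x = y = shelf_h = 0
    order = sorted(range(len(regions)), key=lambda i: -regions[i][1])
    for i in order:
        w, h = regions[i]
        if w > canvas_w:
            continue  # wider than the canvas, can never fit
        if x + w > canvas_w:
            # Current shelf is full: open a new one below it.
            x, y, shelf_h = 0, y + shelf_h, 0
        if y + h > canvas_h:
            continue  # no vertical room left for this region
        placements[i] = (x, y)
        x += w
        shelf_h = max(shelf_h, h)
    return placements


# Four hypothetical regions of interest packed into a 512 x 384 canvas.
rois = [(300, 200), (200, 150), (250, 100), (100, 50)]
layout = pack_regions(rois, canvas_w=512, canvas_h=384)
print(layout)
```

A variable-size canvas, as proposed in the paper, would correspond to choosing `canvas_w` and `canvas_h` per frame rather than fixing them in advance.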

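[12] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression cs.CV

Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji

TL;DR: SODEC is a single-step diffusion image compression model. It uses a pre-trained VAE to produce information-rich latents and replaces iterative denoising with single-step decoding, greatly speeding up decoding. A fidelity guidance module and a rate annealing training strategy further improve performance.

Motivation: Existing diffusion-based image compression methods suffer from high decoding latency and poor fidelity, which limits practical use.

Result: SODEC outperforms existing methods in rate-distortion-perception performance and decodes more than 20× faster than previous diffusion-based compression models.

Insight: In image compression, a sufficiently informative latent removes the need for multi-step refinement, making single-step decoding feasible.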

Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20×. Code is released at: https://github.com/zhengchen1999/SODEC.

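[13] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion cs.CV

Shenglun Chen, Xinzhu Ma, Hong Zhang, Haojie Li, Zhihui Wang

TL;DR: The paper proposes a depth completion framework built on depth foundation models. It propagates sparse depth in both 3D and 2D space and adds a learnable correction module, markedly improving robustness in out-of-distribution (OOD) scenarios.

Motivation: Existing depth completion models rely on limited data and perform poorly in OOD scenarios. Depth foundation models show excellent robustness thanks to large-scale training, so using them to make depth completion more robust is a promising direction.

Result: Trained on NYUv2 and KITTI and evaluated on 16 other datasets, the model performs well in OOD scenarios and outperforms existing state-of-the-art methods.

Insight: Combining environmental cues extracted by a depth foundation model with dual-space propagation improves the OOD robustness of depth completion without requiring large-scale training.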

Abstract: Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.

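[14] Unified modality separation: A vision-language framework for unsupervised domain adaptation cs.CV

Xinyao Li, Jingjing Li, Zhekai Du, Lei Zhu, Heng Tao Shen

TL;DR: The paper proposes a unified modality separation framework for unsupervised domain adaptation (UDA). It separates modality-specific and modality-invariant components and designs a modality discrepancy metric to improve target performance.

Motivation: Because of the modality gap, existing UDA methods based on vision-language models (VLMs) transfer only modality-invariant knowledge, leading to suboptimal target-domain performance.

Result: Across various backbones, baselines, datasets, and adaptation settings, the method achieves up to a 9% performance gain and 9 times the computational efficiency.

Insight: Modality separation and the discrepancy metric make it possible to use both modality-invariant and modality-specific knowledge, improving UDA performance.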

Abstract: Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.

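[15] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks cs.CV

Yue Li, Weifan Wang, Tai Sing Lee

TL;DR: The paper examines how fast weights in a Vision Transformer (ViT) autoencoder can model the rapid learning of global image context in the early visual cortex. Fast weights are implemented with Low-Rank Adaptation (LoRA) and are found to increase the network's sensitivity to global context.

Motivation: Recent neurophysiological studies show that the early visual cortex can rapidly learn global image context, but the mechanism is not yet well understood. The paper uses a computational model to test the role of fast weights in this process.

Result: (1) The ViT autoencoder's self-attention performs a manifold transform similar to that of a neural circuit model of the familiarity effect. (2) Familiarity training aligns representations in early layers with those of the top layer, which contains global information. (3) LoRA-based fast weights significantly amplify these effects.

Insight: A hybrid fast-and-slow weight architecture may be a viable computational model for studying rapid global context learning in the brain, and it offers a new direction for neural network interpretability.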

Abstract: Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, as evidenced by a sparsification of population responses and a reduction in mean activity when exposed to familiar versus novel image contexts. This phenomenon has been attributed primarily to local recurrent interactions, rather than changes in feedforward or feedback pathways, supported by both empirical findings and circuit-level modeling. Recurrent neural circuits capable of simulating these effects have been shown to reshape the geometry of neural manifolds, enhancing robustness and invariance to irrelevant variations. In this study, we employ a Vision Transformer (ViT)-based autoencoder to investigate, from a functional perspective, how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. We hypothesize that rapid learning operates via fast weights, which encode transient or short-term memory traces, and we explore the use of Low-Rank Adaptation (LoRA) to implement such fast weights within each Transformer layer. Our results show that (1) The proposed ViT-based autoencoder’s self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer that contains global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. (4) These effects are significantly amplified by LoRA-based fast weights. Together, these findings suggest that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.
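
The fast-and-slow weight idea maps naturally onto the standard LoRA formulation, in which a frozen weight matrix W is augmented by a low-rank update, W_eff = W + (alpha / r) · B · A. The sketch below shows that arithmetic in plain Python; treating W as the "slow" weights and B·A as the "fast" weights follows the abstract's framing, while the function names and the toy numbers are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(w_slow, b, a, alpha):
    """Return W + (alpha / r) * B @ A.

    w_slow: out_dim x in_dim frozen ("slow") weights
    b:      out_dim x r      low-rank factor
    a:      r x in_dim       low-rank factor
    Only b and a would be updated during rapid (familiarity) learning.
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_slow, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]          # slow weights (identity, for illustration)
b = [[1.0], [2.0]]                    # rank-1 fast-weight factors
a = [[0.5, -0.5]]
w_eff = lora_effective_weight(w, b, a, alpha=1.0)
print(w_eff)
```

Because the update has rank r, a rank-1 pair for a d x d layer stores 2d fast parameters instead of d², which is what makes per-layer fast weights cheap to adapt.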

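[16] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification cs.CV

Rui Zhi, Zhen Yang, Haiyang Zhang

TL;DR: The paper proposes Attribute-Guide ReID (AG-ReID), a framework that uses the inherent capabilities of pre-trained models to extract fine-grained semantic attributes. It addresses the neglect of attribute information in occluded person re-identification and improves performance without additional data or annotations.

Motivation: In occluded person re-identification, pre-trained vision-language models tend to focus on holistic image semantics and neglect fine-grained attributes, which hurts recognition of partially occluded pedestrians and of individuals with subtle appearance differences.

Result: AG-ReID achieves state-of-the-art results on multiple widely used Re-ID datasets, with significant improvements in handling occlusions and subtle attribute differences.

Insight: The inherent capabilities of pre-trained models can be used directly to extract fine-grained semantic attributes. Combined with a dual-guidance mechanism, this efficiently improves Re-ID performance under occlusion.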

Abstract: Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pre-trained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models’ inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.

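[17] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression cs.CV | cs.LG | cs.PF

Shivani Mall, Joao F. Henriques

TL;DR: The paper proposes CRAM, which tackles the high memory demands of video continual learning by compressing video data (storing video codes instead of raw inputs) and mitigates catastrophic forgetting by refreshing those codes. Its efficiency is demonstrated on EpicKitchens-100 and Kinetics-700.

Motivation: Video continual learning (CL) faces high memory requirements, made worse by long videos and continual streams. The paper aims to cut memory use by storing compressed video codes while also handling catastrophic forgetting.

Result: CRAM outperforms existing methods on EpicKitchens-100 and Kinetics-700 with a much smaller memory footprint (thousands of relatively long videos stored in under 2 GB).

Insight: Storing compressed video codes is an effective way to address the memory problem in video CL, but the compression network itself is subject to catastrophic forgetting. The refresh mechanism offers a workable solution.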

Abstract: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.
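
The code-refresh step (decompress with the previous network, recompress with the new one) can be sketched with a toy stand-in for the learned compressor. The uniform quantizer below is purely illustrative and is not the paper's video compressor; `ToyCompressor` and `refresh_buffer` are hypothetical names.

```python
class ToyCompressor:
    """Stand-in for a learned compressor: uniform quantization with a given step."""

    def __init__(self, step):
        self.step = step

    def encode(self, frame):
        return [round(v / self.step) for v in frame]

    def decode(self, code):
        return [c * self.step for c in code]


def refresh_buffer(buffer, old, new):
    """Re-express every stored code in the new compressor's code space.

    Each code is decoded with the compressor version that produced it and then
    re-encoded with the updated one, so the buffer stays readable after the
    compressor has changed.
    """
    return [new.encode(old.decode(code)) for code in buffer]


old, new = ToyCompressor(step=0.5), ToyCompressor(step=0.25)
buffer = [old.encode([1.0, 2.0, 3.5])]       # codes written by the old version
refreshed = refresh_buffer(buffer, old, new)
print(buffer, refreshed, new.decode(refreshed[0]))
```

Without the refresh, decoding the old codes with the new compressor would yield values half as large in this toy case, which mirrors how stale codes become unreadable as the real network drifts.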

[18] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation cs.CV

Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song

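TL;DR: The paper proposes MCDRL, a framework that combines causal inference with vision-language models to address domain generalization in medical image segmentation, significantly improving performance on unseen domains.

Details

Motivation: Medical images exhibit large domain shifts (e.g., equipment differences and imaging modes), and the zero-shot capabilities of conventional VLMs are limited in the medical domain, so new methods are needed to improve generalization.

Result: Experiments show that MCDRL outperforms existing methods on segmentation tasks, with higher accuracy and robust generalization.

Insight: Combining causal inference with VLMs can effectively address domain shift in medical images and offers a new direction for cross-modal medical analysis.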

Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP’s cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

[19] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content cs.CV | eess.IV

Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong

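TL;DR: The paper introduces AU-IQA, a benchmark dataset for assessing the perceptual quality of AI-enhanced user-generated content (AI-UGC), addressing the lack of quality assessment models designed for this setting.

Details

Motivation: AI-based enhancement is widely used in visual applications, but the lack of quality assessment models specific to AI-UGC limits user experience and slows progress on enhancement methods.

Result: The paper provides a comprehensive analysis of how current assessment methods perform on AI-UGC, giving future work a baseline.

Insight: Assessing AI-UGC quality requires accounting for characteristics of both traditional UGC and AIGC; existing methods still have room for improvement on this task.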

Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant bottleneck in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. AU-IQA is available at https://github.com/WNNGGU/AU-IQA-Dataset.

[20] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes cs.CV | cs.AI | cs.LG

Sadia Kamal, Tim Oates, Joy Wan

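TL;DR: The paper proposes Skin-SOAP, a weakly supervised multimodal framework that generates structured SOAP clinical notes from limited inputs (such as lesion images and sparse clinical text), reducing reliance on manual annotation and easing clinician workload.

Details

Motivation: Skin cancer is the most common cancer worldwide, with annual healthcare spending of over $8 billion. Early diagnosis and timely treatment are critical to patient survival. Writing SOAP (Subjective, Objective, Assessment, and Plan) notes by hand is time-consuming and adds to clinicians' workload.

Result: Performance on clinical relevance metrics is comparable to advanced models such as GPT-4o and Claude.

Insight: 1. Weak supervision reduces dependence on large amounts of annotated data; 2. Multimodal input improves the quality of the generated structured clinical notes; 3. The new metrics assess clinical relevance more accurately.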

Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis and accurate, timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate this clinical relevance, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.

[21] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID cs.CV

Yiyang Su, Yunping Shi, Feng Liu, Xiaoming Liu

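TL;DR: HAMoBE is a hierarchical and adaptive mixture-of-biometric-experts framework that improves video-based ReID through multi-layer features and a dynamic decision gating network.

Details

Motivation: Existing video ReID methods fail to extract and integrate the most discriminative features from a query-gallery pair. HAMoBE aims to mimic human perceptual mechanisms by independently modeling and dynamically integrating key biometric features such as appearance, static body shape, and dynamic gait.

Result: On benchmarks such as MEVID, HAMoBE significantly outperforms existing methods, e.g., a 13.0% gain in Rank-1 accuracy.

Insight: By hierarchically modeling and dynamically integrating multiple biometric cues, HAMoBE addresses feature selection and fusion in video ReID.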

Abstract: Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, which are essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this issue, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-layer features from a pre-trained large model (e.g., CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features (appearance, static body shape, and dynamic gait) and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-layer representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (e.g., +13.0% Rank-1 accuracy).
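The idea of a gate that reweights expert contributions can be sketched as a softmax-weighted sum of per-expert similarity scores. This is an illustration under assumptions, not HAMoBE's network: the function names, the expert keys, and the gate logits (which a real gating network would predict from the query and gallery inputs) are all hypothetical.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_scores(expert_scores: Dict[str, float],
                       gate_logits: Dict[str, float]) -> float:
    """Combine per-expert query-gallery similarities using gate weights."""
    names = sorted(expert_scores)
    weights = softmax([gate_logits[n] for n in names])
    return sum(w * expert_scores[n] for w, n in zip(weights, names))


# Hypothetical similarities for one query-gallery pair.
scores = {"appearance": 0.2, "body_shape": 0.7, "gait": 0.9}

# Equal gate logits reduce to a plain average of the experts.
uniform = fuse_expert_scores(scores, {"appearance": 0.0, "body_shape": 0.0, "gait": 0.0})

# A gate that trusts gait more (e.g., when clothing has changed) shifts the fused score.
gait_heavy = fuse_expert_scores(scores, {"appearance": -2.0, "body_shape": 0.0, "gait": 2.0})
```

Here the gate logits are supplied by hand so the effect of the weighting is easy to see.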

[22] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details? cs.CV

Parth Thakkar, Ankush Agarwal, Prasad Kasu, Pulkit Bansal, Chaitanya Devaguptapu

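TL;DR: The paper introduces NiM, a benchmark for evaluating how well multimodal large language models (MLLMs) locate and understand fine-grained details in complex documents, and proposes Spot-IT, a simple yet effective method that improves model performance through intelligent patch selection and Gaussian attention.

Details

Motivation: Although MLLMs perform well on document understanding tasks, their ability to locate and understand fine details in complex documents has not been studied in depth. The paper aims to fill this gap.

Result: Experiments show that Spot-IT significantly outperforms baseline methods at extracting fine-grained details from documents with complex layouts.

Insight: Current MLLMs remain limited on fine-grained tasks, but targeted focusing and attention mechanisms can improve performance.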

Abstract: While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs' capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs' capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.
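One way to picture "Gaussian attention" around a selected patch is a normalized 2D Gaussian weight map over a patch grid. The abstract does not specify how Spot-IT builds or applies its attention, so the sketch below is an assumption throughout: the function name, the grid layout, and the `sigma` parameter are hypothetical.

```python
import math
from typing import List, Tuple


def gaussian_attention(rows: int, cols: int,
                       center: Tuple[int, int], sigma: float) -> List[List[float]]:
    """Return a rows x cols weight map that peaks at `center` and sums to 1."""
    cy, cx = center
    weights = [
        [math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2.0 * sigma ** 2))
         for c in range(cols)]
        for r in range(rows)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


# A 5x5 patch grid with the selected patch in the middle.
attn = gaussian_attention(5, 5, center=(2, 2), sigma=1.0)
```

A smaller `sigma` concentrates weight on the selected patch; a larger one keeps more of the surrounding context.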

[23] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion cs.CV

Yifeng Huang, Zhang Chen, Yi Xu, Minh Hoai, Zhong Li

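TL;DR: DualMat is a dual-path diffusion framework that estimates PBR materials from a single image using two distinct latent spaces (RGB and material-specific), uses feature distillation and rectified flow to improve performance, and supports high-resolution and multi-view inputs.

Details

Motivation: Accurately estimating PBR materials from a single image under complex lighting is a challenging task in computer vision and graphics. Existing methods fall short on material properties (such as metallic and roughness) and on separating materials from lighting.

Result: State-of-the-art performance on Objaverse and real-world data, with up to 28% improvement in albedo estimation and a 39% reduction in metallic-roughness prediction error.

Insight: The dual-path design with feature distillation effectively separates material and lighting information; rectified flow significantly improves the inference efficiency of the diffusion model.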

Abstract: We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.

[24] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks cs.CV | cs.AI | cs.LG | eess.IV

Ruiyu Li, Changyuan Qiu, Hangrui Cao, Qihan Ren, Yuqing Qiu

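TL;DR: The paper explores automatic image colorization with convolutional neural networks (CNNs) and generative adversarial networks (GANs), emphasizing the importance of semantics and texture for color.

Details

Motivation: Image colorization is a highly ill-posed problem, but semantics and texture provide important cues. Traditional regression approaches ignore the multimodal nature of color prediction.

Result: The proposed approach is reported to handle the multimodal nature of color prediction better and to improve colorization quality.

Insight: Semantic and texture information is essential for colorization, and adversarial learning can capture the multimodal nature of the color distribution.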

Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization was initially formulated as a regression task [5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.

[25] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer cs.CV

Jian Zhu, Shanyuan Liu, Liuzhuozheng Li, Yue Gong, He Wang

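TL;DR: FLUX-Makeup is a high-fidelity, identity-consistent, and robust makeup transfer framework based on a diffusion transformer. It works directly from source-reference image pairs with a lightweight makeup feature injector, avoiding any dependence on auxiliary modules.

Details

Motivation: Existing GAN and diffusion approaches to makeup transfer require complex loss functions or auxiliary modules, which tend to introduce extra errors. FLUX-Makeup aims to achieve high-quality transfer directly from image pairs, avoiding the weaknesses of auxiliary modules.

Result: State-of-the-art performance across diverse scenarios, with robust and identity-consistent transfer.

Insight: Using image pairs directly with lightweight feature injection outperforms approaches that depend on auxiliary modules; high-quality data is critical for transfer tasks.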

Abstract: Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.

[26] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation cs.CV

Jingxuan He, Busheng Su, Finn Wong

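TL;DR: PoseGen is a framework that generates videos of arbitrary length from a single reference image and a driving pose sequence, addressing identity drift and duration limits.

Details

Motivation: Current diffusion models suffer from identity drift and are limited to short clips when generating long videos. PoseGen aims to solve this by combining identity preservation with fine-grained pose control.

Result: Trained on only 33 hours of video, PoseGen significantly outperforms existing methods in identity fidelity, pose accuracy, and long-video generation.

Insight: By decoupling identity and pose control at different levels, PoseGen shows that efficient long-video generation is possible with small-scale data.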

Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.

[27] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning cs.CV

Liang Bai, Hong Song, Jinfu Li, Yucong Lin, Jingfan Fan

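TL;DR: The paper proposes Sculpting Margin Penalty (SMP), a method for balancing base-class discriminability and new-class generalization in Few-Shot Class-Incremental Learning (FSCIL). Using the Margin-aware Intra-task Adapter Merging (MIAM) mechanism and the Margin Penalty-based Classifier Calibration (MPCC) strategy, SMP achieves state-of-the-art performance on multiple datasets.

Details

Motivation: In real-world applications, data privacy and acquisition costs limit the training data available for incremental tasks, which degrades performance. Existing methods struggle to balance base-class discriminability and new-class generalization, and decision boundaries in incremental tasks are ambiguous.

Result: On CIFAR100, ImageNet-R, and CUB200, SMP achieves state-of-the-art performance while keeping a better balance between base and new classes.

Insight: 1. Integrating margin penalties at different stages significantly improves FSCIL performance; 2. Adaptively merging adapters is an effective way to improve forward compatibility; 3. Refining decision boundaries is critical for incremental learning tasks.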

Abstract: Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning, which prospectively prepares for future tasks during base task training, has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL). However, existing methods still struggle to balance base-class discriminability and new-class generalization. Moreover, limited access to original data during incremental tasks often results in ambiguous inter-class decision boundaries. To address these challenges, we propose SMP (Sculpting Margin Penalty), a novel FSCIL method that strategically integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. Specifically, we introduce the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning. MIAM trains two sets of low-rank adapters with distinct classification losses: one with a margin penalty to enhance base-class discriminability, and the other without margin constraints to promote generalization to future new classes. These adapters are then adaptively merged to improve forward compatibility. For incremental tasks, we propose a Margin Penalty-based Classifier Calibration (MPCC) strategy to refine decision boundaries by fine-tuning classifiers on all seen classes’ embeddings with a margin penalty. Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes.
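To make the notion of a margin penalty concrete, the sketch below computes a cross-entropy loss over scaled cosine similarities, with and without a margin subtracted from the target class. The abstract does not state which margin formulation SMP uses; the additive cosine margin, the `scale` value, and the function name are assumptions chosen for illustration.

```python
import math
from typing import List


def margin_cross_entropy(cosines: List[float], target: int,
                         margin: float = 0.0, scale: float = 16.0) -> float:
    """Cross-entropy over scaled cosine logits, with `margin` subtracted from the target class."""
    logits = [scale * (c - margin if i == target else c)
              for i, c in enumerate(cosines)]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


# Hypothetical cosine similarities between one embedding and three class prototypes.
cosines = [0.8, 0.3, 0.1]

plain_loss = margin_cross_entropy(cosines, target=0)               # no margin constraint
margin_loss = margin_cross_entropy(cosines, target=0, margin=0.3)  # with margin penalty
```

The margin makes the same prediction cost more, so training has to push the target similarity further above the others. In MIAM terms, one adapter set would be trained with such a penalty and the other without it.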

Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot

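TL;DR: The paper proposes FedGIN, a federated learning framework that enables multimodal organ segmentation through a Global Intensity Non-linear (GIN) augmentation module without sharing raw patient data, significantly improving cross-modality generalization.

Details

Motivation: Medical image segmentation is essential to AI-assisted diagnostics, but domain shift between modalities, data scarcity, and privacy restrictions hinder the development of a unified model.

Result: In the limited-data scenario, Dice scores improve by 12% to 18%; in the complete-data scenario, performance approaches centralized training, with Dice improvements of 30% over the MRI-only baseline and 10% over the CT-only baseline.

Insight: By combining intensity harmonization with federated learning, FedGIN significantly improves multimodal segmentation while preserving privacy, offering a new direction for clinical AI applications.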

Abstract: Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines. In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints.
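The Dice score used to report these results is the standard overlap measure 2|A∩B| / (|A| + |B|) between a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and a 3D volume would simply be flattened voxel by voxel first.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (values 0 or 1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
score = dice_score(pred_mask, true_mask)  # 2 * 1 / (2 + 1)
```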

[29] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models cs.CV

Yu-Hsi Chen, Wei-Hsin Chen, Chien-Yao Wang, Hong-Yuan Mark Liao, James C. Liao

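TL;DR: The paper proposes a deep learning framework for automatically analyzing chronic pain behavior in mice without relying on human-annotated behavioral features. It substantially improves classification accuracy, and drug testing supports its clinical potential.

Details

Motivation: Existing methods for analyzing chronic pain behavior rely on manual labeling, but humans lack a clear understanding of which behaviors best represent chronic pain, which limits accuracy. The study aims to overcome labeling bias by extracting behavioral features automatically.

Result: 48.41% accuracy on a 15-class pain classification task, well above human experts (21.33%) and B-SOiD (30.52%). On a simplified 3-class task, accuracy rises to 73.1%.

Insight: The method improves the accuracy of behavior analysis and offers a new perspective for pain research and drug development, showing the potential of deep learning in preclinical studies.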

Abstract: Assessing chronic pain behavior in mice is critical for preclinical studies. However, existing methods mostly rely on manual labeling of behavioral features, and humans lack a clear understanding of which behaviors best represent chronic pain. For this reason, existing methods struggle to accurately capture the insidious and persistent behavioral changes in chronic pain. This study proposes a framework to automatically discover features related to chronic pain without relying on human-defined action labels. Our method uses a universal action space projector to automatically extract mouse action features, and avoids the potential bias of human labeling by retaining the rich behavioral information in the original video. We also collected a mouse pain behavior dataset that captures the disease progression of both neuropathic and inflammatory pain across multiple time points. Our method achieves 48.41% accuracy in a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%). Furthermore, when the classification is simplified to only three categories, i.e., neuropathic pain, inflammatory pain, and no pain, our method achieves an accuracy of 73.1%, which is notably higher than that of human experts (48%) and B-SOiD (58.43%). Finally, our method revealed differences in drug efficacy for different types of pain on zero-shot Gabapentin drug testing, and the results were consistent with past drug efficacy literature. This study demonstrates the potential clinical application of our method, which can provide new insights into pain research and related drug development.

[30] Rotation Equivariant Arbitrary-scale Image Super-Resolution cs.CV

Qi Xie, Jiahong Fu, Zongben Xu, Deyu Meng

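TL;DR: The paper proposes a rotation-equivariant method for arbitrary-scale image super-resolution (ASISR). By redesigning the INR and encoder modules, it achieves end-to-end rotation equivariance from input to output, better preserving the original orientation and structural integrity of geometric patterns.

Details

Motivation: In existing ASISR methods, geometric patterns (such as textures, edges, or shapes) in low-resolution images are prone to deformation and artifacts. Rotation equivariance has been shown to preserve the orientation and structural integrity of such patterns, so the authors introduce it into ASISR networks.

Result: Experiments show that the method performs well on both simulated and real datasets and can further improve existing ASISR methods in a plug-and-play manner.

Insight: Rotation equivariance significantly improves the fidelity of recovered geometric patterns in ASISR, and the property could apply broadly to other image restoration tasks.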

Abstract: The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug & play manner to further enhance their performance.
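Rotation equivariance means that rotating the input and then applying the operator gives the same result as applying the operator and then rotating the output. The sketch below measures that gap for 90-degree rotations on a small grid. The two toy operators and all function names are hypothetical; the paper's analysis concerns its own network and is not reproduced here.

```python
from typing import Callable, List

Image = List[List[float]]


def rot90(img: Image) -> Image:
    """Rotate a 2D grid by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def neighbor_mean(img: Image) -> Image:
    """Average of the four axis-aligned neighbors with zero padding (symmetric under 90-degree turns)."""
    h, w = len(img), len(img[0])

    def at(r: int, c: int) -> float:
        return img[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return [[(at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1)) / 4.0
             for c in range(w)] for r in range(h)]


def horizontal_diff(img: Image) -> Image:
    """Difference with the left neighbor: a direction-dependent operator."""
    return [[row[c] - (row[c - 1] if c > 0 else 0.0) for c in range(len(row))]
            for row in img]


def equivariance_error(op: Callable[[Image], Image], img: Image) -> float:
    """Largest absolute gap between op(rot90(img)) and rot90(op(img))."""
    a, b = op(rot90(img)), rot90(op(img))
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
err_symmetric = equivariance_error(neighbor_mean, image)
err_directional = equivariance_error(horizontal_diff, image)
```

The symmetric filter commutes with the rotation, so its error is zero, while the directional filter does not. An equivariant network aims to behave like the first operator for every layer.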

[31] X-MoGen: Unified Motion Generation across Humans and Animals cs.CV

Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su

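TL;DR: X-MoGen is the first unified framework for text-driven motion generation across humans and animals. It achieves cross-species generation with a two-stage architecture and a morphological consistency loss, and introduces the UniMo4D dataset to support training.

Details

Motivation: Existing methods typically model human and animal motion separately, whereas joint cross-species modeling offers a unified representation and better generalization. X-MoGen targets the motion plausibility problems caused by morphological differences.

Result: On UniMo4D, X-MoGen outperforms existing methods on both seen and unseen species.

Insight: A unified cross-species motion generation framework improves diversity and generalization; the morphological consistency loss is key to handling morphological differences.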

Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

[32] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems cs.CV

Qi Guo, Xiaojun Jia, Shanmin Pang, Simeng Qin, Lin Wang

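TL;DR: The paper proposes PhysPatch, a physically realizable and transferable adversarial patch attack on autonomous driving systems based on multimodal large language models (MLLMs). It jointly optimizes patch location, shape, and content, significantly improving attack effectiveness and real-world applicability.

Details

Motivation: MLLMs are widely used in autonomous driving systems, but their vulnerability to adversarial attacks poses safety risks. Existing patch attacks mainly target object detection models and transfer poorly to MLLM-based systems.

Result: Experiments show that PhysPatch significantly outperforms existing methods across a range of MLLM systems, steering them toward target-aligned perception and planning outputs while keeping patches physically feasible in real scenes.

Insight: Adversarial attacks on complex models such as MLLMs need joint optimization of location, shape, and content under physical realizability constraints to be deployable in practice.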

Abstract: Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter’s complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.

[33] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering cs.CV

Zewei Wu, Longhao Wang, Cui Wang, César Teixeira, Wei Ke

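TL;DR: The paper proposes Multi-Tracklet Tracking (MTT), a framework that adaptively clusters detections into robust tracklets and integrates multi-cue association to handle unseen target categories in complex scenes.

Details

Motivation: In real-world scenes, targets from unseen categories challenge existing trackers because of low-confidence detections and weak motion or appearance constraints. MTT aims to address these problems.

Result: Competitive performance on the generic multiple object tracking benchmark.

Insight: By generating and associating tracklets, MTT adapts to complex scenes with unseen target categories and remains robust under challenges such as long-term occlusion.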

Abstract: Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.

[34] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images cs.CV

Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao

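TL;DR: SPEX is a multimodal vision-language model for extracting land cover from spectral remote sensing images. It combines spectral priors with a language model to improve performance in multispectral scenarios.

Details

Motivation: Existing vision-language models perform poorly on land cover extraction from multispectral remote sensing images, mainly because they underuse spectral information. The authors propose SPEX to address this.

Result: On five public multispectral datasets, SPEX outperforms existing methods at extracting land cover categories such as vegetation, buildings, and water bodies.

Insight: By combining spectral information with a language model, SPEX improves performance as well as interpretability and user-friendliness, opening a new direction for remote sensing image analysis.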

Abstract: Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: https://github.com/MiliLab/SPEX.
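As an illustration of turning a classical spectral index into a textual attribute, the sketch below computes NDVI, the normalized difference vegetation index, (NIR - Red) / (NIR + Red), and formats it as a short description. The abstract does not say which indices or wording SPIE uses, so the choice of NDVI, the 0.3 threshold, the zero-denominator convention, and the function names are all assumptions.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index from near-infrared and red reflectance."""
    denom = nir + red
    # Assumed convention: report 0.0 when both bands are zero.
    return 0.0 if denom == 0 else (nir - red) / denom


def describe_vegetation(nir: float, red: float, threshold: float = 0.3) -> str:
    """Turn the index into a short textual attribute (hypothetical wording and threshold)."""
    value = ndvi(nir, red)
    level = "high" if value >= threshold else "low"
    return f"NDVI={value:.2f} ({level} vegetation likelihood)"


# Hypothetical reflectance values for a vegetated pixel and a bare-soil pixel.
vegetated = describe_vegetation(nir=0.5, red=0.1)
bare_soil = describe_vegetation(nir=0.2, red=0.2)
```

Text attributes of this kind are what allow a language model to condition on spectral cues that are not visible in an RGB rendering.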

[35] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery cs.CV

Bingyu Yang, Qingyao Tian, Yimeng Geng, Huai Liao, Xinyan Huang

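TL;DR: EndoMatcher is a generalizable endoscopic image matcher that uses multi-domain pre-training to address image matching in robot-assisted surgery. It proposes a two-branch Vision Transformer and a progressive multi-objective training strategy, and introduces the new multi-domain dataset Endo-Mix6 for pre-training.

Details

Motivation: Endoscopic image matching is crucial in robot-assisted surgery, but difficult visual conditions (e.g., weak textures, large viewpoint variations) and scarce annotated data make it hard for existing methods to generalize.

Result: Compared with state-of-the-art methods, EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets, respectively, and improves Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset.

Insight: Large-scale multi-domain data and a progressive training strategy can effectively address domain shift and training imbalance in endoscopic image matching, markedly improving generalization.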

Abstract: Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.

[36] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization cs.CV

Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin

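TL;DR: VFlowOpt is a visual token pruning framework for large multimodal models (LMMs) that optimizes pruning with visual information flow guidance, substantially cutting computational cost while preserving performance.

Details

Motivation: Existing visual token pruning methods typically derive importance maps from attention scores, but their pruning strategies are overly simple and cause noticeable performance degradation. VFlowOpt therefore proposes a better importance map derivation and a progressive pruning mechanism.

Result: Experiments show that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, reducing KV-Cache memory by 89% and speeding up inference by 3.8 times.

Insight: Visual information flow is an effective signal for optimizing token pruning in multimodal models. Combining progressive pruning with a recycling mechanism balances computational cost and performance.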

Abstract: Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
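
The pruning step described in the abstract (an importance map built from attention-derived relevance and patch entropy, followed by aggregating pruned tokens into recycled tokens) can be sketched in a few lines. This is a toy illustration under stated assumptions: the linear mix controlled by `alpha`, the min-max normalization, and mean pooling into a single recycled token are hypothetical choices, and the paper's flow-guided hyperparameter optimization is not modeled.

```python
import math
from typing import List, Tuple


def patch_entropy(pixels: List[int], bins: int = 8, max_val: int = 256) -> float:
    """Shannon entropy (in bits) of a patch's intensity histogram."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // max_val, bins - 1)] += 1
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def normalize(xs: List[float]) -> List[float]:
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def prune_with_recycling(tokens: List[List[float]], attn: List[float],
                         entropy: List[float], keep_ratio: float = 0.5,
                         alpha: float = 0.5) -> Tuple[List[List[float]], List[int]]:
    """Keep the highest-scoring tokens and append one averaged 'recycled' token."""
    scores = [alpha * a + (1 - alpha) * e
              for a, e in zip(normalize(attn), normalize(entropy))]
    k = max(1, int(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx, pruned_idx = sorted(order[:k]), order[k:]
    kept = [tokens[i] for i in kept_idx]
    if pruned_idx:
        dim = len(tokens[0])
        # Assumption: pruned tokens are merged by a plain mean.
        kept.append([sum(tokens[i][d] for i in pruned_idx) / len(pruned_idx)
                     for d in range(dim)])
    return kept, kept_idx
```

Calling `prune_with_recycling` repeatedly with a shrinking `keep_ratio` would give a crude form of progressive pruning across stages.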

[37] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation cs.CV | I.2.10

Jianming Liu, Wenlong Qiu, Haitao Wei

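TL;DR: This paper proposes a source-free cross-domain few-shot segmentation method guided by textual and visual information. Using task-specific attention adapters and cross-modal alignment modules, it improves cross-domain segmentation performance.

Details

Motivation: Because of data privacy concerns and the cost of data transfer, developing cross-domain few-shot segmentation methods that do not depend on source-domain data has become essential.

Result: Under 1-shot and 5-shot settings, average segmentation accuracy improves by 2.18% and 4.11%, respectively, across four cross-domain datasets, outperforming existing methods.

Insight: Cross-modal alignment that combines textual and visual information effectively improves cross-domain few-shot segmentation, especially where data privacy is restricted.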

Abstract: Few-Shot Segmentation (FSS) aims to efficiently segment new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) has been proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain that are capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.

[38] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking cs.CV | cs.AI | cs.LG

Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen

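TL;DR: This paper proposes ReasoningTrack, a reasoning-based vision-language tracking framework built on the pre-trained vision-language model Qwen2.5-VL. It optimizes reasoning and language generation with SFT and the reinforcement learning method GRPO, improving long-term vision-language tracking performance.

Details

Motivation: Existing vision-language tracking methods neither fully exploit the strengths of large models nor explain the model's reasoning process. The authors therefore combine reasoning with language generation to improve tracking.

Result: Experiments on multiple datasets show that the proposed reasoning-based generation strategy significantly improves tracking performance.

Insight: Combining reasoning steps with language generation adapts better to target changes while exploiting the potential of large models, improving the flexibility and accuracy of tracking.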

Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify them using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking. However, these works fail to provide insights into the model’s reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning GRPO are used for the optimization of reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack

[39] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2 cs.CV | 68T45, 94A08 | I.2.10

Semanur Küçük, Cosimo Della Santina, Angeliki Laskari

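TL;DR: This paper studies segmenting irregular bubbles in two-phase flows with a fine-tuned SAM v2.1 model, achieving accurate segmentation with as few as 100 annotated images.

Details

Motivation: Bubble segmentation in multiphase flows is critical in industrial applications, but traditional methods and most learning-based methods assume near-spherical bubbles, which limits their effectiveness when bubbles deform, coalesce, or break up.

Result: The approach segments complex non-spherical bubble structures effectively, overcoming the limitations of traditional methods.

Insight: Modern vision foundation models such as SAM can achieve high-performance segmentation even with few annotated samples, offering a new approach for industrial applications.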

Abstract: Segmenting gas bubbles in multiphase flows is a critical yet unsolved challenge in numerous industrial settings, from metallurgical processing to maritime drag reduction. Traditional approaches, and most recent learning-based methods, assume near-spherical shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This complexity is particularly evident in air lubrication systems, where coalesced bubbles form amorphous and topologically diverse patches. In this work, we revisit the problem through the lens of modern vision foundation models. We cast the task as a transfer learning problem and demonstrate, for the first time, that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures using as few as 100 annotated images.

[40] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models cs.CV

Yatong Lan, Jingfeng Chen, Yiru Wang, Lei He

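TL;DR: This paper proposes Arbiviewgen, a diffusion-based framework for generating controllable camera images from arbitrary viewpoints. It addresses the difficulty of training high-fidelity generative models without ground-truth data for extrapolated views. Its core contributions are two components, FAVS and CVC-SSL.

Details

Motivation: Arbitrary-viewpoint image generation matters for autonomous driving, but the lack of ground-truth data for extrapolated views makes training high-fidelity generative models challenging.

Result: According to the authors, Arbiviewgen is the first method capable of generating controllable arbitrary-view camera images across multiple vehicle configurations, and it needs only multi-camera images and their poses for training.

Insight: With self-supervised learning and hierarchical feature matching, high-fidelity arbitrary-viewpoint image generation is possible without ground-truth data for extrapolated views.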

Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.

[41] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models cs.CV | cs.AI

Zane Xu, Jason Sun

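TL;DR: This report synthesizes eight key papers on the zero-shot adversarial robustness of vision-language models such as CLIP. It analyzes two paradigms, Adversarial Fine-Tuning (AFT) and training-free/test-time defenses, and discusses future research directions.

Details

Motivation: Vision-language models excel at zero-shot learning, but their limited adversarial robustness is a major challenge. Improving robustness while preserving generality remains an open problem.

Result: The report finds that alignment-preserving methods (TeCoA) and embedding-space re-engineering (LAAT, TIMA) are effective defense strategies, and that latent-space purification (CLIPure) also shows promise.

Insight: Future research should include hybrid methods that combine multiple defense strategies, as well as more efficient adversarial pre-training techniques, to balance robustness and zero-shot generalization.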

Abstract: This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model’s zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training.

[42] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding cs.CV | cs.AI

Tianchen Fang, Guiru Liu

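TL;DR: RegionMed-CLIP is a region-aware multimodal contrastive learning pre-training framework. By integrating localized pathological signals with global semantic representations, it addresses the scarcity of annotated data and the overreliance on global features in medical image understanding.

Details

Motivation: Medical image understanding suffers from scarce high-quality annotated data and an overreliance on global image features, which can miss subtle but clinically significant pathological regions.

Result: On image-text retrieval, zero-shot classification, and visual question answering, RegionMed-CLIP substantially outperforms existing vision-language models.

Insight: Region-aware contrastive pre-training is critical for multimodal medical image understanding, and RegionMed-CLIP provides a robust foundation for it.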

Abstract: Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.

[43] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion cs.CV | cs.AI

Xiaoyang Zhang, Zhen Hua, Yakun Ju, Wei Zhou, Jun Liu

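TL;DR: SGDFuse is a SAM-guided diffusion model for infrared and visible image fusion that uses semantic masks to guide the fusion process, improving image quality and downstream task performance.

Details

Motivation: Existing methods lack deep semantic understanding of the scene, so key targets are lost, and the fusion process tends to introduce artifacts and detail loss.

Result: SGDFuse performs strongly in both subjective and objective evaluations and adapts well to downstream tasks, improving fusion quality.

Insight: Explicit semantic directionality and coarse-to-fine denoising generation are key to improving image fusion quality.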

Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.

[44] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding cs.CV

Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim

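TL;DR: The paper proposes B4DL, a benchmark designed for 4D LiDAR, for training and evaluating multimodal large language models (MLLMs) on spatio-temporal understanding. The authors also propose a scalable data generation pipeline and the first MLLM that directly processes raw 4D LiDAR data.

Details

Motivation: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. 4D LiDAR remains underexplored in MLLMs because high-quality annotations and model architectures able to handle its high-dimensional composition are lacking.

Result: The paper provides a complete solution, including rendered 4D LiDAR videos, a dataset, and inference outputs, that supports spatio-temporal reasoning in dynamic outdoor environments.

Insight: 4D LiDAR has potential in multimodal language models, especially for understanding dynamic scenes. A dedicated benchmark and model architecture can further advance research and practical applications in this area.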

Abstract: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/

[45] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection cs.CV

Xiaoyang Zhang, Guodong Fan, Guang-Yong Chen, Zhen Hua, Jinjiang Li

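TL;DR: This paper proposes Wavelet-Guided Dual-Frequency Encoding (WGDF), a wavelet-based method for change detection in remote sensing imagery. By combining high-frequency and low-frequency features, it strengthens perception of edge details and global structure, improving detection accuracy and robustness.

Details

Motivation: Most existing change detection methods rely on spatial-domain modeling, where limited feature diversity makes subtle change regions hard to capture. The paper observes that frequency-domain feature modeling in the wavelet domain can amplify fine-grained differences in frequency components, enhancing perception of edge changes.

Result: Experiments on multiple remote sensing datasets show that WGDF significantly alleviates edge ambiguity and surpasses existing methods in detection accuracy and robustness.

Insight: Frequency-domain feature modeling, especially in the wavelet domain, can capture subtle changes that are hard to perceive in the spatial domain. Combining high- and low-frequency features improves the representation of both local details and global structure.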

Abstract: Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
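
The first step in the abstract, a DWT that splits each image into low- and high-frequency components, can be illustrated with a one-level 2D Haar transform. This is a minimal sketch assuming the Haar wavelet; the abstract does not name the wavelet family, and the band names and the `band_change` helper below are hypothetical.

```python
from typing import List, Tuple

Band = List[List[float]]


def haar_dwt2(img: List[List[float]]) -> Tuple[Band, Band, Band, Band]:
    """One-level orthonormal 2D Haar DWT over non-overlapping 2x2 blocks.

    Returns (low, row_diff, col_diff, diag): the low-frequency approximation
    and three high-frequency detail bands, each half the input size.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    low, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, h, 2):
        lo_row, rd_row, cd_row, dg_row = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            lo_row.append((a + b + d + e) / 2)  # local average (global structure)
            rd_row.append((a + b - d - e) / 2)  # top vs. bottom: horizontal edges
            cd_row.append((a - b + d - e) / 2)  # left vs. right: vertical edges
            dg_row.append((a - b - d + e) / 2)  # diagonal detail
        low.append(lo_row)
        row_diff.append(rd_row)
        col_diff.append(cd_row)
        diag.append(dg_row)
    return low, row_diff, col_diff, diag


def band_change(x: Band, y: Band) -> float:
    """Sum of absolute differences between the same sub-band of two images."""
    return sum(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))
```

Comparing the detail bands of two co-registered images with `band_change` gives a crude edge-sensitive change score, while the `low` band carries the coarse structure that the abstract assigns to the Transformer branch.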

[46] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs cs.CV | cs.CL

Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan

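TL;DR: For multimodal large language models (MLLMs) in low-resource languages, MELLA proposes a dataset construction approach that balances linguistic capability and cultural groundedness, improving model performance through a dual-source strategy.

Details

Motivation: Current MLLMs perform well in high-resource languages but poorly in low-resource ones. Existing methods are mostly limited to the text modality or machine translation, neglecting multimodal informativeness and cultural groundedness, both of which are crucial for low-resource language users.

Result: Across eight languages, MELLA improves the performance of various MLLM backbones, and the models produce richer “thick descriptions”. The gains come from both cultural knowledge enhancement and linguistic capability enhancement.

Insight: Relying solely on machine translation or the text modality cannot fully meet the needs of low-resource languages. Combining cultural groundedness with linguistic capability is key to effective MLLMs.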

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce “thin descriptions”, they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing “thick descriptions”. We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.

[47] CoCAViT: Compact Vision Transformer with Robust Global Coordination cs.CV

Xuyang Wang, Lingjuan Miao, Zhiqiang Zhou

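TL;DR: CoCAViT is a compact vision Transformer that introduces a Coordinator-patch Cross Attention (CoCA) mechanism to strengthen the generalization of small models, improving performance on OOD data.

Details

Motivation: Existing efficient small models perform poorly on out-of-distribution (OOD) data. The authors aim to improve their generalization through better architectural design.

Result: CoCAViT-28M reaches 84.0% top-1 accuracy on ImageNet-1K, attains 52.2 mAP on COCO object detection and 51.3 mIoU on ADE20K semantic segmentation, and maintains low latency.

Insight: With sound architectural design, small models can be both efficient and strongly generalizing, offering a new approach for real-time vision tasks.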

Abstract: In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224*224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIOU on ADE20K semantic segmentation, while maintaining low latency.

[48] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering cs.CV | cs.AI

Xu Yuan, Liangbo Ning, Wenqi Fan, Qing Li

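TL;DR: The paper proposes mKG-RAG, a framework that augments retrieval-augmented generation (RAG) with multimodal knowledge graphs (KGs), improving performance on knowledge-intensive visual question answering (VQA).

Details

Motivation: Existing RAG-based VQA methods rely on unstructured documents and ignore structural relationships among knowledge elements, so they often introduce irrelevant or misleading content. Multimodal KGs offer a solution.

Result: Experiments show that mKG-RAG significantly outperforms existing methods, setting a new state of the art for knowledge-based VQA.

Insight: Structured knowledge representations and efficient retrieval strategies are key to improving knowledge-intensive VQA.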

Abstract: Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.
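
The dual-stage retrieval idea in the abstract (a coarse pass for efficiency, then a question-aware pass for precision) can be sketched with a text-only toy. Everything here is an assumption for illustration: the bag-of-words cosine scorer stands in for the paper's multimodal retriever, and the `summary` and `triples` fields are a hypothetical layout for a knowledge sub-graph.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def dual_stage_retrieve(question: str, graphs: List[Dict],
                        top_graphs: int = 1, top_triples: int = 2) -> List[Triple]:
    q = bow(question)
    # Stage 1 (coarse): rank whole sub-graphs by their summary text.
    candidates = sorted(graphs, key=lambda g: cosine(q, bow(g["summary"])),
                        reverse=True)[:top_graphs]
    # Stage 2 (fine): rank individual triples from the surviving sub-graphs.
    triples = [t for g in candidates for t in g["triples"]]
    triples.sort(key=lambda t: cosine(q, bow(" ".join(t))), reverse=True)
    return triples[:top_triples]


graphs = [
    {"summary": "eiffel tower paris landmark",
     "triples": [("eiffel tower", "located in", "paris"),
                 ("eiffel tower", "designed by", "gustave eiffel")]},
    {"summary": "golden gate bridge san francisco",
     "triples": [("golden gate bridge", "located in", "san francisco")]},
]
print(dual_stage_retrieve("who designed the eiffel tower", graphs, top_triples=1))
```

The retrieved triples would then be serialized into the generator's prompt in place of raw document passages.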

[49] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting cs.CV

Frank Ruis, Gertjan Burghouts, Hugo Kuijf

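TL;DR: The paper proposes a Textual Inversion-based method for efficiently adapting open-vocabulary object detectors, avoiding the forgetting caused by conventional fine-tuning.

Details

Motivation: Although large pre-trained vision-language models (VLMs) perform well on object detection, fine-tuning for specific targets usually sacrifices the model's zero-shot capabilities and natural language querying. This paper aims to solve that problem.

Result: Experiments across a variety of quantitative and qualitative settings indicate that the method matches or outperforms baseline methods, while retaining the original model's zero-shot capabilities.

Insight: Decoupling token learning from model weight updates allows efficient adaptation of VLMs without forgetting, offering a new approach for open-vocabulary object detection.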

Abstract: Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets some form of finetuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen, retaining the original model’s benchmark performance, and leveraging its existing capabilities such as zero-shot domain transfer (e.g., detecting a sketch of an object after training only on real photos). The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluated whether the method matches or outperforms the baseline methods that suffer from forgetting in a wide variety of quantitative and qualitative experiments.
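
The core mechanism in the abstract is that only a new token embedding is optimized while the model weights stay frozen. A toy sketch of that idea follows. The frozen matrix `W`, the squared-error objective, and plain gradient descent are illustrative assumptions and do not represent the paper's detector, loss, or training setup.

```python
from typing import List


def matvec(W: List[List[float]], v: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def learn_token(W: List[List[float]], feats: List[List[float]], dim: int,
                lr: float = 0.1, steps: int = 200) -> List[float]:
    """Fit a token embedding e so the frozen map W @ e matches example features.

    Minimizes the mean of ||W e - f_i||^2 over the examples. Only e is
    updated; W is read but never modified, mirroring frozen model weights.
    """
    e = [0.0] * dim
    n = len(feats)
    for _ in range(steps):
        pred = matvec(W, e)
        # Residual averaged over the few available examples.
        resid = [sum(pred[j] - f[j] for f in feats) / n for j in range(len(pred))]
        # Gradient with respect to e is 2 * W^T * resid.
        grad = [2 * sum(W[j][i] * resid[j] for j in range(len(W)))
                for i in range(dim)]
        e = [ei - lr * gi for ei, gi in zip(e, grad)]
    return e


W = [[1.0, 0.0], [0.0, 2.0]]      # stand-in for frozen model weights
feats = [[1.0, 2.0]] * 3          # three example feature vectors
token = learn_token(W, feats, dim=2)
print([round(x, 4) for x in token])
```

Because the optimizer state and gradients cover only `dim` numbers per learned token, storage scales with the embedding size rather than the model size, which is the efficiency argument the abstract makes.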

[50] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering cs.CV

Junyu Zhou, Yuyang Huang, Wenrui Dai, Junni Zou, Ziyang Zheng

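TL;DR: 3DGabSplat proposes a novel 3D Gabor-based primitive whose multiple directional 3D frequency responses improve the representation of high-frequency scene details, while also improving rendering efficiency and memory usage.

Details

Motivation: 3D Gaussian Splatting (3DGS) achieves real-time high-fidelity rendering, but the Gaussian function is inherently a low-pass filter that struggles to capture high-frequency details, and 3DGS suffers from redundant primitives and high memory overhead.

Result: Experiments show that 3DGabSplat gains up to 1.35 dB PSNR over 3DGS while reducing the number of primitives and memory consumption, and outperforms 3DGS and its variants on both real-world and synthetic scenes.

Insight: With a frequency-adaptive mechanism and multi-directional 3D Gabor primitives, 3DGabSplat improves both high-frequency detail representation and efficiency.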

Abstract: Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.
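
To make the contrast between a Gaussian and a Gabor primitive concrete, the sketch below evaluates a textbook Gabor function: a Gaussian envelope multiplied by a cosine carrier. The isotropic envelope, the cosine-only carrier, and the weighted-sum filter bank are simplifying assumptions; the paper's actual primitive parameterization is not given in the abstract.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def gaussian3d(x: Vec3, mu: Vec3, sigma: float) -> float:
    """Isotropic (unnormalized) 3D Gaussian envelope."""
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-0.5 * d2 / sigma ** 2)


def gabor3d(x: Vec3, mu: Vec3, sigma: float, freq: Vec3) -> float:
    """Gaussian envelope modulated by a cosine wave along the vector freq."""
    phase = 2 * math.pi * sum(f * (xi - mi) for f, xi, mi in zip(freq, x, mu))
    return gaussian3d(x, mu, sigma) * math.cos(phase)


def gabor_bank(x: Vec3, mu: Vec3, sigma: float,
               freqs: Sequence[Vec3], weights: Sequence[float]) -> float:
    """Weighted sum of Gabor kernels that share one envelope (a toy filter bank)."""
    return sum(w * gabor3d(x, mu, sigma, f) for f, w in zip(freqs, weights))


mu = (0.0, 0.0, 0.0)
x = (0.5, 0.0, 0.0)
# With zero frequency the Gabor kernel reduces to the plain Gaussian.
print(gabor3d(x, mu, 1.0, (0.0, 0.0, 0.0)) == gaussian3d(x, mu, 1.0))
```

With a nonzero frequency the kernel oscillates and changes sign inside the envelope, which is what lets a single primitive represent detail that a low-pass Gaussian smooths away.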

[51] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision cs.CV | cs.CL

Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang

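TL;DR: Uni-CoT proposes a unified multimodal reasoning framework whose two-level CoT design (macro and micro) enables coherent reasoning across text and vision, substantially reduces computational overhead, and achieves leading performance on several benchmarks.

Details

Motivation: In vision-language tasks, existing CoT methods either struggle to model visual state transitions or produce incoherent visual trajectories, which limits multimodal reasoning.

Result: Uni-CoT achieves SOTA performance on the WISE, RISE, and KRIS benchmarks and shows strong generalization.

Insight: Through a unified model and structured training, Uni-CoT achieves efficient multimodal reasoning while significantly reducing computational cost.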

Abstract: Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity of modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Moreover, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on a reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/

[52] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation cs.CV | cs.AIPDF

Kang Liu, Zhuoqi Ma, Zikang Fang, Yunan Li, Kun Xie

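TL;DR: PriorRG integrates patient-specific prior knowledge through a two-stage training pipeline (contrastive pre-training and coarse-to-fine decoding) to improve the clinical accuracy and fluency of chest X-ray report generation.

Details

Motivation: Most existing methods ignore patient-specific clinical context and prior images, so they cannot capture diagnostic intent or disease progression. PriorRG aims to address this by emulating real clinical workflows.

Result: Outperforms SOTA methods on the MIMIC-CXR and MIMIC-ABN datasets, with gains of 3.6% BLEU-4 and 3.8% F1 on MIMIC-CXR and 5.9% BLEU-1 on MIMIC-ABN.

Insight: Patient prior knowledge is essential for chest X-ray report generation, and a framework that emulates the clinical workflow can effectively improve generation quality.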

Abstract: Chest X-ray report generation aims to reduce radiologists’ workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge – including clinical context (e.g., symptoms, medical history) and the most recent prior image – which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder’s hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance.

[53] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency cs.CV | cs.AI | cs.CLPDF

Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong

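TL;DR: The paper proposes test-time optimization and reinforcement learning via region consistency (GUI-RC and GUI-RCPO) to improve localization accuracy in GUI grounding without additional annotated data.

Details

Motivation: Existing GUI grounding methods depend on large amounts of annotated data or on reinforcement learning with labeled rewards, which is costly and limits availability. The authors observe that the spatial overlap among a model's multiple predictions for the same GUI element carries an implicit confidence signal that can guide more accurate localization.

Result: GUI-RC improves accuracy by 2-3% on the ScreenSpot benchmarks; on ScreenSpot-v2, GUI-RCPO raises Qwen2.5-VL-3B-Instruct further to 85.14%.

Insight: Test-time scaling and self-supervised test-time reinforcement learning are underexplored directions for GUI grounding and can significantly reduce reliance on annotated data.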

Abstract: Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
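The spatial voting idea behind GUI-RC can be illustrated with a small sketch. This is not the authors' implementation: the function name `consensus_region`, the integer pixel grid, the sample boxes, and the choice to return the bounding box of the top-voted cells are all assumptions made for illustration. Each sampled prediction is treated as a half-open box `(x1, y1, x2, y2)` that adds one vote to every grid cell it covers.

```python
def consensus_region(boxes, width, height):
    """Accumulate votes on a pixel grid and return the region of highest agreement.

    boxes: list of (x1, y1, x2, y2) predictions, half-open on the right/bottom.
    Returns (bounding box of the top-voted cells, vote count), or (None, 0)
    when no box covers any cell. Simplification: the top-voted cells are
    summarized by one bounding box even if they are not contiguous.
    """
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                grid[y][x] += 1  # one vote per prediction per covered cell

    top = max(max(row) for row in grid)
    if top == 0:
        return None, 0

    cells = [(x, y) for y in range(height) for x in range(width) if grid[y][x] == top]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1), top


# Hypothetical samples: three overlapping predictions and one outlier.
samples = [(2, 2, 8, 6), (3, 1, 9, 5), (4, 2, 10, 7), (20, 20, 24, 24)]
region, votes = consensus_region(samples, width=32, height=32)
```

With these samples the outlier receives a single vote and is ignored, while the three overlapping boxes agree on a small shared region. A consistency reward in the spirit of GUI-RCPO could then score each prediction by how well it overlaps that region, although the abstract does not specify the exact reward.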

[54] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation cs.CVPDF

Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold

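TL;DR: Proposes CT-GRAPH, a hierarchical graph attention network for anatomy-guided CT report generation that improves report accuracy by modeling organ relationships and substantially outperforms existing methods.

Details

Motivation: Medical imaging is central to diagnosis, but existing methods fail to capture fine-grained organ relationships, which limits the accuracy of generated reports.

Result: On the large-scale chest CT dataset CT-RATE, achieves an absolute F1 improvement of 7.9% over existing methods.

Insight: Explicitly modeling organ relationships and anatomical knowledge is key to improving medical report generation; the work also demonstrates the effectiveness of pretrained feature encoders.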

Abstract: As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial absolute improvement of 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.

[55] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation cs.CV | cs.AI | cs.LGPDF

Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh

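TL;DR: The paper proposes UNCAGE, a training-free method that uses contrastive attention guidance to improve text-to-image generation with Masked Generative Transformers, raising compositional fidelity and text-image alignment.

Details

Motivation: Existing text-to-image techniques such as Diffusion Models and Autoregressive Models have limitations in compositional generation, and Masked Generative Transformers, an emerging alternative, have not been studied for this problem. The authors propose UNCAGE to address this challenge.

Result: UNCAGE consistently improves performance across multiple benchmarks and metrics with negligible inference overhead.

Insight: Attention maps can guide the model to focus on key objects, improving the quality of compositional generation.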

Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.

[56] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization cs.CV | cs.SD | eess.ASPDF

Farah Wahida, M. A. P. Chamikara, Yashothara Shanmugarasa, Mohan Baruwal Chhetri, Thilina Ranbaduge

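TL;DR: The paper proposes TrueBiometric, a new method that makes face recognition resilient to backdoor attacks through vision-language trigger detection and noise-based neutralization.

Details

Motivation: Existing backdoor defenses struggle to precisely identify and mitigate poisoned images and may compromise data utility. TrueBiometric aims to address this.

Result: Experiments show that TrueBiometric detects and corrects poisoned images with 100% accuracy without reducing recognition accuracy on clean images.

Insight: Combining vision-language models with noise-based neutralization offers an efficient and practical defense against backdoor attacks.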

Abstract: Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.
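The detection step described above aggregates verdicts from several vision-language models by majority vote. The sketch below shows only that aggregation, under stated assumptions: the function name `majority_vote`, the label strings, and the `"undecided"` fallback for ties are hypothetical and are not taken from the paper, and no real model is called.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the label chosen by a strict majority of detectors.

    verdicts: list of labels, one per detector (e.g. "poisoned" or "clean").
    Returns "undecided" when no label has more than half of the votes.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else "undecided"


# Hypothetical verdicts from three independent detectors for one image.
decision = majority_vote(["poisoned", "clean", "poisoned"])
```

Using an odd number of detectors with two labels avoids ties entirely; the fallback only matters for even-sized ensembles or more than two labels.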

[57] Smoothing Slot Attention Iterations and Recurrences cs.CVPDF

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

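TL;DR: The paper proposes SmoothSA, which improves Slot Attention (SA) across iterations and recurrences by preheating the initial queries and treating first and non-first video frames differently.

Details

Motivation: In existing Slot Attention (SA), the cold-start queries on the first frame lack sample-specific information, which hurts aggregation precision, while queries on subsequent frames are already sample-specific and therefore need a different transform from the first frame.

Result: The method's effectiveness is validated on object discovery, recognition, and downstream tasks.

Insight: Preheating and differentiated processing are key to improving SA on images and videos.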

Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame’s slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or the video’s first frame; also, non-first frames’ queries are already sample-specific and thus require transforms different from the first frame’s aggregation. We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or the video’s first frame, we preheat the cold-start queries with rich information from the input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method’s effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our code is available in the supplement.

[58] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions cs.CV | cs.AI | cs.LGPDF

Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier

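TL;DR: The paper proposes FIxLIP, a game-theoretic method for explaining similarity in vision-language encoders; it captures cross-modal interactions with the weighted Banzhaf interaction index and outperforms conventional first-order saliency methods.

Details

Motivation: Existing first-order saliency methods cannot adequately capture the complex cross-modal interactions in vision-language encoders, so higher-order explanation methods are needed.

Result: On MS COCO and ImageNet-1k, second-order methods such as FIxLIP outperform first-order attribution methods, and FIxLIP supports comparisons between models such as CLIP and SigLIP-2.

Insight: 1. Game theory provides a flexible and efficient tool for analyzing multimodal interactions; 2. Higher-order interaction explanations are essential for understanding complex models.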

Abstract: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model’s similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
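To make the second-order idea concrete, the following sketch computes a pairwise weighted Banzhaf interaction exactly on a toy game. It uses the standard discrete-derivative form, where each coalition S of the remaining players is weighted by p^|S| (1-p)^(m-|S|) and p = 0.5 recovers the ordinary Banzhaf interaction index. The toy value function and the token names are invented for illustration, and FIxLIP's actual estimator for real encoders may differ, since exact enumeration is only feasible for a handful of players.

```python
from itertools import combinations


def weighted_banzhaf_interaction(value, players, i, j, p=0.5):
    """Exact pairwise weighted Banzhaf interaction between players i and j.

    value: function mapping a frozenset of players to a real number.
    p: probability that each remaining player joins the coalition.
    """
    rest = [k for k in players if k not in (i, j)]
    m = len(rest)
    total = 0.0
    for size in range(m + 1):
        weight = p ** size * (1 - p) ** (m - size)
        for subset in combinations(rest, size):
            s = frozenset(subset)
            # Second-order discrete derivative of the game at coalition s.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_similarity(coalition):
    """Hypothetical similarity: the dog patch and the word 'dog' only score together."""
    score = 0.0
    if "img_dog" in coalition and "txt_dog" in coalition:
        score += 1.0
    if "img_sky" in coalition:
        score += 0.1  # purely additive contribution, no interaction
    return score


tokens = ["img_dog", "txt_dog", "img_sky"]
cross_modal = weighted_banzhaf_interaction(toy_similarity, tokens, "img_dog", "txt_dog", p=0.3)
unrelated = weighted_banzhaf_interaction(toy_similarity, tokens, "img_dog", "img_sky", p=0.3)
```

In this toy game the image patch and the matching word receive a large interaction, while the additive sky patch receives essentially none, which is the kind of cross-modal structure a first-order saliency map cannot express.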

[59] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery cs.CV | cs.SY | eess.IV | eess.SYPDF

Lumin Chen, Zhiying Wu, Tianye Lei, Xuexue Bai, Ming Feng

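TL;DR: F2PASeg proposes a feature fusion method for pituitary anatomy segmentation that combines high-resolution image features with deep semantic embeddings to improve segmentation robustness in surgical scenes.

Details

Motivation: Segmenting anatomical structures is critical to the safety of pituitary surgery, but pixel-level annotated video datasets of pituitary surgery are scarce, and intraoperative occlusions, camera motion, and bleeding cause inconsistent feature representations.

Result: Experiments show that F2PASeg segments critical anatomical structures in real time, making it suitable for intraoperative surgical planning.

Insight: Feature fusion and data augmentation are effective for intraoperative segmentation.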

Abstract: Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.

[60] SMOL-MapSeg: Show Me One Label cs.CVPDF

Yunshuang Yuan, Frank Thiemann, Thorsten Dahms, Monika Sester

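TL;DR: SMOL-MapSeg uses On-Need Declarative (OND) knowledge-based prompting to accurately segment semantic classes in historical maps, supports few-shot fine-tuning to adapt to new classes, and outperforms a UNet baseline.

Details

Motivation: Semantic segmentation of historical maps is challenging because pre-trained foundation models rely on predefined concepts from modern or domain-specific images, whereas patterns in historical maps lack such consistency.

Result: SMOL-MapSeg outperforms a UNet-based baseline in average segmentation performance and adapts to unseen classes through few-shot fine-tuning.

Insight: Declarative knowledge prompting gives models flexible adaptability in domains that lack consistency, such as historical maps, extending the reach of foundation models.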

Abstract: Historical maps are valuable for studying changes to the Earth’s surface. With the rise of deep learning, models like UNet have been used to extract information from these maps through semantic segmentation. Recently, pre-trained foundation models have shown strong performance across domains such as autonomous driving, medical imaging, and industrial inspection. However, they struggle with historical maps. These models are trained on modern or domain-specific images, where patterns can be tied to predefined concepts through common sense or expert knowledge. Historical maps lack such consistency – similar concepts can appear in vastly different shapes and styles. To address this, we propose On-Need Declarative (OND) knowledge-based prompting, which introduces explicit prompts to guide the model on what patterns correspond to which concepts. This allows users to specify the target concept and pattern during inference (on-need inference). We implement this by replacing the prompt encoder of the foundation model SAM with our OND prompting mechanism and fine-tune it on historical maps. The resulting model is called SMOL-MapSeg (Show Me One Label). Experiments show that SMOL-MapSeg can accurately segment classes defined by OND knowledge. It can also adapt to unseen classes through few-shot fine-tuning. Additionally, it outperforms a UNet-based baseline in average segmentation performance.

[61] Symmetry Understanding of 3D Shapes via Chirality Disentanglement cs.CVPDF

Weikang Wang, Tobias Weißberg, Nafie El Amrani, Florian Bernard

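TL;DR: The paper proposes an unsupervised chirality feature extraction method that distinguishes left and right symmetric parts in 3D shape analysis; by drawing on 2D foundation models, it improves tasks such as shape matching and segmentation.

Details

Motivation: Chirality information (information that distinguishes left from right) is ubiquitous in computer vision but understudied in 3D shape analysis. Existing shape descriptors cannot tell left and right symmetric parts apart, so chirality-aware feature extraction is needed.

Result: Quantitative and qualitative experiments on multiple datasets, with downstream tasks including left-right disentanglement, shape matching, and part segmentation, show that the extracted chirality features are effective and practically useful.

Insight: 2D foundation models can supply valuable chirality information for 3D shape analysis, and this cross-modal feature extraction offers a new approach to symmetry understanding.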

Abstract: Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/

[62] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips cs.CVPDF

Shibo Wang, Haonan He, Maria Parelli, Christoph Gebhardt, Zicong Fan

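TL;DR: MagicHOI proposes a new method for reconstructing hand-object interactions from short monocular video clips, using 3D priors to handle partially unobserved objects and significantly improving reconstruction quality.

Details

Motivation: Existing methods rely on object templates or assume the object is fully visible, but in real settings fixed viewpoints and static grips leave parts of the object unobserved, leading to inaccurate reconstructions.

Result: MagicHOI significantly outperforms existing methods, and novel view synthesis diffusion priors effectively regularize unseen regions.

Insight: When paired 3D data is scarce, using generative models as priors can effectively improve reconstruction.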

Abstract: Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.

[63] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events cs.CVPDF

Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang

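TL;DR: This paper proposes a physics-inspired self-supervised pre-training framework for sparse and noisy event camera data that reveals latent information in three stages and performs well on multiple downstream tasks.

Details

Motivation: The inherent sparsity and noise of event camera data limit effective feature extraction, motivating a self-supervised pre-training method that makes full use of such data.

Result: The framework outperforms existing methods on object recognition, semantic segmentation, and optical flow estimation, showing robustness and consistency.

Insight: Through its physics-inspired design, the framework extracts more useful information from sparse and noisy event data, suggesting new directions for event camera applications.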

Abstract: The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone, preserving representations learned from masked modeling and stabilizing their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.

[64] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment cs.CVPDF

Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsiferova

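TL;DR: The paper proposes a new certified defense for image quality assessment (IQA) that adds noise in the feature space rather than the input space, preserving image fidelity while providing robustness guarantees.

Details

Motivation: Existing methods typically inject Gaussian noise directly into the input image, which degrades image quality and lacks flexibility. This work aims to certify the robustness of IQA models while preserving image quality.

Result: Validated on benchmark datasets: compared with previous methods, inference time drops by 99.5% without certification and by 20.6% with certification, and correlation with subjective quality scores improves by up to 30.9%.

Insight: Adding noise in feature space is an efficient, fidelity-preserving defense that applies to a range of IQA tasks and clearly outperforms existing methods both with and without certification.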

Abstract: We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network’s Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
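The link between feature-space and input-space perturbations can be sketched with a standard Lipschitz argument: if the backbone's Jacobian has largest singular value at most σ_max, an input perturbation of norm ε moves the features by at most σ_max · ε, so a feature-space radius R corresponds to an input-space radius of R / σ_max. The sketch below is an illustration of that argument, not the paper's certification procedure. The helper names, the 2×2 example Jacobian, and the radius value are hypothetical, and a real backbone would need a bound that holds over the whole input region rather than at a single point.

```python
import math
import random


def largest_singular_value(J, iters=200, seed=0):
    """Estimate the largest singular value of matrix J by power iteration on J^T J."""
    rows, cols = len(J), len(J[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]  # random start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    sigma = 0.0
    for _ in range(iters):
        u = [sum(J[r][c] * v[c] for c in range(cols)) for r in range(rows)]  # J v
        w = [sum(J[r][c] * u[r] for r in range(rows)) for c in range(cols)]  # J^T (J v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # J maps the current direction to zero (e.g. zero matrix)
        sigma = math.sqrt(norm)  # ||J^T J v|| approaches sigma_max^2 for unit v
        v = [x / norm for x in w]
    return sigma


def input_radius(feature_radius, sigma_max):
    """Convert a feature-space radius into an input-space radius via the bound above."""
    if sigma_max <= 0:
        raise ValueError("sigma_max must be positive")
    return feature_radius / sigma_max


# Hypothetical Jacobian and feature-space radius, for illustration only.
jacobian = [[3.0, 0.0], [0.0, 1.0]]
sigma_max = largest_singular_value(jacobian)
eps = input_radius(0.6, sigma_max)
```

A larger σ_max shrinks the input-space radius that a given feature-space guarantee can cover, which is why the singular value analysis matters for the certificate.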

[65] Optimal Brain Connection: Towards Efficient Structural Pruning cs.CVPDF

Shaowu Chen, Wei Ma, Binhua Huang, Qingyuan Wang, Guoxin Wang

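TL;DR: This paper proposes Optimal Brain Connection, a structural pruning framework that evaluates parameter saliency with a Jacobian Criterion and uses an Equivalent Pruning mechanism to retain the contributions of the original connections, improving the performance of pruned models.

Details

Motivation: Most existing structural pruning methods neglect interconnections among parameters, which limits pruning effectiveness. This paper aims to improve pruning efficiency and model performance by accounting for intra-component interactions and inter-layer dependencies.

Result: Experiments show that the Jacobian Criterion outperforms several popular metrics in preserving model performance, and the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning.

Insight: Interactions among parameters are critical to structural pruning, and retaining the contributions of pruned connections can improve the robustness and performance of the pruned model.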

Abstract: Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections, including pruned ones, during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection

[66] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework cs.CVPDF

Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han

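TL;DR: This paper proposes SSTGNN, a lightweight framework that uses a graph neural network to jointly analyze spatial, spectral, and temporal information in videos, significantly improving deepfake detection performance and generalization.

Details

Motivation: With the spread of generative video models, detecting AI-generated and manipulated videos is increasingly urgent, yet existing methods often generalize poorly because they rely on a single type of information or require large models.

Result: Performs strongly on multiple benchmark datasets with up to 42.4× fewer parameters than state-of-the-art models, combining a lightweight design with strong generalization.

Insight: Structured graph representations capture the multi-dimensional traces of video manipulation more comprehensively, offering a new approach to deepfake detection.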

Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.

[67] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety cs.CV | I.2.10; I.2.7; H.3.3; H.4.3; K.4.1PDF

Adi Levi, Or Levi, Sardhendu Mishra, Jonathan Morra

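TL;DR: This paper benchmarks multimodal large language models (MLLMs) for brand safety content moderation and, by comparing them with human moderators, shows both their potential and their limitations.

Details

Motivation: With the explosive growth of online video, the efficiency and mental health costs of manual moderation are increasingly pressing, and automated solutions are needed. Multimodal content moderation requires nuanced understanding of both visual and textual cues, but it remains underexplored.

Result: The study demonstrates the effectiveness of MLLMs such as Gemini, GPT, and Llama for brand safety classification, evaluates their accuracy and cost efficiency against professional human reviewers, and documents their limitations and failure cases.

Insight: MLLMs show potential for multimodal content moderation but still need better understanding of complex or ambiguous content; releasing the dataset helps advance effective and responsible brand safety research.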

Abstract: As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safeguarding advertising integrity. To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers across a multitude of risk categories. Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.

[68] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis cs.CVPDF

Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo

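TL;DR: This paper proposes Follow-Your-Instruction, a multimodal large language model (MLLM)-driven framework that automatically synthesizes high-quality 2D, 3D, and 4D data, addressing the high cost and poor scalability of conventional data collection.

Details

Motivation: As demand for AI-generated content (AIGC) grows, high-quality, diverse, and scalable data becomes crucial. Collecting large-scale real-world data, however, is costly and time-consuming, which hinders downstream applications.

Result: Experiments show that the synthetic data significantly improves the performance of baseline models, demonstrating the framework's potential as a scalable data engine for generative intelligence.

Insight: Combining multimodal large language models with vision-language models enables automated data synthesis and provides an efficient, scalable solution for generative tasks.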

Abstract: With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction’s potential as a scalable and effective data engine for generative intelligence.

Haijing Liu, Tao Pu, Hefeng Wu, Keze Wang, Liang Lin

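TL;DR: This paper proposes DART, a Dual Adaptive Refinement Transfer framework for open-vocabulary multi-label recognition. DART enhances a frozen vision-language pre-training model with two synergistic adaptive modules, addressing fine-grained localization and the modeling of category dependencies.

Details

Motivation: Open-vocabulary multi-label recognition requires precise intra-class localization under weak supervision and the capture of complex inter-class dependencies. Existing vision-language pre-training models are limited in both respects.

Result: Experiments show that DART achieves new state-of-the-art performance on multiple open-vocabulary multi-label recognition benchmarks.

Insight: Combining structured knowledge from an external LLM with the capabilities of vision-language pre-training models can significantly improve open-vocabulary performance, particularly for inter-class dependencies and fine-grained localization.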

Abstract: Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.

[70] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction cs.CVPDF

Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian

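TL;DR: WeTok is an efficient visual tokenizer that achieves high-fidelity visual reconstruction through Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD), significantly outperforming existing tokenizers.

Details

Motivation: Existing visual tokenizers strike an unsatisfactory trade-off between compression ratio and reconstruction fidelity, which limits their performance at high compression ratios.

Result: On the ImageNet 50k validation set, WeTok achieves the lowest zero-shot rFID (0.12); its highest-compression model (compression ratio 768) reaches an rFID of 3.49, outperforming existing methods.

Insight: Combining lookup-free quantization with generative decoding is key to improving visual reconstruction quality at high compression ratios.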

Abstract: Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos (384) 4.57 which has only 50% compression rate of ours. Code and models are available: https://github.com/zhuangshaobin/WeTok.
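
The grouping step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the quantizer, so this example assumes the sign-based binarization commonly used for lookup-free quantization; the function names (`lookup_free_quantize`, `group_index`, `groupwise_lfq`) are hypothetical and not taken from the WeTok code.

```python
from typing import List, Tuple


def lookup_free_quantize(group: List[float]) -> List[int]:
    """Binarize each channel by sign (assumed quantizer); no codebook lookup is needed."""
    return [1 if x > 0 else -1 for x in group]


def group_index(bits: List[int]) -> int:
    """Map a {-1, +1} vector to an integer token id in [0, 2**len(bits))."""
    idx = 0
    for b in bits:
        idx = (idx << 1) | (1 if b > 0 else 0)
    return idx


def groupwise_lfq(latent: List[float], num_groups: int) -> Tuple[List[int], List[int]]:
    """Split the latent channels into equal groups and quantize each group independently."""
    assert len(latent) % num_groups == 0, "channels must divide evenly into groups"
    size = len(latent) // num_groups
    quantized: List[int] = []
    token_ids: List[int] = []
    for g in range(num_groups):
        bits = lookup_free_quantize(latent[g * size:(g + 1) * size])
        quantized.extend(bits)
        token_ids.append(group_index(bits))
    return quantized, token_ids


if __name__ == "__main__":
    # Hypothetical 8-channel latent split into 2 groups of 4 channels.
    print(groupwise_lfq([0.3, -1.2, 0.8, 0.1, -0.5, -0.2, 0.9, -0.7], 2))
```

Under this assumption, each group of d channels indexes an implicit codebook of 2^d entries, so the overall vocabulary grows with the number of groups without any single large codebook being stored or searched.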

[71] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model cs.CVPDF

Tao Sun, Oliver Liu, JinJin Li, Lan Ma

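TL;DR: LLaVA-RE is a binary image-text relevancy evaluation framework built on a Multimodal Large Language Model (MLLM). It relies on detailed task instructions and multimodal in-context samples, and its effectiveness is validated on a newly proposed dataset.

Details

Motivation: Image-text relevancy evaluation is a fundamental problem in multimodal generative AI, but existing methods struggle with diverse text formats and scenario-dependent definitions of relevancy. MLLMs are a natural fit because of their flexibility and task adaptability.

Result: Experiments validate the effectiveness of LLaVA-RE and show the advantages of MLLMs for relevancy evaluation.

Insight: MLLMs can flexibly handle complex text formats and scenario-specific relevancy through task instructions and in-context samples, opening a new direction for relevancy evaluation.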

Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., “Relevant” vs. “Not Relevant”, is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt for binary image-text relevancy evaluation with MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.

[72] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity cs.CVPDF

Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li

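TL;DR: Hi3DEval is a hierarchical evaluation framework for generated 3D content that combines object-level and part-level evaluation to enable more comprehensive analysis of 3D asset quality.

Details

Motivation: Quality assessment of generated 3D content relies mainly on image-based metrics and is limited to the object level, so it cannot capture spatial coherence, material authenticity, or high-fidelity local details.

Result: Experiments show that Hi3DEval outperforms existing image-based metrics in modeling 3D characteristics and aligning with human preference, providing a scalable alternative to manual evaluation.

Insight: Hierarchical evaluation and automated scoring tools can substantially improve quality assessment of generated 3D content, especially for material realism and local detail.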

Abstract: Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.

[73] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes cs.CVPDF

Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang

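TL;DR: MOSEv2 is a more challenging video object segmentation dataset designed for complex scenes. It covers a wide range of real-world challenges and substantially lowers the performance of existing methods.

Details

Motivation: Existing video object segmentation (VOS) datasets such as DAVIS and YouTube-VOS mainly contain salient, dominant, and isolated objects and do not reflect the complexity of real scenes. MOSEv2 is proposed to advance VOS research in more realistic environments.

Result: Existing methods degrade markedly on MOSEv2 (for example, SAM2 drops from 76.4% on MOSEv1 to 50.9%), indicating that they struggle with real-world complexity.

Insight: Current VOS methods perform well on existing datasets but leave considerable room for improvement in complex scenes; MOSEv2 provides an important benchmark for future research.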

Abstract: Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.

[74] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing cs.CVPDF

Mohammed Talha Alam, Fahad Shamshad, Fakhri Karray, Karthik Nandakumar

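TL;DR: FaceAnonyMixer is a cancelable face generation framework that mixes the latent code of a real face with a synthetic code in the latent space of a pre-trained generative model, protecting privacy while preserving recognition performance.

Details

Motivation: Advances in face recognition have heightened privacy concerns. Existing face anonymization methods often fail to meet the requirements of biometric template protection (revocability, unlinkability, and irreversibility), so a new solution is needed.

Result: On benchmark datasets, FaceAnonyMixer delivers higher recognition accuracy than recent cancelable biometric methods, with a gain of over 11% on a commercial API.

Insight: Latent-space mixing combined with a carefully designed loss can preserve the utility of face recognition while protecting privacy, offering a new technical path for biometric template protection.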

Abstract: Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on commercial API compared to recent cancelable biometric methods. Code is available at: https://github.com/talha-alam/faceanonymixer.

cs.CL [Back]

[75] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM cs.CL | cs.AI | cs.SD | eess.ASPDF

Thomas Thebaud, Yen-Ju Lu, Matthew Wiesner, Peter Viechnicki, Najim Dehak

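TL;DR: This paper explores enriching dialogue annotation with a frozen Large Language Model (LLM) by adding metadata tags for speaker characteristics such as age, gender, and emotion. The approach couples a frozen audio foundation model (such as Whisper or WavLM) with a frozen LLAMA language model and infers these attributes without task-specific fine-tuning.

Details

Motivation: In post-processing of dialogue transcripts, existing methods focus on improving grammar and readability. The paper proposes a complementary step: further enriching transcripts with speaker-characteristic tags.

Result: The approach achieves competitive performance on speaker profiling tasks while preserving modularity and speed; a frozen LLAMA model comparing x-vectors directly reaches an Equal Error Rate of 8.8% in some scenarios.

Insight: Frozen models bridged by lightweight connectors can handle multimodal tasks efficiently, showing the potential of extending LLMs to dialogue annotation.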

Abstract: In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
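
For readers unfamiliar with the Equal Error Rate quoted above, the following minimal sketch shows how the metric is commonly computed from verification scores: it is the operating point where the false-acceptance rate and the false-rejection rate coincide. The scores and the function name `equal_error_rate` are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def equal_error_rate(genuine: List[float], impostor: List[float]) -> float:
    """Approximate the EER by sweeping thresholds over all observed scores.

    genuine: similarity scores for same-speaker pairs.
    impostor: similarity scores for different-speaker pairs.
    """
    best_gap = None
    best_eer = 0.0
    for t in sorted(set(genuine) | set(impostor)):
        # False acceptance: impostor pair scored at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection: genuine pair scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Hypothetical scores for four same-speaker and four different-speaker pairs.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch reports their average at the threshold where they are closest.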

[76] Pitch Accent Detection improves Pretrained Automatic Speech Recognition cs.CL | cs.SD | eess.ASPDF

David Sasu, Natalie Schluter

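TL;DR: The paper shows that ASR systems built on semi-supervised speech representations can be improved by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component closes the gap in F1-score by 41%, and under limited-resource fine-tuning, joint training reduces WER by 28.3%.

Details

Motivation: Current pretrained ASR systems may discard important prosodic cues such as pitch accent, which hurts performance. The paper aims to show, through joint training of ASR and pitch accent detection, the importance of retaining or re-learning such cues.

Result: 1. Pitch accent detection improves significantly, closing the gap in F1-score by 41%. 2. On LibriSpeech, ASR WER decreases by 28.3% under limited-resource fine-tuning.

Insight: 1. Prosodic cues such as pitch accent matter for ASR performance. 2. Joint training can use an auxiliary task to improve the main task. 3. The approach is especially effective when resources are limited.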

Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, the ASR performance in joint training decreases WER by 28.3% on LibriSpeech, under limited resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.

[77] Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History cs.CL | cs.AIPDF

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, Mahmood Hegazy, Alberto Tosato

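TL;DR: This paper presents PERSIST, a framework for evaluating the stability of personality measurements in large language models. It finds that even large models show substantial response inconsistency and that conventional interventions can increase instability, suggesting a fundamental limitation in the behavioral consistency of current LLMs.

Details

Motivation: Safe deployment of LLMs requires consistent behavioral patterns, yet their personality-like traits remain poorly understood. The paper aims to expose the stability problems of LLM personality measurements through systematic evaluation.

Result: 1. Even 400B+ models exhibit substantial response variability (SD > 0.4). 2. Minor reordering of questions shifts personality measurements by up to 20%. 3. Conventional interventions can increase instability. 4. LLM-adapted instruments are as unstable as human-centric ones.

Insight: Current LLMs have fundamental limits on behavioral consistency, and personality-based alignment strategies may be inadequate for safety-critical applications.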

Abstract: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.

[78] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory cs.CL | cs.AI | cs.MAPDF

Jun Liu, Zhenglun Kong, Changdi Yang, Fan Yang, Tianqi Li

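TL;DR: RCR-Router is a modular, role-aware context routing framework for multi-agent LLM systems. It dynamically selects semantically relevant memory subsets to reduce token consumption and iteratively refines a shared memory to make collaboration more efficient.

Details

Motivation: Existing coordination schemes for multi-agent LLM systems mostly rely on static or full-context routing, which leads to high token consumption, redundant memory exposure, and limited adaptability across interaction rounds. RCR-Router targets these problems.

Result: On benchmarks such as HotPotQA, RCR-Router reduces token usage by up to 30% while maintaining or improving answer quality.

Insight: Structured memory routing and output-aware evaluation are important for advancing scalable multi-agent LLM systems.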

Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks – HotPotQA, MuSiQue, and 2WikiMultihop – demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
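
The idea of selecting memory per agent under a strict token budget can be sketched as a greedy selection over scored items. This is only an illustration under stated assumptions: the abstract does not give the scoring policy, so the `score` function, its weights, and the `MemoryItem` fields below are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MemoryItem:
    text: str
    tokens: int                 # token cost of including this item in the prompt
    role_tags: FrozenSet[str]   # agent roles the item is relevant to (assumed field)
    stage: str                  # task stage that produced the item (assumed field)


def score(item: MemoryItem, role: str, stage: str) -> float:
    """Hypothetical lightweight scoring policy: role match plus stage match."""
    s = 0.0
    if role in item.role_tags:
        s += 1.0
    if item.stage == stage:
        s += 0.5
    return s


def route_context(memory: List[MemoryItem], role: str, stage: str, budget: int) -> List[MemoryItem]:
    """Greedily pick the highest-scoring relevant items that fit within the token budget."""
    ranked = sorted(memory, key=lambda m: score(m, role, stage), reverse=True)
    selected: List[MemoryItem] = []
    used = 0
    for item in ranked:
        if score(item, role, stage) <= 0:
            continue  # irrelevant to this agent at this stage
        if used + item.tokens <= budget:
            selected.append(item)
            used += item.tokens
    return selected


if __name__ == "__main__":
    memory = [
        MemoryItem("plan", 40, frozenset({"planner"}), "plan"),
        MemoryItem("evidence A", 60, frozenset({"searcher", "planner"}), "search"),
        MemoryItem("chit-chat", 30, frozenset(), "other"),
    ]
    print([m.text for m in route_context(memory, "planner", "search", 100)])
```

In a full system, agent outputs would be written back into the shared memory after each round, so the next routing call sees a refined store; that loop is omitted here.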

[79] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering cs.CL | cs.AI | cs.CVPDF

Louie Hong Yao, Nicholas Jarvis, Tianyu Jiang

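TL;DR: The paper proposes resolving verb ambiguity in the evaluation of visual activity recognition through verb sense clustering, improving on conventional exact-match evaluation.

Details

Motivation: Because verb semantics are ambiguous and images admit multiple interpretations, conventional evaluation of visual activity recognition, which relies on a single gold answer, cannot fully reflect model performance. A more robust evaluation is needed.

Result: On the imSitu dataset, each image maps to an average of 2.8 sense clusters, and cluster-based evaluation aligns better with human judgments.

Insight: Cluster-based evaluation better matches how humans read multiple valid interpretations into an image and offers a more nuanced view of model performance.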

Abstract: Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
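
The contrast between exact-match and cluster-based scoring can be shown with a small sketch. The verb-to-cluster mapping and cluster names below are invented for illustration, and the paper's actual clustering procedure is not reproduced here.

```python
from typing import Dict, Set


def exact_match(pred: str, gold: str) -> bool:
    """Standard evaluation: the prediction must equal the single gold verb."""
    return pred == gold


def cluster_match(pred: str, gold_clusters: Set[str], verb_to_cluster: Dict[str, str]) -> bool:
    """Cluster-based evaluation: the prediction counts if its sense cluster
    is one of the clusters judged valid for the image."""
    cluster = verb_to_cluster.get(pred)
    return cluster is not None and cluster in gold_clusters


if __name__ == "__main__":
    # Hypothetical sense clusters built from the abstract's own examples.
    verb_to_cluster = {
        "brushing": "grooming-sense",
        "grooming": "grooming-sense",
        "piloting": "flying-sense",
        "operating": "operating-sense",
    }
    image_clusters = {"grooming-sense"}
    print(exact_match("grooming", "brushing"))                        # False
    print(cluster_match("grooming", image_clusters, verb_to_cluster)) # True
```

An image with several valid perspectives would simply carry more than one entry in its cluster set, which is how distinct but equally valid verbs such as piloting and operating can both be accepted.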

Song Wang, Yishu Wei, Haotian Ma, Max Lovitt, Kelly Deng

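TL;DR: The paper proposes a multi-stage large language model framework for extracting suicide-related social determinants of health (SDoH) from unstructured text. It outperforms other state-of-the-art models (such as BioBERT and GPT-3.5-turbo), and its explanation-oriented design improves explainability and practical usefulness.

Details

Motivation: Existing data-driven approaches to extracting suicide-related SDoH face long-tailed factor distributions, difficulty identifying pivotal stressors, and limited model explainability.

Result: Experiments show that the framework performs better on both SDoH factor extraction and relevant-context retrieval, and that its explanations improve annotation speed and accuracy.

Insight: 1) The multi-stage design markedly improves fine-grained handling of the task; 2) model explainability is an important part of practical deployment; 3) smaller task-specific models can balance performance and cost.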

Abstract: Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model’s explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.

[81] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning cs.CL | cs.AIPDF

Kun Peng, Cong Cao, Hao Peng, Zhifeng Hao, Lei Jiang

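TL;DR: The paper proposes partitioning dialogues via structural entropy minimization for target-aspect-opinion-sentiment quadruple extraction, addressing the noise that existing methods incur on multi-round, multi-participant dialogues.

Details

Motivation: Existing dialogue sentiment quadruple extraction methods assume that sentiment elements are uniformly distributed across a dialogue. In practice, a dialogue contains multiple semantically independent sub-dialogues, and learning word relations across the whole dialogue introduces noise.

Result: Experiments show that the method achieves state-of-the-art performance on DiaASQ at much lower computational cost.

Insight: Semantic partitioning of dialogues and a two-step framework effectively reduce noise and improve the accuracy and efficiency of sentiment quadruple extraction.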

Abstract: Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.

[82] Evaluation of LLMs in AMR Parsing cs.CL | cs.AIPDF

Shu Han Ho

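TL;DR: The paper evaluates four decoder-only large language models (LLMs) on AMR parsing and finds that straightforward fine-tuning achieves performance comparable to complex SOTA parsers, with LLaMA 3.2 standing out in semantic performance.

Details

Motivation: The study explores whether straightforward fine-tuning of decoder-only LLMs can deliver high-performance AMR parsing without a complex parsing pipeline.

Result: LLaMA 3.2 performs best after fine-tuning (SMATCH F1: 0.804), approaching the SOTA parser Graphene Smatch (0.854). LLaMA 3.2 leads in semantic performance, while Phi 3.5 excels in structural validity.

Insight: Decoder-only LLMs can reach high-performance AMR parsing through straightforward fine-tuning, without complex model design. Different models have complementary strengths in semantic and structural performance, suggesting that combining them could improve results further.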

Abstract: Abstract Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, novel, and straightforward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve performance comparable to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.

[83] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations cs.CLPDF

Aditya Kishore, Gaurav Kumar, Jasabanta Patro

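TL;DR: The paper proposes “MultiCheck”, a unified framework for fine-grained multimodal fact verification. It combines dedicated text and image encoders with a fusion module that captures cross-modal relationships, substantially improving detection of multimodal misinformation.

Details

Motivation: As multimodal misinformation grows, fact-checking systems that rely only on textual evidence are challenged; a unified framework that handles both text and image information is needed.

Result: The approach achieves a weighted F1 score of 0.84 on the Factify 2 dataset, substantially outperforming the baseline.

Insight: Explicit multimodal reasoning is effective for multimodal fact-checking and shows potential for scalable, interpretable fact-checking in complex real-world scenarios.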

Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called “MultiCheck”, designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.

[84] Towards Assessing Medical Ethics from Knowledge to Practice cs.CL | cs.AIPDF

Chang Hong, Minghao Wu, Qingying Xiao, Yuchi Wang, Xiang Wan

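TL;DR: The paper introduces the PrinciplismQA benchmark for systematically evaluating LLMs' medical ethics reasoning, revealing a gap between models' ethical knowledge and its practical application.

Details

Motivation: As LLMs are increasingly used in healthcare, existing benchmarks do not adequately evaluate their ethical reasoning; a more comprehensive evaluation tool is needed.

Result: Experiments show that models perform poorly when dynamically applying ethical principles such as Beneficence. Medical-domain fine-tuning improves ethical competence, but further progress requires better alignment with medical ethical knowledge.

Insight: Frontier closed-source models lead on general capability, but the specific demands of medical ethics call for finer-grained model alignment. The benchmark provides a scalable framework for ethical evaluation of medical AI.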

Abstract: The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs’ alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models’ ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models’ overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.

[85] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering cs.CLPDF

Catherine Kobus, François Lancelot, Marion-Cécile Martin, Nawal Ould Amer

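TL;DR: The ATLANTIS team's SemEval-2025 Task 3 submission focuses on detecting hallucinated text spans in question answering systems, explores several methods, and achieves strong results.

Details

Motivation: Large Language Models (LLMs) excel at Natural Language Generation (NLG) but are prone to hallucinations (incorrect or misleading content). Addressing this is important for the reliability of question answering systems.

Result: The approaches achieved top rankings in Spanish and competitive placements in English and German, supporting the value of context integration and fine-tuned models.

Insight: Integrating relevant context and fine-tuning models can mitigate hallucinations in question answering; few-shot prompting and fine-tuning are key to better performance.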

Abstract: This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification, or an LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.

[86] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation cs.CL | cs.AIPDF

Haonan Shangguan, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang

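TL;DR: The paper proposes MulCoT-RD, a lightweight model for Joint Multimodal Sentiment Reasoning and Classification (JMSRC) in resource-limited environments. Using chain-of-thought enhancement and distillation, it achieves strong performance and robust generalization.

Details

Motivation: Current Multimodal Sentiment Analysis (MSA) relies mainly on parameter-heavy language models and overlooks autonomous sentiment reasoning in resource-constrained environments, so a lightweight approach is needed.

Result: Experiments show that MulCoT-RD, with only 3B parameters, performs strongly on four datasets while offering robust generalization and enhanced interpretability.

Insight: With distillation, lightweight models can approach the performance of large models while fitting resource-limited settings.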

Abstract: The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a “Teacher-Assistant-Student” distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.

[87] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL cs.CLPDF

Sijie Wang, Quanjiang Guo, Kai Zhao, Yawei Zhang, Xin Li

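TL;DR: CodeBoost is a reinforcement-learning post-training framework based on code snippets. It improves code LLMs by making fuller use of the knowledge in code, without relying on human-annotated instructions.

Details

Motivation: Existing code LLMs are typically post-trained with reinforcement learning on human-annotated instructions, but high-quality instructions are costly to collect and hard to scale, whereas code snippets are abundant yet underused.

Result: Experiments across several code LLMs and benchmarks show that CodeBoost consistently improves performance, indicating a scalable and effective post-training framework.

Insight: Unlabeled code snippets can effectively improve code LLMs; future work could explore multilingual settings and larger-scale data.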

Abstract: Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using “human instruction-final answer” pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.

[88] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs cs.CLPDF

Dongxu Zhang, Ning Yang, Jihua Zhu, Jinnan Yang, Miao Xin

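TL;DR: The paper challenges the assumption that early errors harm reasoning chains the most, identifies a phenomenon it calls “Late-Stage Fragility”, and proposes ASCoT, which adaptively corrects reasoning chains and improves model accuracy.

Details

Motivation: Although CoT prompting has substantially improved LLM reasoning, the reliability of reasoning chains remains a challenge. Error-injection experiments show that late-stage errors are more likely than early ones to corrupt the final answer, which calls for a targeted remedy.

Result: On benchmarks such as GSM8K and MATH, ASCoT improves accuracy and outperforms baselines including standard CoT.

Insight: LLM reasoning needs adaptive correction mechanisms designed for specific failure modes rather than uniform verification strategies; fixing late-stage errors is key to improving reliability.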

Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held “cascading failure” hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term “Late-Stage Fragility”: errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.
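
To make the notion of a position-dependent weight concrete, here is a minimal sketch of how a positional impact score could prioritize late steps for verification. The abstract does not give the form of I(k), so the power-law weight, the per-step risk values, the threshold, and the function names are all hypothetical stand-ins rather than the authors' formulation.

```python
from typing import List


def positional_impact(k: int, n: int, alpha: float = 2.0) -> float:
    """Hypothetical I(k): increases monotonically with step position k in 1..n."""
    return (k / n) ** alpha


def steps_to_verify(risks: List[float], threshold: float = 0.5, alpha: float = 2.0) -> List[int]:
    """Return the 1-based indices of steps whose position-weighted risk meets the threshold.

    risks[k-1] is an assumed per-step error-risk estimate in [0, 1].
    """
    n = len(risks)
    return [
        k
        for k, r in enumerate(risks, start=1)
        if r * positional_impact(k, n, alpha) >= threshold
    ]


if __name__ == "__main__":
    # Four steps with identical raw risk: only the later ones are flagged.
    print(steps_to_verify([0.9, 0.9, 0.9, 0.9]))
```

With equal raw risk at every step, the position weight alone decides which steps are checked, which mirrors the abstract's emphasis on high-risk, late-stage steps; a correction stage would then be applied only to the flagged indices.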

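[89] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue cs.CL

Sukannya Purkayastha, Nils Dycke, Anne Lauscher, Iryna Gurevych

TL;DR: The paper frames meta-reviewing as a document-grounded dialogue for decision-making, proposes generating high-quality synthetic data with large language models (LLMs), trains dialogue assistants specialized for meta-reviewing, and confirms that they make meta-reviewing more efficient.

Details

Motivation: Meta-reviewing is a key step in peer review, and prior work has treated it as a summarization problem over review reports. The authors argue that it is also a decision-making process that requires weighing reviewer arguments within a broader context, so decision-makers can benefit from a dialogue assistant.

Result: The generated synthetic data is of higher quality, the assistants trained for meta-reviewing outperform general-purpose LLM assistants, and in real-world scenarios they improve meta-reviewing efficiency.

Insight: Meta-reviewing is a decision process as well as a summarization problem; high-quality synthetic data and domain-adapted training make dialogue assistants markedly more useful in practice.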

Abstract: Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code and data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog

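[90] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression cs.CL | cs.AI | cs.LG

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He

TL;DR: Proposes CGRS, a method that dynamically suppresses reflection behavior in large reasoning language models to cut redundant reasoning steps, substantially lowering inference cost without hurting accuracy.

Details

Motivation: Large Reasoning Language Models (LRLMs) tend to overthink during long chain-of-thought reasoning because of reflection behaviors (signaled by trigger words such as “Wait” and “Alternatively”), which raises inference cost and reduces practical utility. An efficient way to suppress redundant reflection is therefore needed.

Result: On four reasoning benchmarks (AIME24, AMC23, MATH500, and GPQA-D), CGRS reduces token usage by an average of 18.5% to 41.9% while preserving accuracy, and the results hold across model architectures and scales.

Insight: Dynamically suppressing reflection when the model is highly confident can substantially improve inference efficiency without sacrificing reasoning quality, which is of practical value for deploying LRLMs.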

Abstract: Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., “Wait” and “Alternatively”) to enhance performance. However, these reflection behaviors can lead to the overthinking problem, in which the model generates redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model’s generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS’s effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS’s practical value for efficient reasoning.
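
The suppression step can be pictured as a mask over next-token scores. The sketch below is a simplified illustration, not the paper's implementation: it represents logits as a plain dictionary keyed by token string, and it treats the certainty estimate and the `threshold` value as given inputs, since the abstract does not say how certainty is computed.

```python
import math

REFLECTION_TRIGGERS = {"Wait", "Alternatively"}  # trigger words named in the abstract


def suppress_reflection(logits: dict[str, float], certainty: float,
                        threshold: float = 0.9) -> dict[str, float]:
    """Mask reflection-trigger tokens when the (assumed) certainty is high."""
    if certainty < threshold:
        return dict(logits)  # low certainty: leave the distribution untouched
    return {tok: (-math.inf if tok in REFLECTION_TRIGGERS else score)
            for tok, score in logits.items()}


def greedy_pick(logits: dict[str, float]) -> str:
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)
```

Because the mask is applied only at decoding time, a scheme like this needs no retraining, which is consistent with the model-agnostic claim in the abstract.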

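[91] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025 cs.CL

Samy Ateia, Udo Kruschwitz

TL;DR: Studies whether large language models (LLMs) can improve their outputs through self-feedback in professional biomedical search tasks; preliminary results show that performance varies across models and tasks.

Details

Motivation: Professional search tasks such as biomedical research demand high user involvement and transparency, yet autonomous search systems may reduce user involvement and misalign with expert needs. The BioASQ challenge provides a platform for studying these issues.

Result: Preliminary results show that the self-feedback strategy performs unevenly across models and tasks; reasoning models may be better at generating useful feedback.

Insight: LLM self-correction shows promise, but future work needs to compare the effectiveness of LLM-generated feedback with direct input from human experts.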

Abstract: Agentic Retrieval Augmented Generation (RAG) and ‘deep research’ systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
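
The generate, evaluate, refine cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions: `llm` is any caller-supplied function from prompt to text, and the prompt wording and the `rounds` parameter are hypothetical, not taken from the paper.

```python
from typing import Callable


def self_feedback(question: str, llm: Callable[[str], str], rounds: int = 1) -> str:
    """Draft an answer, then alternate critique and revision for a fixed number of rounds."""
    draft = llm(f"Answer the question.\nQuestion: {question}")
    for _ in range(rounds):
        # The same model critiques its own draft.
        critique = llm(f"Critique this answer.\nQuestion: {question}\nAnswer: {draft}")
        # The critique is fed back to produce a revised draft.
        draft = llm(
            "Revise the answer using the critique.\n"
            f"Question: {question}\nAnswer: {draft}\nCritique: {critique}"
        )
    return draft
```

Each round costs two extra model calls, so the loop makes 1 + 2 * rounds calls in total.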

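[92] The TUB Sign Language Corpus Collection cs.CL

Eleftherios Avramidis, Vera Czehmann, Fabian Deckert, Lorenz Hufe, Aljoscha Lipski

TL;DR: The paper presents parallel corpora for 12 sign languages, covering more than 1,300 hours of video with a large number of subtitles, including the first parallel corpora for 8 Latin American sign languages and German Sign Language corpora ten times the size of those previously available.

Details

Motivation: To fill data gaps in sign language research, especially parallel data for Latin American sign languages, and to support research on sign language recognition and translation with large-scale data.

Result: A collection of 4,381 video files with 1.3M subtitles and 14M tokens, covering 12 sign languages.

Insight: Releasing a large multilingual sign language dataset should advance sign language recognition, translation, and linguistic research.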

Abstract: We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.

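[93] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models cs.CL

Chenzhuo Zhao, Xinda Wang, Yue Huang, Junting Lu, Ziqian Liu

TL;DR: The paper introduces TASE, a multilingual benchmark that evaluates the fine-grained token-level understanding and structural reasoning of large language models (LLMs), exposing their limitations on these tasks and directions for improvement.

Details

Motivation: Current LLMs perform well on high-level semantic tasks but poorly on fine-grained, token-level tasks that require precision and control. TASE aims to fill this gap by providing an evaluation standard for token awareness and structural reasoning.

Result: Experiments show that humans far outperform current LLMs, revealing persistent weaknesses in token-level reasoning. TASE offers a diagnostic tool for improving low-level language understanding and cross-lingual generalization.

Insight: Token-level understanding and structural reasoning are weak spots of LLMs; TASE's open-source code and dataset are a useful resource for future research.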

Abstract: While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs’ ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase.
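
To make two of the listed task types concrete, the sketch below shows how reference answers for character counting and length-constraint checks might be computed. It is an illustration only: the function names are hypothetical, and measuring length in NFC-normalized Unicode code points is an assumption, since the abstract does not specify the benchmark's unit.

```python
import unicodedata


def count_char(text: str, char: str) -> int:
    """Count occurrences of one character after NFC normalization."""
    text = unicodedata.normalize("NFC", text)
    char = unicodedata.normalize("NFC", char)
    if len(char) != 1:
        raise ValueError("char must be a single code point after normalization")
    return text.count(char)


def meets_length(text: str, exact_chars: int) -> bool:
    """Check a length constraint measured in NFC code points (an assumed unit)."""
    return len(unicodedata.normalize("NFC", text)) == exact_chars
```

Such checks are trivial for a program, which is why they make a useful probe: a model that sees subword tokens has no direct view of individual characters.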

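[94] LAG: Logic-Augmented Generation from a Cartesian Perspective cs.CL | cs.AI

Yilin Xiao, Chuang Zhou, Qinggang Zhang, Su Dong, Shengyuan Chen

TL;DR: The paper proposes Logic-Augmented Generation (LAG), a new paradigm that improves the logical coherence and robustness of knowledge-augmented generation through systematic question decomposition and dependency-aware reasoning.

Details

Motivation: Large language models (LLMs) often hallucinate on knowledge-intensive tasks, and conventional retrieval-augmented generation (RAG) struggles with complex reasoning because it lacks structured logical organization. Inspired by the Cartesian method, the authors propose LAG to address this through logical decomposition and dependency relations.

Result: Experiments show that LAG significantly improves reasoning robustness, reduces hallucination, and brings LLM problem-solving closer to human cognition.

Insight: By introducing logical structure and dependency-ordered decomposition, LAG offers a more systematic approach to knowledge-augmented generation that outperforms conventional RAG systems.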

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
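
The sequential resolution and termination steps can be sketched as a short loop. This is a schematic under stated assumptions: decomposition and final synthesis are left out, `retrieve` and `answer` are caller-supplied placeholders (hypothetical names), and returning `None` from `answer` is an assumed way to signal an unanswerable sub-question.

```python
from typing import Callable, Optional


def resolve_sub_questions(
    sub_questions: list[str],
    retrieve: Callable[[str, list[str]], str],
    answer: Callable[[str, str], Optional[str]],
) -> Optional[list[str]]:
    """Resolve dependency-ordered sub-questions; stop at the first unanswerable one."""
    answers: list[str] = []
    for sub_q in sub_questions:
        context = retrieve(sub_q, answers)  # prior answers guide retrieval
        result = answer(sub_q, context)
        if result is None:                  # logical termination: halt instead of guessing
            return None
        answers.append(result)
    return answers
```

Stopping at the first failure means later sub-questions never consume retrieval or generation calls, which is one way to read the abstract's point about wasted computation.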

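[95] The World According to LLMs: How Geographic Origin Influences LLMs’ Entity Deduction Capabilities cs.CL | cs.AI

Harsh Nishant Lalai, Raj Sanjay Shah, Jiaxin Pei, Sashank Varma, Yi-Chia Wang

TL;DR: Using the Geo20Q+ dataset and the 20 Questions game, the paper systematically evaluates geographic disparities in LLMs' entity deduction, finding that LLMs deduce entities from the Global North and Global West markedly better than those from the Global South and Global East, and that the language of play has little effect on the gap.

Details

Motivation: To study implicit geographic and cultural biases in LLMs by having models proactively ask questions in the 20 Questions game, which reveals geographic disparities in their reasoning without relying on conventional probes built from bias-eliciting questions.

Result: LLMs perform substantially better on entities from the Global North and Global West; language has minimal impact on the performance gap; pre-training corpus frequency and Wikipedia pageviews correlate only mildly with performance.

Insight: Free-form evaluation frameworks such as the 20 Questions game can expose implicit biases in LLMs that remain hidden under standard prompting.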

Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.
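
The canonical 20-question configuration can be pictured as a bounded loop between a guesser and a judge. The sketch below is a simplified illustration, not the released evaluation code: `ask` and `judge` are caller-supplied placeholders, and the reply vocabulary ("yes", "no", "correct") is an assumption.

```python
from typing import Callable, Optional


def play_twenty_questions(
    entity: str,
    ask: Callable[[list[tuple[str, str]]], str],
    judge: Callable[[str, str], str],
    max_turns: int = 20,
) -> Optional[int]:
    """Return the turn on which the guesser names the entity, or None on failure."""
    history: list[tuple[str, str]] = []
    for turn in range(1, max_turns + 1):
        question = ask(history)          # the guesser sees only the dialogue so far
        reply = judge(entity, question)  # assumed replies: "yes", "no", "correct"
        if reply == "correct":
            return turn
        history.append((question, reply))
    return None
```

Comparing success rates and turn counts across entities grouped by region is then a matter of running this loop per entity; raising `max_turns` approximates the unlimited-turn configuration.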

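[96] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs cs.CL | cs.CY | I.2.7; J.4

Franziska Weeber, Tanise Ceron, Sebastian Padó

TL;DR: The paper studies whether political opinions transfer between Western languages in multilingual large language models (MLLMs), finding that unaligned models show very few cross-lingual differences in political opinions, while alignment shifts opinions significantly and almost uniformly across all languages.

Details

Motivation: Public opinion surveys show cross-cultural differences in political opinions, but it is unclear whether these appear as cross-lingual differences in MLLMs.

Result: Unaligned models show very few significant cross-lingual differences in political opinions; alignment shifts opinions almost uniformly across all five languages.

Insight: Political opinions in MLLMs are largely shared across Western languages, which makes precise socio-cultural and political alignment hard to achieve.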

Abstract: Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs’ opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.

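[97] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy cs.CL

Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang

TL;DR: MathSmith is a framework for synthesizing high-difficulty mathematical problems; it constructs problems from scratch and optimizes them with reinforcement learning to improve the mathematical reasoning of large language models.

Details

Motivation: Progress in mathematical reasoning is limited by the scarcity of high-quality, high-difficulty training data, and existing synthesis methods rely on transforming templates, which limits diversity and scalability.

Result: On five mathematical reasoning benchmarks (including GSM8K and OlympiadBench), MathSmith outperforms baselines under both short and long CoT settings and demonstrates scalability and generalization.

Insight: High-difficulty synthetic data is promising for improving LLM mathematical reasoning, and a weakness-focused problem generation module can further improve performance on specific concepts.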

Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.

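[98] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models cs.CL | cs.AI

Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang

TL;DR: The paper proposes Cooper, a framework that jointly optimizes the policy model and the reward model to address two problems in reinforcement learning: rule-based rewards lack robustness, and model-based rewards are vulnerable to reward hacking. Experiments show that Cooper mitigates reward hacking and improves performance.

Details

Motivation: In current reinforcement learning, rule-based rewards are not robust and model-based rewards are vulnerable to reward hacking, which limits performance.

Result: Cooper alleviates reward hacking and yields a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct.

Insight: Dynamically updating the reward model is an effective way to combat reward hacking and offers a new approach to integrating reward models into reinforcement learning.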

Abstract: Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs to continue training the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.

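[99] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks cs.CL | cs.AI

Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan

TL;DR: OmniEAR is a framework for evaluating how large language models reason about physical interaction, tool use, and multi-agent coordination in embodied tasks; it reveals shortcomings of current models in dynamic capability acquisition and autonomous coordination.

Details

Motivation: Current large language models excel at abstract reasoning, but their reasoning in embodied tasks is largely unexplored, so a comprehensive evaluation framework is needed.

Result: Success drops from 85-96% with explicit instructions to 56-85% for tool reasoning and 63-85% for implicit collaboration, and compound tasks fail more than 50% of the time. Fine-tuning helps single-agent tasks substantially but improves multi-agent tasks only slightly.

Insight: Embodied reasoning poses a fundamental challenge for current models; complete environmental information actually degrades coordination performance, indicating that models cannot filter task-relevant constraints.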

Abstract: Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.

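[100] Learning to Reason for Factuality cs.CL

Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao

TL;DR: The paper proposes a new reward function that combines factual precision, response detail level, and answer relevance to address hallucination by reasoning LLMs (R-LLMs) on long-form factuality tasks. With online reinforcement learning (RL), the model substantially lowers its hallucination rate and improves answer quality across several benchmarks.

Details

Motivation: R-LLMs excel at complex reasoning tasks but hallucinate readily on long-form factuality tasks. Using an automatic evaluation framework such as FActScore directly as the reward in online RL leads to reward hacking, for example producing less detailed or less relevant answers.

Result: Across six long-form factuality benchmarks, the model reduces hallucination rate by an average of 23.1 percentage points and increases answer detail level by 23%, with no degradation in overall helpfulness.

Insight: Automatic factuality evaluation alone is not enough to optimize long-form answer quality; the reward must also account for other dimensions. Once reward hacking is addressed, online RL can improve the model's overall performance.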

Abstract: Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
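
The abstract names the three ingredients of the reward but not how they are combined. The sketch below shows one hypothetical combination, purely to illustrate why precision alone invites reward hacking: the formula, `detail_weight`, and `target_claims` are assumptions, not the paper's reward.

```python
def factuality_reward(num_supported: int, num_claims: int, relevance: float,
                      detail_weight: float = 0.1, target_claims: int = 20) -> float:
    """Hypothetical reward: precision plus a capped detail bonus, gated by relevance."""
    if num_claims == 0:
        return 0.0  # an empty answer earns nothing, so saying nothing is not rewarded
    precision = num_supported / num_claims                          # factual precision
    detail = min(num_supported, target_claims) / target_claims     # capped detail level
    return relevance * (precision + detail_weight * detail)        # relevance in [0, 1]
```

With precision as the only term, a one-claim answer and a ten-claim answer that are both fully supported would score the same; the detail term separates them, and the relevance gate removes credit for accurate but off-topic text.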

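[101] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages cs.CL | cs.AI

Mehrdad Zakershahrak, Samira Ghodratnama

TL;DR: H-NET++ is a tokenizer-free language modelling method for morphologically-rich languages that improves computational efficiency and language modelling performance through dynamic chunking and a hierarchical design.

Details

Motivation: In morphologically-rich languages (MRLs), words typically span many bytes, which makes conventional byte-level language models computationally expensive and limits their performance. The paper proposes H-NET++ to address these problems.

Result: On a 1.4B-token Persian corpus, H-NET++ achieves 12% better compression than the BPE-based GPT-2-fa, a 5.4pp gain on ParsGLUE, and 53% improved robustness to ZWNJ corruption.

Insight: Hierarchical dynamic chunking not only enables tokenizer-free modelling but also learns morphological structure automatically, offering an efficient solution for MRLs.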

Abstract: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.
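
The abstract reports compression in bits per byte (BPB) without defining it. Under the usual definition, BPB is the summed negative log-likelihood of a text, converted to bits and divided by its length in UTF-8 bytes. The sketch below assumes the log-likelihood is given in nats; the function name is illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    num_bytes = len(text.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("text must be non-empty")
    return total_nll_nats / (math.log(2) * num_bytes)
```

Normalizing by bytes instead of tokens is what allows a byte-level model to be compared with a BPE-based one, since the two segment the same text into different numbers of units.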

cs.MM

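[102] JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering cs.MM | cs.AI | cs.CL | cs.CR | I.2.7; K.4.1; K.6.5

Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang

TL;DR: The paper proposes JPS, a method that attacks multimodal large language models through collaborative visual perturbation and textual steering, raising the attack success rate (ASR) while also ensuring that the generated responses fulfill the attacker's malicious intent.

Details

Motivation: Current jailbreak research on multimodal large language models focuses on maximizing ASR and overlooks whether the generated responses actually fulfill the malicious intent, which leads to low-quality outputs. The authors propose JPS to address this gap.

Result: Experiments show that JPS achieves state-of-the-art performance on both ASR and MIFR, confirming its effectiveness.

Insight: 1. A successful attack must not only bypass safety mechanisms but also produce output of sufficient quality; 2. Multimodal cooperation between visual and textual components substantially strengthens the attack; 3. The proposed MIFR metric gives a more complete view of attack quality.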

Abstract: Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker’s malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via cooperation between a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a “steering prompt” optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attacker’s intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.

cs.AR

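[103] Understanding and Mitigating Errors of LLM-Generated RTL Code cs.AR | cs.CL | cs.LG

Jiazheng Zhang, Cheng Liu, Huawei Li

TL;DR: The paper analyzes the causes of errors in LLM-generated RTL code and proposes targeted correction techniques that substantially improve generation accuracy.

Details

Motivation: The potential of LLMs for RTL code generation is not fully realized: error rates are high and their causes are poorly understood, so systematic analysis and improvement methods are needed.

Result: The enhanced framework reaches 91.0% accuracy on the VerilogEval benchmark, 32.7% above the baseline approach.

Insight: Most errors in LLM-generated RTL stem from insufficient domain knowledge or ambiguous inputs, not from limits in model reasoning; supplying targeted knowledge and regularizing inputs are the keys to improvement.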

Abstract: Despite the promising potential of large language model (LLM) based register-transfer-level (RTL) code generation, the overall success rate remains unsatisfactory. Errors arise from various factors, with limited understanding of specific failure causes hindering improvement. To address this, we conduct a comprehensive error analysis and manual categorization. Our findings reveal that most errors stem not from LLM reasoning limitations, but from insufficient RTL programming knowledge, poor understanding of circuit concepts, ambiguous design descriptions, or misinterpretation of complex multimodal inputs. Leveraging in-context learning, we propose targeted error correction techniques. Specifically, we construct a domain-specific knowledge base and employ retrieval-augmented generation (RAG) to supply necessary RTL knowledge. To mitigate ambiguity errors, we introduce design description rules and implement a rule-checking mechanism. For multimodal misinterpretation, we integrate external tools to convert inputs into LLM-compatible meta-formats. For remaining errors, we adopt an iterative debugging loop (simulation, error localization, correction). We incorporate these error correction techniques into a foundational LLM-based RTL code generation framework, which significantly improves performance. Experimental results show that our enhanced framework achieves 91.0% accuracy on the VerilogEval benchmark, surpassing the baseline code generation approach by 32.7%, demonstrating the effectiveness of our methods.
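
The iterative debugging loop for remaining errors can be sketched as follows. This is a schematic under stated assumptions: `simulate`, `localize`, and `correct` are caller-supplied placeholders (hypothetical names) standing in for a simulator run and two LLM calls, and the `max_iters` budget is an assumption, since the abstract does not state a stopping rule.

```python
from typing import Callable, Optional


def debug_loop(
    code: str,
    simulate: Callable[[str], Optional[str]],
    localize: Callable[[str, str], str],
    correct: Callable[[str, str], str],
    max_iters: int = 3,
) -> tuple[str, bool]:
    """Repeat simulation, error localization, and correction until the design passes."""
    for _ in range(max_iters):
        error = simulate(code)            # None means the testbench passed
        if error is None:
            return code, True
        location = localize(code, error)  # narrow the failure to a code region
        code = correct(code, location)    # propose a fix for that region
    return code, simulate(code) is None   # final check after the last correction
```

The function returns the last candidate together with a pass flag, so a caller can tell a fixed design from one that exhausted its budget.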

cs.IR

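[104] Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning cs.IR | cs.CL

Wuqiang Zheng, Yiyan Xu, Xinyu Lin, Chongming Gao, Wenjie Wang

TL;DR: The paper proposes PaperEval, a framework that combines domain-aware retrieval with a latent reasoning mechanism to improve LLM-based paper evaluation, performing well in both experiments and real-world deployment.

Details

Motivation: Academic publications are growing rapidly, but existing LLM-based paper evaluation methods are constrained by outdated domain knowledge and limited reasoning capabilities.

Result: PaperEval outperforms existing methods on two datasets, and its real-world deployment has attracted strong user engagement, supporting the framework's effectiveness.

Insight: Combining domain context with deep reasoning substantially improves the quality and practical usefulness of paper evaluation.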

Abstract: With the rapid and continuous increase in academic publications, identifying high-quality research has become an increasingly pressing challenge. While recent methods leveraging Large Language Models (LLMs) for automated paper evaluation have shown great promise, they are often constrained by outdated domain knowledge and limited reasoning capabilities. In this work, we present PaperEval, a novel LLM-based framework for automated paper evaluation that addresses these limitations through two key components: 1) a domain-aware paper retrieval module that retrieves relevant concurrent work to support contextualized assessments of novelty and contributions, and 2) a latent reasoning mechanism that enables deep understanding of complex motivations and methodologies, along with comprehensive comparison against concurrently related work, to support more accurate and reliable evaluation. To guide the reasoning process, we introduce a progressive ranking optimization strategy that encourages the LLM to iteratively refine its predictions with an emphasis on relative comparison. Experiments on two datasets demonstrate that PaperEval consistently outperforms existing methods in both academic impact and paper quality evaluation. In addition, we deploy PaperEval in a real-world paper recommendation system for filtering high-quality papers, which has gained strong engagement on social media – amassing over 8,000 subscribers and attracting over 10,000 views for many filtered high-quality papers – demonstrating the practical effectiveness of PaperEval.

cs.GR

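[105] Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off cs.GR | cs.AI | cs.CV | cs.LG

Seungyong Lee, Jeong-gi Kwak

TL;DR: Voost is a unified, scalable diffusion transformer framework that jointly learns virtual try-on and try-off, enabling bidirectionally consistent inference without task-specific networks or additional labels.

Details

Motivation: In virtual try-on and try-off, garment-body correspondence is hard to model under pose and appearance variation, which calls for a unified and flexible framework.

Result: Across multiple benchmarks, Voost outperforms existing methods in alignment accuracy, visual fidelity, and generalization.

Insight: Jointly learning both directions improves the model through mutual supervision, and attention temperature scaling and self-corrective sampling address challenges at inference time.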

Abstract: Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
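
Attention temperature scaling is named but not specified in the abstract. As a generic illustration of the idea, the sketch below divides attention scores by a temperature before the softmax; how Voost chooses the temperature (for example, as a function of resolution or mask size) is not stated, so `temperature` is left as a plain parameter here.

```python
import math


def attention_weights(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over attention scores divided by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A temperature below 1 sharpens the distribution and a temperature above 1 flattens it, which gives a single knob for compensating when the number of attended tokens changes between training and inference.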

[106] Laplacian Analysis Meets Dynamics Modelling: Gaussian Splatting for 4D Reconstruction cs.GR | cs.CV | cs.MM

Yifan Zhou, Beizhen Zhao, Pengcheng Wu, Hao Wang

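TL;DR: The paper proposes a dynamic 3D Gaussian Splatting framework that uses hybrid explicit-implicit functions to address the over-smoothing and feature collision of existing methods in dynamic scene reconstruction, achieving higher reconstruction fidelity.

Details

Motivation: Existing dynamic 3D Gaussian Splatting methods face a spectral conflict between preserving motion details and maintaining deformation consistency, which leads to over-smoothing or feature collision. The paper aims to resolve this conflict and improve dynamic scene reconstruction quality.

Result: Experiments show that the method performs well on complex dynamic scenes and reaches state-of-the-art performance.

Insight: Combining spectral awareness with dynamic Gaussian attributes is an effective way to address key problems in dynamic 3D reconstruction.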

Abstract: While 3D Gaussian Splatting (3DGS) excels in static scene modeling, its extension to dynamic scenes introduces significant challenges. Existing dynamic 3DGS methods suffer from either over-smoothing due to low-rank decomposition or feature collision from high-dimensional grid sampling. This stems from the inherent spectral conflict between preserving motion details and maintaining deformation consistency at different frequencies. To address these challenges, we propose a novel dynamic 3DGS framework with hybrid explicit-implicit functions. Our approach contains three key innovations: a spectral-aware Laplacian encoding architecture that merges hash encoding with a Laplacian-based module for flexible frequency motion control, an enhanced Gaussian dynamics attribute that compensates for photometric distortions caused by geometric deformation, and an adaptive Gaussian split strategy guided by KDTree-based primitive control to efficiently query and optimize dynamic areas. Through extensive experiments, our method demonstrates state-of-the-art performance in reconstructing complex dynamic scenes, achieving better reconstruction fidelity.

[107] RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer cs.GR | cs.CV | cs.SD | eess.AS

Fangyu Du, Taiqing Li, Ziwei Zhang, Qian Qiao, Tan Yu

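TL;DR: RAP is a real-time audio-driven portrait animation framework that uses a hybrid attention mechanism and a static-dynamic training-inference paradigm to achieve high-quality audio-visual synchronization while meeting real-time requirements.

Details

Motivation: Existing methods generate high-quality talking-head videos, but their computational complexity rules out real-time deployment. RAP aims to preserve fine-grained spatiotemporal detail in a compact latent space and thereby address audio-visual synchronization.

Result: Experiments show that RAP achieves state-of-the-art performance under real-time constraints while maintaining high visual fidelity.

Insight: By avoiding explicit motion supervision and introducing a hybrid attention mechanism, RAP balances efficiency and quality, offering a new approach to real-time audio-driven animation.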

Abstract: Audio-driven portrait animation aims to synthesize realistic and natural talking head videos from an input audio signal and a single reference image. While existing methods achieve high-quality results by leveraging high-dimensional intermediate representations and explicitly modeling motion dynamics, their computational complexity renders them unsuitable for real-time deployment. Real-time inference imposes stringent latency and memory constraints, often necessitating the use of highly compressed latent representations. However, operating in such compact spaces hinders the preservation of fine-grained spatiotemporal details, thereby complicating audio-visual synchronization. We propose RAP (Real-time Audio-driven Portrait animation), a unified framework for generating high-quality talking portraits under real-time constraints. Specifically, RAP introduces a hybrid attention mechanism for fine-grained audio control, and a static-dynamic training-inference paradigm that avoids explicit motion supervision. Through these techniques, RAP achieves precise audio-driven control, mitigates long-term temporal drift, and maintains high visual fidelity. Extensive experiments demonstrate that RAP achieves state-of-the-art performance while operating under real-time constraints.

[108] Point cloud segmentation for 3D Clothed Human Layering cs.GR | cs.CV

Davide Garavaso, Federico Masi, Pietro Musoni, Umberto Castellani

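TL;DR: The paper proposes a new 3D point cloud segmentation paradigm in which each 3D point can be assigned to several clothing layers at once (semantic segmentation for clothing layering), and builds a synthetic dataset to support model training.

Details

Motivation: In 3D clothing modeling, high-quality semantic segmentation is a key step toward reconstructing garment layers, but existing methods are mostly designed for scene understanding and cannot handle overlapping layers.

Result: Experiments show that the method identifies clothing layers effectively on both synthetic and real-world datasets.

Insight: Semantic segmentation for clothing layering is an important step in 3D clothing modeling; per-point multi-label assignment outperforms conventional disjoint segmentation.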

Abstract: 3D cloth modeling and simulation is essential for avatar creation in several fields, such as fashion, entertainment, and animation. Achieving high-quality results is challenging due to the large variability of the clothed body, especially in the generation of realistic wrinkles. 3D scan acquisitions represent real-world objects more accurately but lack semantic information, which can be inferred with a reliable semantic reconstruction pipeline. To this end, shape segmentation plays a crucial role in identifying the semantic shape parts. However, current 3D shape segmentation methods are designed for scene understanding and interpretation, and little work is devoted to modeling. In the context of clothed body modeling, segmentation is a preliminary step for fully semantic reconstruction of the shape parts, namely the underlying body and the garments involved. These parts form several layers with strong overlap, in contrast with standard segmentation methods, which produce disjoint sets. In this work we propose a new 3D point cloud segmentation paradigm where each 3D point can be simultaneously associated with different layers. In this fashion we can estimate the underlying body parts and the unseen clothed regions, i.e., the part of a garment occluded by the clothing layer above. We name this segmentation paradigm clothed human layering. We create a new synthetic dataset that simulates very realistic 3D scans with the ground truth of the clothing layers involved. We propose and evaluate different neural network settings to deal with 3D clothing layering. We consider both coarse and fine-grained per-layer garment identification. Our experiments demonstrate the benefit of introducing proper strategies for segmentation in the garment domain on both synthetic and real-world scan datasets.

[109] Physically Controllable Relighting of Photographs cs.GR | cs.CV | I.4

Chris Careaga, Yağız Aksoy

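TL;DR: The paper presents a self-supervised image relighting method that combines the physical accuracy of traditional rendering with the photorealistic appearance of neural rendering, giving full physical control over image illumination.

Details

Motivation: Existing methods for in-the-wild image relighting usually lack physical accuracy and user control. This work aims to combine the strengths of physically based and neural rendering to offer more flexible control and greater realism.

Result: Experiments show that the method produces high-quality relighting results while providing physical accuracy and user control.

Insight: Combining physically based rendering with neural rendering enables highly controllable lighting edits while preserving realism, opening a new direction for in-the-wild image processing.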

Abstract: We present a self-supervised approach to in-the-wild image relighting that enables fully controllable, physically based illumination editing. We achieve this by combining the physical accuracy of traditional rendering with the photorealistic appearance made possible by neural rendering. Our pipeline works by inferring a colored mesh representation of a given scene using monocular estimates of geometry and intrinsic components. This representation allows users to define their desired illumination configuration in 3D. The scene under the new lighting can then be rendered using a path-tracing engine. We send this approximate rendering of the scene through a feed-forward neural renderer to predict the final photorealistic relighting result. We develop a differentiable rendering process to reconstruct in-the-wild scene illumination, enabling self-supervised training of our neural renderer on raw image collections. Our method represents a significant step in bringing the explicit physical control over lights available in typical 3D computer graphics tools, such as Blender, to in-the-wild relighting.

econ.GN [Back]

[110] Federal Reserve Communication and the COVID-19 Pandemic econ.GN | cs.CL | cs.IT | math.IT | q-fin.EC | stat.AP | stat.ML

Jonathan Benchimol, Sophia Kazinnik, Yossi Saadon

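TL;DR: The study analyzes the Federal Reserve’s communication strategy during the COVID-19 pandemic and finds a stronger focus on financial stability and market volatility, with responses that were more reactive than in earlier crises.

Details

Motivation: To examine how the Fed’s communication strategy evolves during economic crises, and in particular its distinctive response during the COVID-19 pandemic.

Result: Fed communication placed more weight on financial stability, and both communication and policy actions were more reactive; declining financial-stability sentiment anticipated accommodative policy.

Insight: Central bank communication strategy keeps evolving during crises, and communicating about unconventional monetary policy has become the “new normal” for the Fed.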

Abstract: In this study, we examine the Federal Reserve’s communication strategies during the COVID-19 pandemic, comparing them with communication during previous periods of economic stress. Using specialized dictionaries tailored to COVID-19, unconventional monetary policy (UMP), and financial stability, combined with sentiment analysis and topic modeling techniques, we identify a distinct focus in Fed communication during the pandemic on financial stability, market volatility, social welfare, and UMP, characterized by notable contextual uncertainty. Through comparative analysis, we juxtapose the Fed’s communication during the COVID-19 crisis with its responses during the dot-com and global financial crises, examining content, sentiment, and timing dimensions. Our findings reveal that Fed communication and policy actions were more reactive to the COVID-19 crisis than to previous crises. Additionally, declining sentiment related to financial stability in interest rate announcements and minutes anticipated subsequent accommodative monetary policy decisions. We further document that communicating about UMP has become the “new normal” for the Fed’s Federal Open Market Committee meeting minutes and Chairman’s speeches since the Global Financial Crisis, reflecting an institutional adaptation in communication strategy following periods of economic distress. These findings contribute to our understanding of how central bank communication evolves during crises and how communication strategies adapt to exceptional economic circumstances.

cs.RO [Back]

[111] Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation cs.RO | cs.CL | cs.HC | cs.LG | cs.MA | I.2.9; I.2.7; I.2.6

Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray

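TL;DR: The paper proposes MICoBot, a system that realizes human-robot collaboration through a mixed-initiative dialog paradigm and adapts its strategy dynamically during task execution, significantly improving task success and user experience.

Details

Motivation: Long-horizon human-robot collaboration must adapt to the behavior and needs of different human partners, which requires a flexible communication mechanism that lets both parties coordinate task execution.

Result: In experiments with 18 human participants over 27 hours, MICoBot significantly outperformed a pure LLM baseline and other task allocation models.

Insight: Dynamic task allocation and natural-language interaction are key to improving the efficiency of human-robot collaboration.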

Abstract: Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot’s capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot’s capabilities (measured by a simulation-pretrained affordance model) and the human’s estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. Our extensive evaluations in simulation and the real world (on a physical robot with 18 unique human participants over 27 hours) demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly better task success and user experience than a pure LLM baseline and other agent allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/.

[112] Learning to See and Act: Task-Aware View Planning for Robotic Manipulation cs.RO | cs.CV

Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen

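TL;DR: The paper proposes Task-Aware View Planning (TAVP), which improves the robustness and generalization of robotic manipulation through active view planning and task-specific representation learning.

Details

Motivation: Existing vision-language-action (VLA) models for multi-task robotic manipulation rely on static viewpoints and shared visual encoders, which leads to limited 3D perception and task interference and restricts performance and generalization.

Result: Experiments on RLBench tasks show that TAVP outperforms existing fixed-view methods.

Insight: Active view planning and task-specific representation learning are key to improving multi-task robotic manipulation.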

Abstract: Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.

[113] DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model cs.RO | cs.CV

Rui Yu, Xianghang Zhang, Runkai Zhao, Huaicheng Yan, Meng Wang

TL;DR: DistillDrive is an end-to-end knowledge distillation-based autonomous driving model that enhances multi-mode motion feature learning through diversified instance imitation, achieving significant improvements in collision reduction and closed-loop performance.

Details

Motivation: Existing end-to-end autonomous driving models focus excessively on ego-vehicle status and lack planning-oriented understanding, limiting decision-making robustness.

Result: Achieves a 50% reduction in collision rate and a 3-point improvement in closed-loop performance on nuScenes and NAVSIM datasets.

Insight: Combining knowledge distillation with reinforcement learning and generative modeling can significantly enhance motion feature learning and decision-making robustness in autonomous driving.

Abstract: End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work focuses excessively on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model. Code and model are publicly available at https://github.com/YuruiAI/DistillDrive

Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury

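TL;DR: The paper proposes a method for improving the safety of robot crowd navigation that combines adaptive conformal inference with constrained reinforcement learning to handle uncertainty and achieve robustness. It performs strongly in both in-distribution and out-of-distribution scenarios.

Details

Motivation: Existing reinforcement-learning-based crowd navigation methods degrade severely under distribution shift, so a robust navigation policy that handles uncertainty and adapts to distribution changes is needed.

Result: In the in-distribution setting, the success rate is 96.93%, over 8.80% higher than the previous best method, with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. The method is also strongly robust in three out-of-distribution scenarios.

Insight: Handling uncertainty is key to robust crowd navigation; combining adaptive conformal inference with constrained reinforcement learning copes effectively with distribution shift.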

Abstract: Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent’s behavior through constrained reinforcement learning. The system helps regulate the agent’s actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on https://gen-safe-nav.github.io/.
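
To make the uncertainty estimate concrete, here is a minimal sketch of the standard adaptive conformal inference update, in which the miscoverage level is nudged after every observation so that long-run coverage tracks the target. It works on scalar prediction errors; how the paper applies it to pedestrian trajectories is not stated in the abstract, and the function names, step size `gamma`, and toy error sequence are assumptions.

```python
import math

def empirical_quantile(scores, level):
    """Smallest past score that covers at least `level` of the history.

    Simplification (assumption): a level at or below 0 returns the smallest
    score rather than an empty prediction set.
    """
    if not scores or level >= 1.0:
        return float("inf")  # no history yet, or full coverage requested
    ordered = sorted(scores)
    rank = max(1, math.ceil(max(level, 0.0) * len(ordered)))
    return ordered[rank - 1]

def aci_radii(errors, target_alpha=0.1, gamma=0.05):
    """Run adaptive conformal inference over a stream of absolute errors.

    Returns the interval radius used at each step and the final alpha_t.
    Update rule: alpha_{t+1} = alpha_t + gamma * (target_alpha - miss_t).
    """
    alpha_t = target_alpha
    scores, radii = [], []
    for err in errors:
        radius = empirical_quantile(scores, 1.0 - alpha_t)
        radii.append(radius)
        miss = 1.0 if err > radius else 0.0  # 1 when the interval failed to cover
        alpha_t += gamma * (target_alpha - miss)
        scores.append(err)
    return radii, alpha_t

# Toy stream (hypothetical): errors jump tenfold halfway through,
# mimicking a shift in pedestrian behavior.
errors = [0.1] * 50 + [1.0] * 50
radii, final_alpha = aci_radii(errors)
```

After the shift, a few consecutive misses lower `alpha_t`, which raises the requested quantile until the interval widens to cover the new error scale. A policy that observes these radii sees its own predictor becoming less reliable and can act more cautiously.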

[115] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation cs.RO | cs.CV

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen

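TL;DR: Genie Envisioner (GE) is a unified foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a video-generative framework. It comprises GE-Base (a video diffusion model), GE-Act (an action-mapping decoder), and GE-Sim (a neural simulator).

Details

Motivation: Robotic manipulation needs efficient tools for policy learning and simulation, but existing systems are usually fragmented and generalize poorly. GE aims to provide a unified framework that models visual and action dynamics together.

Result: GE achieves precise policy inference across diverse robot embodiments and supports closed-loop policy development; its visual fidelity, physical consistency, and instruction-action alignment are validated with EWMBench.

Insight: By jointly modeling visual and action dynamics, GE offers a scalable framework for general-purpose robotic manipulation and a practical foundation for future embodied-intelligence research.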

Abstract: We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.

cs.DB [Back]

[116] Making Prompts First-Class Citizens for Adaptive LLM Pipelines cs.DB | cs.AI | cs.CL

Ugur Cetintemel, Shu Chen, Alexander W. Lee, Deepti Raghavan

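TL;DR: The paper proposes SPEAR, a language and runtime that treats prompts as structured, adaptive, first-class citizens to address the lack of prompt management in LLM pipelines. SPEAR supports dynamic prompt refinement at runtime and structured organization of prompt fragments, and defines a prompt algebra.

Details

Motivation: In modern LLM pipelines the prompt, though central, remains a brittle, opaque string disconnected from the dataflow, which limits reuse, optimization, and runtime control.

Result: Preliminary experiments indicate that dynamic prompt refinement outperforms static prompts and agentic retries, and show the effect of prompt-level optimizations such as operator fusion.

Insight: Managing prompts as structured objects can substantially improve the flexibility, reusability, and optimization potential of LLM pipelines.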

Abstract: Modern LLM pipelines increasingly resemble data-centric systems: they retrieve external context, compose intermediate outputs, validate results, and adapt based on runtime feedback. Yet, the central element guiding this process – the prompt – remains a brittle, opaque string, disconnected from the surrounding dataflow. This disconnect limits reuse, optimization, and runtime control. In this paper, we describe our vision and an initial design for SPEAR, a language and runtime that fills this prompt management gap by making prompts structured, adaptive, and first-class components of the execution model. SPEAR enables (1) runtime prompt refinement – modifying prompts dynamically in response to execution-time signals such as confidence, latency, or missing context; and (2) structured prompt management – organizing prompt fragments into versioned views with support for introspection and logging. SPEAR defines a prompt algebra that governs how prompts are constructed and adapted within a pipeline. It supports multiple refinement modes (manual, assisted, and automatic), giving developers a balance between control and automation. By treating prompt logic as structured data, SPEAR enables optimizations such as operator fusion, prefix caching, and view reuse. Preliminary experiments quantify the behavior of different refinement modes compared to static prompts and agentic retries, as well as the impact of prompt-level optimizations such as operator fusion.
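
The idea of a prompt as a versioned, structured object that is refined in response to a runtime signal can be illustrated with a small sketch. This is not SPEAR's language or API, which the abstract does not show; the `Prompt` class, `refine_on_signal`, the fragment names, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Prompt:
    """A prompt held as named fragments plus a version number and a change log."""
    fragments: dict
    version: int = 0
    log: tuple = field(default_factory=tuple)

    def render(self):
        """Join the fragments in insertion order into the final prompt string."""
        return "\n".join(self.fragments.values())

    def refine(self, name, text, reason):
        """Return a new version with one fragment added or replaced, logging why."""
        updated = dict(self.fragments)
        updated[name] = text
        entry = (self.version + 1, name, reason)
        return Prompt(updated, self.version + 1, self.log + (entry,))

def refine_on_signal(prompt, confidence, threshold=0.7):
    """Hypothetical automatic refinement: add guidance when confidence is low."""
    if confidence < threshold:
        return prompt.refine(
            "guidance",
            "Cite the retrieved context for every claim.",
            f"confidence {confidence:.2f} below {threshold:.2f}",
        )
    return prompt

base = Prompt({"task": "Answer the user question.", "context": "<retrieved passages>"})
refined = refine_on_signal(base, confidence=0.42)
unchanged = refine_on_signal(base, confidence=0.95)
```

Because each refinement produces a new version and leaves the old one intact, earlier versions stay available for inspection, and the log records which signal triggered each change.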

cs.AI [Back]

[117] Prescriptive Agents based on Rag for Automated Maintenance (PARAM) cs.AI | cs.CL | cs.LG | cs.MA | eess.SP

Chitranshu Harbola, Anupam Purwar

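TL;DR: The paper presents PARAM, an LLM-based intelligent maintenance system that combines vibration frequency analysis with multi-agent generation to provide actionable maintenance recommendations, delivering accurate anomaly detection and context-relevant maintenance guidance.

Details

Motivation: Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. Traditional methods usually stop at anomaly detection and lack actionable recommendations. The paper aims to fill this gap by combining LLMs with multi-agent techniques.

Result: Experiments show that the system detects anomalies effectively and provides context-relevant maintenance guidance, bridging the gap between condition monitoring and actionable maintenance planning.

Insight: Combining LLMs with multi-agent techniques can make industrial maintenance more intelligent and scalable, offering a unified maintenance framework for machinery across industrial sectors.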

Abstract: Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi-agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations that include immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation on bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors.
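
The four frequencies named in the abstract follow from standard bearing kinematics, so the serialization step can be sketched end to end. The formulas below assume a stationary outer race and a rotating inner race. The bearing geometry, the amplitude values, and the sentence template are illustrative assumptions, not the paper's data or prompt format.

```python
import math

def bearing_fault_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_deg=0.0):
    """Characteristic bearing fault frequencies in Hz.

    Assumes the outer race is fixed and the inner race turns at shaft_hz.
    ball_d and pitch_d must share the same length unit.
    """
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1 - ratio),                            # cage
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - ratio),                 # outer race
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + ratio),                 # inner race
        "BSF": 0.5 * (pitch_d / ball_d) * shaft_hz * (1 - ratio ** 2),  # ball spin
    }

def serialize_reading(freqs, amplitudes):
    """Turn per-frequency amplitudes into one sentence for an LLM prompt."""
    parts = [
        f"{name} at {freqs[name]:.1f} Hz has amplitude {amplitudes[name]:.3f} g"
        for name in ("BPFO", "BPFI", "BSF", "FTF")
    ]
    return "Bearing vibration reading: " + "; ".join(parts) + "."

# Hypothetical bearing: 9 balls, 7.94 mm ball diameter, 39.04 mm pitch diameter.
freqs = bearing_fault_frequencies(shaft_hz=30.0, n_balls=9, ball_d=7.94, pitch_d=39.04)
# Hypothetical measured amplitudes at each characteristic frequency.
amplitudes = {"BPFO": 0.412, "BPFI": 0.035, "BSF": 0.020, "FTF": 0.011}
text = serialize_reading(freqs, amplitudes)
```

A quick sanity check on the formulas is that BPFO and BPFI always sum to the ball count times the shaft frequency, and BPFO equals the ball count times FTF.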

[118] Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses cs.AI | cs.CL

Bin Han, Robert Wolfe, Anat Caspi, Bill Howe

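TL;DR: The paper examines the use of large language models (LLMs) for integrating urban spatial data and finds that they show spatial reasoning ability but struggle to connect the macro-scale environment with computational geometry tasks. When given relevant features and combined with a review-and-refine method, LLMs produce high-performing results.

Details

Motivation: Traditional rule-based integration methods cannot cover all edge cases, and machine learning methods need large amounts of labeled data. The paper explores LLMs as an alternative.

Result: LLMs show promise for spatial data integration but need auxiliary methods to improve on computational geometry tasks.

Insight: LLMs can serve as a flexible alternative to traditional rules, though their success depends on future research directions such as multi-modal integration and post-training.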

Abstract: We explore the application of large language models (LLMs) to empower domain experts in integrating large, heterogeneous, and noisy urban spatial datasets. Traditional rule-based integration methods are unable to cover all edge cases, requiring manual verification and repair. Machine learning approaches require collecting and labeling of large numbers of task-specific samples. In this study, we investigate the potential of LLMs for spatial data integration. Our analysis first considers how LLMs reason about environmental spatial relationships mediated by human experience, such as between roads and sidewalks. We show that while LLMs exhibit spatial reasoning capabilities, they struggle to connect the macro-scale environment with the relevant computational geometry tasks, often producing logically incoherent responses. But when provided relevant features, thereby reducing dependence on spatial reasoning, LLMs are able to generate high-performing results. We then adapt a review-and-refine method, which proves remarkably effective in correcting erroneous initial responses while preserving accurate responses. We discuss practical implications of employing LLMs for spatial data integration in real-world contexts and outline future research directions, including post-training, multi-modal integration methods, and support for diverse data formats. Our findings position LLMs as a promising and flexible alternative to traditional rule-based heuristics, advancing the capabilities of adaptive spatial data integration.

[119] Cognitive Duality for Adaptive Web Agents cs.AI | cs.CL | cs.MA

Jiarun Liu, Chunhong Zhang, Zheng Hu

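TL;DR: The paper proposes a web agent framework based on the dual-process theory of human cognition that combines offline imitation learning with online exploration, substantially improving efficiency while keeping performance competitive.

Details

Motivation: Current web agent frameworks rarely integrate offline imitation learning and online exploration effectively; the dual-process theory of human cognition offers a new way to address this.

Result: On WebArena, CogniWeb achieves a 43.96% success rate while reducing token usage by 75%.

Insight: The dual-process theory of human cognition can guide the design of more efficient AI agents, especially in dynamic and complex environments.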

Abstract: Web navigation represents a critical and challenging domain for evaluating artificial general intelligence (AGI), demanding complex decision-making within high-entropy, dynamic environments with combinatorially explosive action spaces. Current approaches to building autonomous web agents either focus on offline imitation learning or online exploration, but rarely integrate both paradigms effectively. Inspired by the dual-process theory of human cognition, we derive a principled decomposition into fast System 1 and slow System 2 cognitive processes. This decomposition provides a unifying perspective on existing web agent methodologies, bridging the gap between offline learning of intuitive reactive behaviors and online acquisition of deliberative planning capabilities. We implement this framework in CogniWeb, a modular agent architecture that adaptively toggles between fast intuitive processing and deliberate reasoning based on task complexity. Our evaluation on WebArena demonstrates that CogniWeb achieves competitive performance (43.96% success rate) while maintaining significantly higher efficiency (75% reduction in token usage).

[120] QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering cs.AI | cs.CL | cs.CV

Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li

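TL;DR: QA-Dragon is a dynamic RAG system that uses a domain router and a search router and combines text and image retrieval to improve multimodal reasoning on knowledge-intensive VQA tasks.

Details

Motivation: Existing RAG methods retrieve from text or images in isolation, which limits their ability to handle complex queries such as those requiring multi-hop reasoning or up-to-date knowledge. QA-Dragon aims to address this.

Result: On the Meta CRAG-MM Challenge, QA-Dragon significantly outperforms baseline models, with gains of 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.

Insight: Dynamic retrieval strategies and hybrid multimodal retrieval support complex queries more effectively, especially where up-to-date knowledge and multi-hop reasoning are needed.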

Abstract: Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query’s subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.

[121] A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents cs.AI | cs.CL

Andrew Kiruluta

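TL;DR: The paper proposes a hybrid architecture that integrates decision-tree-based symbolic reasoning with the capabilities of large language models (LLMs), using multi-agent coordination for efficient and interpretable reasoning.

Details

Motivation: Existing approaches couple symbolic reasoning and neural modules loosely and lack unity and interpretability. The paper aims to improve reasoning consistency and generalization through a tightly integrated architecture.

Result: The system performs strongly on reasoning benchmarks: +7.2% entailment consistency on ProofWriter, +5.3% accuracy on GSM8k, and +6.0% abstraction accuracy on ARC. Applications in clinical decision support and scientific discovery also illustrate its effectiveness.

Insight: The architecture offers an extensible and interpretable approach to general-purpose neuro-symbolic reasoning, combining the precision of symbolic reasoning with the generalization ability of LLMs.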

Abstract: We propose a hybrid architecture that integrates decision tree-based symbolic reasoning with the generative capabilities of large language models (LLMs) within a coordinated multi-agent framework. Unlike prior approaches that loosely couple symbolic and neural modules, our design embeds decision trees and random forests as callable oracles within a unified reasoning system. Tree-based modules enable interpretable rule inference and causal logic, while LLM agents handle abductive reasoning, generalization, and interactive planning. A central orchestrator maintains belief state consistency and mediates communication across agents and external tools, enabling reasoning over both structured and unstructured inputs. The system achieves strong performance on reasoning benchmarks. On ProofWriter, it improves entailment consistency by +7.2% through logic-grounded tree validation. On GSM8k, it achieves +5.3% accuracy gains in multistep mathematical problems via symbolic augmentation. On ARC, it boosts abstraction accuracy by +6.0% through integration of symbolic oracles. Applications in clinical decision support and scientific discovery show how the system encodes domain rules symbolically while leveraging LLMs for contextual inference and hypothesis generation. This architecture offers a robust, interpretable, and extensible solution for general-purpose neuro-symbolic reasoning.

cs.SE [Back]

[122] Posterior-GRPO: Rewarding Reasoning Processes in Code Generation cs.SE | cs.AI | cs.CL | cs.LG

Lishui Fan, Yu Zhang, Mouxiang Chen, Zhongxin Liu

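TL;DR: The paper proposes a reinforcement learning framework that incorporates the quality of the intermediate reasoning process in code generation. By developing a new benchmark and an improved reward model, it mitigates reward hacking and improves code generation performance.

Details

Motivation: Existing RL-based code generation methods rely only on outcome rewards from test cases and neglect the quality of the intermediate reasoning process, while directly supervising that process is prone to reward hacking.

Result: A 7B-parameter model performs strongly on code generation tasks, outperforming outcome-only baselines by 4.5% and reaching performance comparable to GPT-4-Turbo.

Insight: Incorporating reasoning-process quality into RL, with rewards conditioned on task success to prevent hacking, can significantly improve performance on generation tasks.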

Abstract: Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded based (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B parameter reward model with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By selectively applying rewards to the reasoning processes of only successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model’s internal reasoning with final code correctness. A 7B parameter model with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5%, achieving comparable performance to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available.
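
The core of P-GRPO as described, granting the process reward only when the outcome succeeds, can be sketched in a few lines. The additive form, the `weight` parameter, and the toy scores are assumptions; the abstract does not give the exact reward formula. The group-relative normalization shown is the usual GRPO-style advantage.

```python
import statistics

def posterior_reward(passed_tests, reasoning_score, weight=1.0):
    """Outcome reward plus a reasoning reward that is gated on success.

    Assumption: reasoning_score is in [0, 1] and comes from a reward model.
    A failed sample gets 0 no matter how good its reasoning looks, so the
    policy cannot raise its reward by polishing reasoning alone.
    """
    outcome = 1.0 if passed_tests else 0.0
    return outcome + weight * outcome * reasoning_score

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group (hypothetical): four sampled solutions to one coding task,
# each a pair of (passed the tests, reasoning score from the reward model).
samples = [(True, 0.9), (True, 0.2), (False, 0.95), (False, 0.1)]
rewards = [posterior_reward(passed, score) for passed, score in samples]
advantages = group_advantages(rewards)
```

In this toy group the third sample has the highest reasoning score but fails the tests, so it receives no reward, which is the behavior that discourages exploiting the reasoning reward signal.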

cs.LG [Back]

[123] R-Zero: Self-Evolving Reasoning LLM from Zero Data cs.LG | cs.AI | cs.CL

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li

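TL;DR: R-Zero is a fully autonomous framework that generates its own training data from scratch. Two independent models, a Challenger and a Solver, co-evolve so that learning proceeds without human annotation, substantially improving reasoning capability.

Details

Motivation: Existing self-evolving LLMs still depend on human-curated tasks and labels, which limits their ability to advance beyond human intelligence. R-Zero aims to remove this bottleneck through fully autonomous data generation and learning.

Result: R-Zero substantially improves performance on math-reasoning and general-domain reasoning benchmarks (e.g., Qwen3-4B-Base gains +6.49 and +7.54, respectively).

Insight: Fully autonomous data generation combined with co-evolution offers a new approach to LLM self-evolution and may help advance AI toward super-intelligence.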

Abstract: Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
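One way to reward a Challenger for tasks "near the edge" of the Solver's capability is to score a task by how evenly the Solver's sampled answers split. The sketch below is an assumption made for illustration, not the formula from the paper: with no ground-truth labels available, it uses the share of the majority answer as a proxy for the Solver's confidence and gives the highest reward when that share is 0.5.

```python
from collections import Counter


def challenger_reward(solver_answers):
    """Reward a proposed task by the Solver's uncertainty on it.

    solver_answers: final answers from several independent Solver samples.
    Returns 1.0 when the majority answer holds exactly half of the samples
    and 0.0 when all samples agree (the task is too easy to be informative).
    The functional form 1 - 2 * |share - 0.5| is a hypothetical choice.
    """
    counts = Counter(solver_answers)
    majority_share = counts.most_common(1)[0][1] / len(solver_answers)
    return 1.0 - 2.0 * abs(majority_share - 0.5)


easy_task = challenger_reward(["4"] * 8)                 # Solver always agrees
edge_task = challenger_reward(["4"] * 4 + ["5"] * 4)     # Solver is split
medium_task = challenger_reward(["4"] * 6 + ["5"] * 2)   # mostly consistent
```

Under this proxy, a task the Solver answers identically every time earns nothing, which pushes the Challenger toward harder questions as the Solver improves.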

[124] Exploring Superior Function Calls via Reinforcement Learning cs.LG | cs.AI | cs.CL

Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng

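TL;DR: The paper proposes a reinforcement learning framework that optimizes function calling through an entropy-based exploration strategy. It addresses insufficient exploration, lack of structured reasoning, and inadequate verification of parameter extraction, and reports significant performance gains.

Details

Motivation: Current training approaches fall short in developing the function calling capabilities of large language models: supervised fine-tuning relies on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls.

Result: On the Berkeley Function Calling Leaderboard, the framework reaches 86.02% overall accuracy and outperforms standard GRPO by up to 6% on complex multi-function scenarios.

Insight: Code-pretrained models give reinforcement learning an advantageous starting point because of their structured language generation capabilities, which suggests that these capabilities benefit RL for function calling.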

Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy-based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models, and dataset to benefit the community.
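The abstract mentions abstract syntax tree validation as part of data preparation without describing it. A minimal sketch of what such a check could look like, using Python's standard `ast` module, is shown here. The schema layout, the function name `validate_call`, and the `get_weather` tool are all hypothetical.

```python
import ast


def validate_call(call_text, schema):
    """Check that call_text is one well-formed call to a known function.

    schema maps a function name to its "required" and "allowed" keyword names.
    Only keyword arguments are accepted, to keep parameter checks unambiguous.
    """
    try:
        tree = ast.parse(call_text, mode="eval")
    except SyntaxError:
        return False  # not parseable as an expression
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    spec = schema.get(call.func.id)
    if spec is None or call.args:
        return False  # unknown function, or positional arguments used
    names = [kw.arg for kw in call.keywords]
    if None in names:
        return False  # **kwargs unpacking cannot be checked statically
    return set(spec["required"]) <= set(names) <= set(spec["allowed"])


# Hypothetical tool schema used only for this example.
schema = {"get_weather": {"required": ["city"], "allowed": ["city", "unit"]}}

ok = validate_call('get_weather(city="Paris", unit="C")', schema)
missing_required = validate_call('get_weather(unit="C")', schema)
malformed = validate_call('get_weather(city="Paris"', schema)
unknown_tool = validate_call("delete_all()", schema)
```

Parsing rather than string matching rejects samples that merely look like calls, such as the one with the unbalanced parenthesis, before they reach training.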

[125] FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance cs.LG | cs.AI | cs.CL

Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan

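TL;DR: The paper proposes FAITH, a framework for evaluating intrinsic hallucinations of large language models (LLMs) on tabular data in finance. It frames evaluation as a context-aware masked span prediction task, offering a more reliable way to assess financial generative AI systems.

Details

Motivation: Finance places very high demands on the accuracy and reliability of LLMs, since even minor numerical errors can affect decision-making and regulatory compliance. Existing hallucination benchmarks, however, rarely cover the unique requirements of financial data.

Result: The authors report that FAITH effectively captures LLM hallucinations on financial tabular data, providing a reliable method for LLM evaluation in finance.

Insight: The study underlines the importance of domain-specific evaluation of LLMs in finance and lays groundwork for more trustworthy financial generative AI systems.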

Abstract: Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, conceptualized as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
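A masked span prediction item of the kind described can be sketched as follows. This is an illustration under stated assumptions rather than the FAITH pipeline: the regular expression, the `[MASK]` token, and the exact-match numeric comparison are hypothetical choices, and the example sentence is invented.

```python
import re

# Matches integers and decimals with optional thousands separators.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def make_masked_examples(sentence, mask_token="[MASK]"):
    """Create one (masked_sentence, answer) pair per numeric span."""
    examples = []
    for match in NUMBER.finditer(sentence):
        masked = sentence[:match.start()] + mask_token + sentence[match.end():]
        examples.append((masked, match.group()))
    return examples


def is_faithful(prediction, answer):
    """Compare a model's prediction with the masked value numerically."""
    try:
        predicted = float(prediction.replace(",", "").strip())
    except ValueError:
        return False  # a non-numeric prediction cannot match
    return predicted == float(answer.replace(",", ""))


sentence = "Revenue rose to 1,250.5 million from 980 million."
examples = make_masked_examples(sentence)
```

A model is then shown each masked sentence together with its source document, and its answer is checked against the hidden value, for example `is_faithful("1250.50", "1,250.5")`.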

[126] ALScope: A Unified Toolkit for Deep Active Learning cs.LG | cs.CV

Chenkai Wu, Yuanyuan Qi, Xiaohao Yang, Jueqing Lu, Gang Liu

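TL;DR: ALScope is a unified deep active learning toolkit that integrates 10 CV and NLP datasets and 21 representative algorithms and supports flexible configuration of experimental conditions. The study finds that algorithm performance varies significantly across scenarios and that algorithms still have room for improvement in non-standard settings such as imbalanced and open-set data.

Details

Motivation: Deep active learning (DAL) lacks a unified evaluation platform. ALScope aims to provide a standardized tool for fair and systematic evaluation of algorithms under complex conditions such as distribution shift and data imbalance.

Result: Experiments show that (1) DAL algorithm performance varies significantly across domains and task settings; (2) algorithms leave room for improvement in non-standard scenarios; and (3) some algorithms perform well but require longer selection time.

Insight: ALScope exposes the limitations of DAL algorithms in complex scenarios and points to future research directions such as efficiency and adaptability to non-standard scenarios.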

Abstract: Deep Active Learning (DAL) reduces annotation costs by selecting the most informative unlabeled samples during training. As real-world applications become more complex, challenges stemming from distribution shifts (e.g., open-set recognition) and data imbalance have gained increasing attention, prompting the development of numerous DAL algorithms. However, the lack of a unified platform has hindered fair and systematic evaluation under diverse conditions. Therefore, we present ALScope, a new DAL platform for classification tasks, integrating 10 datasets from computer vision (CV) and natural language processing (NLP), and 21 representative DAL algorithms, including both classical baselines and recent approaches designed to handle challenges such as distribution shifts and data imbalance. This platform supports flexible configuration of key experimental factors, ranging from algorithm and dataset choices to task-specific factors like out-of-distribution (OOD) sample ratio and class imbalance ratio, enabling comprehensive and realistic evaluation. We conduct extensive experiments on this platform under various settings. Our findings show that: (1) DAL algorithms’ performance varies significantly across domains and task settings; (2) in non-standard scenarios such as imbalanced and open-set settings, DAL algorithms show room for improvement and require further investigation; and (3) some algorithms achieve good performance, but require significantly longer selection time.
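For readers new to active learning, the selection step can be illustrated with entropy-based uncertainty sampling, a classical baseline. The sketch below is generic and does not use ALScope's API; the abstract does not list which of the 21 algorithms the toolkit includes, and the function names and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled samples."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Model predictions on four unlabeled samples (hypothetical values).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]

chosen = select_for_labeling(pool_probs, budget=2)
```

The selected samples are sent for annotation, added to the labeled set, and the model is retrained. Sorting the whole pool is cheap here, but the selection-time cost noted in finding (3) comes from methods that need far more computation per round than a single entropy pass.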

[127] Don’t Reach for the Stars: Rethinking Topology for Resilient Federated Learning cs.LG | cs.CV

Mirko Konstantin, Anirban Mukhopadhyay

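TL;DR: The paper proposes LIGHTYEAR, a decentralized federated learning framework that uses a P2P topology to overcome the limitations of the traditional star topology. Each client computes an agreement score on a local validation set to select a personalized set of updates, improving robustness and performance.

Details

Motivation: The star topology of traditional federated learning suffers from a single point of failure, limited personalization, and poor robustness, and it performs poorly with heterogeneous data or malfunctioning clients.

Result: Experiments on two datasets show that the method outperforms centralized baselines and other P2P methods in client-level performance, particularly under adversarial and heterogeneous conditions.

Insight: A decentralized P2P topology combined with local validation and regularization is an effective way to improve the robustness and personalization of federated learning.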

Abstract: Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, poor robustness to distribution shifts, and vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offers clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates. This framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the client's reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across two datasets shows that the proposed approach consistently outperforms both centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.
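The abstract describes the agreement score only at a high level. A minimal sketch of a function-space comparison is given below, assuming the score is the fraction of local validation samples on which an incoming model and the client's reference model predict the same label. That definition, the threshold, and all names are assumptions, and the regularized aggregation step is omitted.

```python
def agreement_score(reference_preds, incoming_preds):
    """Fraction of validation samples where both models predict the same label."""
    matches = sum(r == p for r, p in zip(reference_preds, incoming_preds))
    return matches / len(reference_preds)


def select_updates(reference_preds, peer_preds, threshold=0.7):
    """Keep peers whose predictions agree enough with the reference model.

    peer_preds maps a peer id to that peer model's predictions on the
    client's local validation set. The threshold is a hypothetical setting.
    """
    return [
        peer
        for peer, preds in peer_preds.items()
        if agreement_score(reference_preds, preds) >= threshold
    ]


# Predictions on five local validation samples (hypothetical values).
reference_preds = [0, 1, 1, 2, 0]
peer_preds = {
    "peer_a": [0, 1, 1, 2, 1],  # agrees on 4 of 5 samples
    "peer_b": [2, 2, 0, 0, 1],  # agrees on none, possibly malfunctioning
    "peer_c": [0, 1, 1, 2, 0],  # agrees on all samples
}

trusted = select_updates(reference_preds, peer_preds)
```

Comparing predictions rather than raw parameters is what makes the check meaningful under non-IID data: two models can differ widely in weight space while behaving similarly on the client's own distribution.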

[128] Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning cs.LG | cs.CV

Yue Duan, Taicai Chen, Lei Qi, Yinghuan Shi

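TL;DR: The paper proposes USP, a divide-and-conquer framework that jointly enhances three key aspects of semi-supervised continual learning (SSCL): learning plasticity, unlabeled learning, and memory stability. It addresses each with a dedicated strategy, and experiments show clear gains over existing methods.

Details

Motivation: SSCL combines the benefits of labeled and unlabeled data while handling a continual data stream, but existing methods usually focus on only part of the problem. USP aims to address unlabeled learning, memory stability, and learning plasticity together in a systematic way.

Result: Comprehensive experiments show that USP outperforms prior SSCL methods, with gains of up to 5.94% in last accuracy, validating its effectiveness.

Insight: 1. A divide-and-conquer strategy can coordinate the multiple objectives of SSCL effectively. 2. Feature space reservation and pseudo-labeling can substantially improve the use of unlabeled data. 3. Class-mean-anchored distillation offers a new approach to memory stability.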

Abstract: Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP’s outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 5.94% in the last accuracy, validating its effectiveness. The code is available at https://github.com/NJUyued/USP4SSCL.
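The equiangular tight frame used by Feature Space Reservation can be made concrete with the standard simplex construction, in which K unit vectors have pairwise cosine similarity -1/(K-1). The sketch below builds such a frame and splits it into targets for old classes and slots reserved for future ones. How USP actually assigns and trains toward these vectors is not stated in the abstract, so the split shown is an assumption.

```python
import math


def simplex_etf(num_classes):
    """Rows of sqrt(K / (K - 1)) * (I - (1/K) * ones), a simplex ETF.

    Each row has unit norm and every pair of rows has cosine -1 / (K - 1).
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Five classes in total: three already seen, two expected later (hypothetical).
prototypes = simplex_etf(5)
old_class_targets = prototypes[:3]
reserved_slots = prototypes[3:]
```

Because all prototypes are maximally and equally separated, pulling old-class features toward their targets leaves the reserved directions unoccupied, so new classes can be placed there without crowding existing ones.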

[129] Adapting Vision-Language Models Without Labels: A Comprehensive Survey cs.LG | cs.AI | cs.CV

Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi

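TL;DR: This paper is a comprehensive survey of unsupervised adaptation methods for vision-language models (VLMs). It proposes a taxonomy based on the availability and nature of unlabeled visual data, groups existing methods into four paradigms, and summarizes core methodologies, benchmarks, and future research directions.

Details

Motivation: Although VLMs generalize well across many tasks, their performance is often suboptimal when they are applied directly to downstream scenarios without task-specific adaptation. The paper aims to fill the gap left by the absence of a systematic survey of unsupervised VLM adaptation.

Result: The paper summarizes progress in existing methods, identifies open challenges in the field, and outlines promising directions for future research.

Insight: Unsupervised VLM adaptation is a fast-moving field, and future research needs to address data efficiency, computational overhead, and more complex cross-modal adaptation tasks.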

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs.

Table of Contents

cs.CV [Back]

[1] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis cs.CV

[2] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework cs.CV | cs.AI

[3] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence cs.CV

[4] Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications cs.CV

[5] Revealing Temporal Label Noise in Multimodal Hateful Video Classification cs.CV | cs.AI

[6] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations cs.CV

[7] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens cs.CV | cs.AI | cs.LG

[8] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models cs.CV

[9] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring cs.CV | cs.AI

[10] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics cs.CV | cs.AI

[11] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception cs.CV

[12] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression cs.CV

[13] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion cs.CV

[14] Unified modality separation: A vision-language framework for unsupervised domain adaptation cs.CV

[15] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks cs.CV

[16] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification cs.CV

[17] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression cs.CV | cs.LG | cs.PF

[18] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation cs.CV

[19] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content cs.CV | eess.IV

[20] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes cs.CV | cs.AI | cs.LG

[21] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID cs.CV

[22] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details? cs.CV

[23] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion cs.CV

[24] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks cs.CV | cs.AI | cs.LG | eess.IV

[25] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer cs.CV

[26] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation cs.CV

[27] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning cs.CV

[28] FedGIN: Federated Learning with Dynamic Global Intensity Non-linear Augmentation for Organ Segmentation using Multi-modal Images cs.CV | cs.AI

[29] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models cs.CV

[30] Rotation Equivariant Arbitrary-scale Image Super-Resolution cs.CV

[31] X-MoGen: Unified Motion Generation across Humans and Animals cs.CV

[32] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems cs.CV

[33] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering cs.CV

[34] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images cs.CV

[35] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery cs.CV

[36] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization cs.CV

[37] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation cs.CV | I.2.10

[38] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking cs.CV | cs.AI | cs.LG

[39] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2 cs.CV | 68T45, 94A08 | I.2.10

[40] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models cs.CV

[41] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models cs.CV | cs.AI

[42] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding cs.CV | cs.AI

[43] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion cs.CV | cs.AI

[44] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding cs.CV

[45] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection cs.CV

[46] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs cs.CV | cs.CL

[47] CoCAViT: Compact Vision Transformer with Robust Global Coordination cs.CV

[48] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering cs.CV | cs.AI

[49] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting cs.CV

[50] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering cs.CV

[51] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision cs.CV | cs.CL

[52] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation cs.CV | cs.AI

[53] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency cs.CV | cs.AI | cs.CL

[54] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation cs.CV

[55] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation cs.CV | cs.AI | cs.LG

[56] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization cs.CV | cs.SD | eess.AS

[57] Smoothing Slot Attention Iterations and Recurrences cs.CV

[58] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions cs.CV | cs.AI | cs.LG

[59] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery cs.CV | cs.SY | eess.IV | eess.SY

[60] SMOL-MapSeg: Show Me One Label cs.CV

[61] Symmetry Understanding of 3D Shapes via Chirality Disentanglement cs.CV

[62] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips cs.CV

[63] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events cs.CV

[64] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment cs.CV

[65] Optimal Brain Connection: Towards Efficient Structural Pruning cs.CV

[66] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework cs.CV

[67] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety cs.CV | I.2.10; I.2.7; H.3.3; H.4.3; K.4.1

[68] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis cs.CV

[69] DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition cs.CV

[70] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction cs.CV

[71] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model cs.CV

[72] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity cs.CV

[73] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes cs.CV

[74] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing cs.CV

cs.CL [Back]

[75] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM cs.CL | cs.AI | cs.SD | eess.AS

[76] Pitch Accent Detection improves Pretrained Automatic Speech Recognition cs.CL | cs.SD | eess.AS

[77] Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History cs.CL | cs.AI